Beating GPT-4 on HumanEval with a Fine-Tuned CodeLlama-34B
We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an internal Phind dataset. The resulting models achieved 67.6% and 69.5% pass@1 on HumanEval, respectively. For comparison, GPT-4 achieved 67% according to OpenAI's official technical report in March. To ensure the results are valid, we applied OpenAI's decontamination methodology to our dataset.
The CodeLlama models released yesterday demonstrate impressive performance on HumanEval.
CodeLlama-34B achieved 48.8% pass@1 on HumanEval
CodeLlama-34B-Python achieved 53.7% pass@1 on HumanEval
We have fine-tuned both models on a proprietary dataset of ~80k high-quality programming problems and solutions. Instead of code completion examples, this dataset features instruction-answer pairs, which sets it apart structurally from HumanEval. We trained the Phind models for two epochs, for a total of ~160k examples. LoRA was not used; both models were natively fine-tuned. We employed DeepSpeed ZeRO 3 and Flash Attention 2 to train these models in three hours on 32 A100-80GB GPUs, with a sequence length of 4096 tokens.
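As a quick sanity check on the figures above, the training budget can be worked out in a few lines. This is only a back-of-the-envelope sketch: the inputs are the approximate numbers quoted in this post, and the token count is an upper bound that assumes every example fills the full 4096-token sequence, which is almost certainly not the case.

```python
# Approximate figures quoted in the post.
DATASET_EXAMPLES = 80_000   # ~80k instruction-answer pairs
EPOCHS = 2
NUM_GPUS = 32               # A100-80GB
WALL_CLOCK_HOURS = 3
MAX_SEQ_LEN = 4096          # tokens

# Examples processed across both epochs.
examples_seen = DATASET_EXAMPLES * EPOCHS

# Total GPU time consumed by the run.
gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS

# Upper bound on tokens seen, ASSUMING every example is padded or packed
# to the full sequence length (an assumption, not a reported figure).
max_tokens_seen = examples_seen * MAX_SEQ_LEN

print(f"examples seen:      {examples_seen:,}")
print(f"GPU-hours:          {gpu_hours}")
print(f"tokens seen (max):  {max_tokens_seen:,}")
```

So the run amounts to roughly 96 A100 GPU-hours, with at most about 655M tokens processed under the full-sequence assumption.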
Furthermore, we applied OpenAI's decontamination methodology to our dataset to ensure valid results, and found no contaminated examples. The methodology is:
For each evaluation example, we randomly sampled three substrings of 50 characters or used the entire example if it was fewer than 50 characters. A match was identified if any sampled substring was a substring of the processed training example.
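The check described above can be sketched roughly as follows. This is an illustration, not our actual pipeline: the function names are hypothetical, and the `normalize` step (keeping only letters and digits) is an assumption modeled on the preprocessing described in OpenAI's report.

```python
import random

SUBSTRING_LEN = 50  # characters per sampled substring (from the methodology)
NUM_SAMPLES = 3     # substrings sampled per evaluation example


def normalize(text: str) -> str:
    """ASSUMED preprocessing: drop spaces and symbols, keep letters and digits."""
    return "".join(ch for ch in text if ch.isalnum())


def sample_substrings(example: str, rng: random.Random) -> list[str]:
    """Sample NUM_SAMPLES random substrings, or return the whole example if short."""
    if len(example) <= SUBSTRING_LEN:
        return [example] if example else []
    last_start = len(example) - SUBSTRING_LEN
    starts = [rng.randint(0, last_start) for _ in range(NUM_SAMPLES)]
    return [example[s:s + SUBSTRING_LEN] for s in starts]


def find_contaminated(eval_examples, training_examples, seed=0):
    """Return indices of evaluation examples that match any training example."""
    rng = random.Random(seed)
    processed_training = [normalize(t) for t in training_examples]
    flagged = []
    for idx, example in enumerate(eval_examples):
        probes = sample_substrings(normalize(example), rng)
        # A match is any sampled substring occurring inside a training example.
        if any(p in t for p in probes for t in processed_training):
            flagged.append(idx)
    return flagged
```

Calling `find_contaminated(humaneval_prompts, training_texts)` with two lists of strings (both names hypothetical) would return the indices of flagged evaluation examples. An empty list corresponds to the "no contaminated examples" outcome reported above. Because the substrings are sampled at random, this check is probabilistic and can miss partial overlaps.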
For further insights on the decontamination methodology, please refer to Appendix C of OpenAI's technical report.
Presented below are the pass@1 scores we achieved with our fine-tuned models:
Phind-CodeLlama-34B-v1 achieved 67.6% pass@1 on HumanEval
Phind-CodeLlama-34B-Python-v1 achieved 69.5% pass@1 on HumanEval
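For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: for a problem with n sampled solutions of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over all problems. The sketch below illustrates that formula. The function names are our own for illustration, and it is not a description of the exact evaluation harness or sampling settings behind the scores above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, so with a single sample per problem pass@1 is simply the fraction of problems whose generated solution passes all tests.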