DeepSeek Coder: A comprehensive evaluation guide for open source code models

 Key Takeaways

  • DeepSeek Coder is an open-source code language model trained on 2 trillion tokens, of which 87% is code and 13% is natural language in English and Chinese.
  • The evidence suggests it outperforms other open-source models on multiple benchmarks (e.g. HumanEval, MBPP, DS-1000), with model sizes ranging from 1.3B to 33B parameters.
  • Its instruction-tuned version appears to outperform GPT-3.5 Turbo on HumanEval, but how it compares with recent closed-source models such as GPT-4o remains unclear.
  • Its performance in real-world coding scenarios and its natural-language handling may be limited and still need further verification.

Overview

DeepSeek Coder is an open-source code language model developed by DeepSeek that aims to improve code generation and completion. The following is a brief analysis of its main features and potential areas for improvement, written for general readers.

Model Features


  • Massive training data: the training set contains 2 trillion tokens, of which 87% is code and 13% is natural language in English and Chinese, covering many programming languages (such as Python, Java, and C++).
  • Flexible and scalable: models are offered in several sizes (1.3B, 5.7B, 6.7B, and 33B), so users can choose according to their hardware and needs.
  • Strong benchmark performance: on HumanEval, MBPP, and DS-1000, the 33B base model significantly outperforms the open-source competitor CodeLlama-34B, and the instruction-tuned version even outperforms GPT-3.5 Turbo on HumanEval.
  • Practical application support: supports project-level code completion and insertion with a window size of up to 16K, suitable for larger codebases.

Potential for improvement


  • Comparison with closed-source models: there is currently no direct comparison with recent closed-source models (such as GPT-4o or Claude 3 Sonnet), so the performance gap is uncertain.
  • Real-world performance: although benchmark results are strong, performance in real coding environments (such as complex projects or multi-developer collaboration) still needs verification.
  • Natural-language capabilities: since natural-language data makes up only 13% of training and covers only English and Chinese, handling of complex natural-language instructions may be limited, especially in other languages.

An unexpected detail


Although DeepSeek Coder focuses on code generation, its instruction-tuned version can also handle chat-style tasks, such as generating code from natural-language instructions, which gives users additional flexibility.

Detailed Report

DeepSeek Coder is an open-source code language model developed by DeepSeek that aims to improve code generation, completion, and insertion through large-scale training data and multiple model sizes. The following detailed analysis covers training data, performance evaluation, usage, and potential areas for improvement, aiming to reflect both its strengths and its limitations.

Training data and model design


DeepSeek Coder was trained on a total of 2 trillion tokens, consisting of 87% code and 13% natural language in English and Chinese. The dataset was built by collecting code from GitHub and applying the same filtering rules as StarCoder Data. File dependencies within each repository were then parsed so that related files could be concatenated into single examples, repo-level MinHash was used to remove near-duplicates, and a further filtering pass eliminated low-quality code such as files with syntax errors or poor readability.
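To make the deduplication step more concrete, here is a minimal sketch of repo-level near-duplicate detection with MinHash and LSH using the datasketch library. This only illustrates the general technique, not DeepSeek's actual pipeline; the shingle size and similarity threshold are assumptions.

```python
# Illustrative sketch of repo-level near-duplicate removal with MinHash + LSH
# (datasketch library). Shingle size and threshold are assumptions, not
# DeepSeek's actual settings.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(1, len(tokens) - shingle + 1)):
        m.update(" ".join(tokens[i:i + shingle]).encode("utf-8"))
    return m

# Each entry stands for the concatenated, dependency-ordered files of one repo.
repos = {
    "repo_a": "def quick_sort(arr): return sorted(arr)",
    "repo_b": "def quick_sort(arr): return sorted(arr)",  # near-duplicate of repo_a
    "repo_c": "class Trainer: pass",
}

lsh = MinHashLSH(threshold=0.85, num_perm=128)
kept = []
for name, text in repos.items():
    mh = minhash_of(text)
    if lsh.query(mh):       # a similar repo was already kept, so drop this one
        continue
    lsh.insert(name, mh)
    kept.append(name)

print(kept)  # ['repo_a', 'repo_c']
```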
Training proceeded in stages: first, initial pre-training with a 4K window size on 1.8 trillion tokens, using a mixture of 87% code, 10% code-related natural language (such as GitHub Markdown and StackExchange), and 3% non-code Chinese; then the window size was extended to 16K and pre-training continued on an additional 200 billion tokens, producing the base models (DeepSeek-Coder-Base). Finally, fine-tuning on 2B tokens of instruction data produced the instruction-tuned models (DeepSeek-Coder-Instruct).
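The same schedule can be summarized as a small configuration sketch; the field names below are illustrative placeholders, not DeepSeek's actual training code.

```python
# Illustrative summary of the staged training schedule described above.
# Field names are placeholders, not DeepSeek's real configuration format.
TRAINING_STAGES = [
    {"stage": "pre-training", "tokens": 1_800_000_000_000, "context_len": 4_096,
     "data_mix": {"code": 0.87, "code_related_language": 0.10, "chinese": 0.03}},
    {"stage": "long-context pre-training", "tokens": 200_000_000_000,
     "context_len": 16_384, "produces": "DeepSeek-Coder-Base"},
    {"stage": "instruction fine-tuning", "tokens": 2_000_000_000,
     "produces": "DeepSeek-Coder-Instruct"},
]
```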

The model is available in multiple sizes, including 1.3B, 5.7B, 6.7B, and 33B parameters, to suit different hardware and requirements. The list of supported programming languages is extensive, covering more than 80 languages in total, from common ones such as Python, Java, and C++ to less common ones such as Ada and Agda.

Performance Evaluation


DeepSeek Coder performs well in multiple coding-related benchmarks. Here is a summary of some pass@1 results:
Lead of DeepSeek-Coder-Base-33B over CodeLlama-34B (pass@1):
  • HumanEval Python: +7.9%
  • HumanEval Multilingual: +9.3%
  • MBPP: +10.8%
  • DS-1000: +5.9%
It is particularly noteworthy that DeepSeek-Coder-Base-7B matches the performance of CodeLlama-34B, while DeepSeek-Coder-Instruct-33B outperforms GPT-3.5 Turbo on HumanEval and achieves comparable results on MBPP. These results place it among the leading open-source code models.
In addition, the model performs well on project-level code completion and insertion tasks, with a window size of up to 16K that supports larger codebases. Examples include code completion (such as generating a quick-sort algorithm), code insertion (such as filling in a loop body), and chat-style inference (such as generating code from natural-language instructions).
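For readers unfamiliar with the metric, pass@1 is the fraction of problems for which a generated sample passes the unit tests. The sketch below shows the standard unbiased pass@k estimator used by HumanEval-style evaluations (from the original HumanEval paper); it is provided for context and is not part of DeepSeek Coder itself.

```python
# Standard unbiased pass@k estimator used by HumanEval-style benchmarks
# (Chen et al., 2021). Shown here for context only.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given n generated samples per problem of which c are correct:
    1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 200 samples for one problem, 37 of them correct.
print(pass_at_k(n=200, c=37, k=1))  # 0.185, i.e. estimated pass@1 of 18.5%
```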

How to use


Before using DeepSeek Coder, install the dependencies with pip install -r requirements.txt. An online demo is available on Hugging Face Spaces, and the demo can also be run locally via app.py in the demo folder. Usage examples include:
  1. Code completion: load the model through the Transformers library and enter a prompt such as "# write a quick sort algorithm" to generate a Python quick-sort implementation (see the sketch after this list).
  2. Code insertion: use the <｜fim▁begin｜> and <｜fim▁hole｜> tags to fill in blanks within existing code.
  3. Chat-style inference: give natural-language instructions, such as "write a quick sort algorithm in Python", and the model generates the corresponding code.
  4. Repository-level code completion: handles multi-file dependencies, such as completing the training and evaluation logic in main.py by calling functions from utils.py and model.py.
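Below is a minimal sketch of the code-completion workflow with the Hugging Face Transformers library. The model ID, dtype, and generation settings are illustrative assumptions; adjust them to your hardware.

```python
# Minimal sketch: local code completion with Transformers.
# The model ID and generation settings are assumptions; adapt as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support
    trust_remote_code=True,
).cuda()

prompt = "# write a quick sort algorithm\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```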
In addition, users can fine-tune the model with the provided script finetune/finetune_deepseekcoder.py, which supports DeepSpeed acceleration. Training data must be prepared in JSON format with instruction and output fields (see the sketch below).
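Here is a minimal sketch of preparing such a training file. The file name and example records are placeholders; check finetune/finetune_deepseekcoder.py for the exact format it expects.

```python
# Minimal sketch: write fine-tuning data as JSON records with "instruction"
# and "output" fields. File name and examples are placeholders; verify the
# exact format against finetune/finetune_deepseekcoder.py.
import json

records = [
    {
        "instruction": "Write a Python function that reverses a string.",
        "output": "def reverse_string(s: str) -> str:\n    return s[::-1]",
    },
    {
        "instruction": "Write a SQL query that counts the rows in table users.",
        "output": "SELECT COUNT(*) FROM users;",
    },
]

with open("train_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```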

Potential for improvement


Although DeepSeek Coder performs well in the open source field, there is still room for improvement:
  • Comparison with closed-source models: there are currently no direct comparisons with the latest closed-source models (such as GPT-4o or Claude 3 Sonnet as of March 2025), so the performance gap is uncertain. Reports suggest that DeepSeek-Coder-V2 is comparable to these models on some tests, but the original DeepSeek Coder may still lag behind.
  • Performance in real-world scenarios: benchmarks such as HumanEval and MBPP mainly evaluate code generation, so performance in real coding environments (such as complex projects or multi-developer collaboration) needs more validation. The evidence suggests it does well on project-level tasks, but documented real-world case studies are still lacking.
  • Natural-language capabilities: since natural-language data accounts for only 13% of training and is mainly English and Chinese, performance on instructions in other languages may be limited, which matters for global users.
  • Efficiency and environmental impact: training such a large model consumes substantial computing resources, and its environmental impact deserves attention; more efficient training methods could be explored in the future to reduce energy consumption.

License and Resources


The code repository is released under the MIT license, and use of the model is governed by the Model License, which permits commercial use. More resources are available on the DeepSeek Coder GitHub, including quantization support (such as GGUF and GPTQ) and the community project list awesome-deepseek-coder.

In conclusion


DeepSeek Coder is a powerful open-source code model, particularly suitable for users who need code generation and completion. It outperforms many open-source competitors on benchmarks and offers flexible model sizes and usage options. However, there is still room for improvement in comparisons with the latest closed-source models, real-world performance, and natural-language capabilities. Looking ahead, we hope the DeepSeek team continues to innovate and to bring the model into broader real-world use.

Key Quotes

DeepSeek Coder

[ Homepage] | [🤖 Chat with DeepSeek Coder] | [🤗 Models Download] | [Discord] | [WeChat (微信)]

Paper Link👁️


1. Introduction of DeepSeek Coder

DeepSeek Coder is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on project-level code corpus by employing a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. For coding capabilities, DeepSeek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

[Figure: benchmark results overview]

  • Massive Training Data: Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.

  • Highly Flexible & Scalable: Offered in model sizes of 1B, 5.7B, 6.7B and 33B, enabling users to choose the setup most suitable for their requirements.

  • Superior Model Performance: State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.

  • Advanced Code Completion Capabilities: A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.

Supported Programming Languages

['ada', 'agda', 'alloy', 'antlr', 'applescript', 'assembly', 'augeas', 'awk', 'batchfile', 'bluespec', 'c', 'c-sharp', 'clojure', 'cmake', 'coffeescript', 'common-lisp', 'cpp', 'css', 'cuda', 'dart', 'dockerfile', 'elixir', 'elm', 'emacs-lisp', 'erlang', 'f-sharp', 'fortran', 'glsl', 'go', 'groovy', 'haskell', 'html', 'idris', 'isabelle', 'java', 'java-server-pages', 'javascript', 'json', 'julia', 'jupyter-notebook', 'kotlin', 'lean', 'literate-agda', 'literate-coffeescript', 'literate-haskell', 'lua', 'makefile', 'maple', 'markdown', 'mathematica', 'matlab', 'ocaml', 'pascal', 'perl', 'php', 'powershell', 'prolog', 'protocol-buffer', 'python', 'r', 'racket', 'restructuredtext', 'rmarkdown', 'ruby', 'rust', 'sas', 'scala', 'scheme', 'shell', 'smalltalk', 'solidity', 'sparql', 'sql', 'stan', 'standard-ml', 'stata', 'systemverilog', 'tcl', 'tcsh', 'tex', 'thrift', 'typescript', 'verilog', 'vhdl', 'visual-basic', 'xslt', 'yacc', 'yaml', 'zig']

2. Evaluation Results

We evaluate DeepSeek Coder on various coding-related benchmarks. Only pass@1 results on HumanEval (Python and Multilingual), MBPP, and DS-1000 are reported here:

[Table: pass@1 results of DeepSeek-Coder and other code models]

The result shows that DeepSeek-Coder-Base-33B significantly outperforms existing open-source code LLMs. Compared with CodeLlama-34B, it leads by 7.9%, 9.3%, 10.8% and 5.9% respectively on HumanEval Python, HumanEval Multilingual, MBPP and DS-1000. Surprisingly, our DeepSeek-Coder-Base-7B reaches the performance of CodeLlama-34B. The DeepSeek-Coder-Instruct-33B model after instruction tuning outperforms GPT-3.5 Turbo on HumanEval and achieves comparable results with GPT-3.5 Turbo on MBPP.

More evaluation details can be found in the Detailed Evaluation.

3. Procedure of Data Creation and Model Training

Data Creation

  • Step 1: Collect code data from GitHub and apply the same filtering rules as StarCoder Data to filter data.
  • Step 2: Parsing the dependencies of files within the same repository to rearrange the file positions based on their dependencies.
  • Step 3: Concatenating dependent files to form a single example and employ repo-level minhash for deduplication.
  • Step 4: Further filtering out low-quality code, such as codes with syntax errors or poor readability.

[Figure: data creation pipeline]

Model Training

  • Step 1: Initially pre-trained with a dataset consisting of 87% code, 10% code-related language (Github Markdown and StackExchange), and 3% non-code-related Chinese language. Models are pre-trained using 1.8T tokens and a 4K window size in this step.
  • Step 2: Further Pre-training using an extended 16K window size on an additional 200B tokens, resulting in foundational models (DeepSeek-Coder-Base).
  • Step 3: Instruction Fine-tuning on 2B tokens of instruction data, resulting in instruction-tuned models (DeepSeek-Coder-Instruct).

