Key Takeaways
- DeepSeek Coder is an open source code language model trained on 2 trillion tokens, of which 87% is code and 13% is natural language in English and Chinese.
- The evidence favors it outperforming other open source models on multiple benchmarks (e.g., HumanEval, MBPP, DS-1000); model sizes range from 1.3B to 33B parameters.
- Its instruction-tuned version appears to outperform GPT-3.5 Turbo on HumanEval, but how it compares with more recent closed-source models such as GPT-4o is unclear.
- An open question is whether its performance in real coding scenarios and its natural language capabilities hold up; both still need further verification.
Overview
Model Features
- Massive training data: 2 trillion tokens, of which 87% is code and 13% is natural language in English and Chinese, covering many programming languages (such as Python, Java, and C++).
- Flexible and scalable: models are provided in 1.3B, 5.7B, 6.7B, and 33B parameter sizes, so users can choose according to their hardware and needs.
- Strong benchmark performance: on HumanEval, MBPP, and DS-1000, the 33B base model significantly outperforms the open source competitor CodeLlama-34B, and the instruction-tuned version even outperforms GPT-3.5 Turbo on HumanEval.
- Practical application support: project-level code completion and insertion with a context window of up to 16K tokens, suitable for larger code bases.
Potential for improvement
- Comparison with closed-source models: direct comparisons with recent closed-source models (such as GPT-4o or Claude 3 Sonnet) are lacking, so the performance gap is uncertain.
- Real-world performance: despite strong benchmark results, performance in real coding environments (such as complex projects or multi-person collaboration) still needs to be verified.
- Natural language capabilities: since natural language data accounts for only 13% of training, the model may handle complex natural language instructions less well, especially in languages other than English and Chinese.
An unexpected detail
According to the project's own evaluation, the much smaller DeepSeek-Coder-Base-7B already reaches the performance of CodeLlama-34B, a model roughly five times its size.
Detailed Report
DeepSeek Coder is an open source code language model developed by DeepSeek that aims to improve code generation, completion, and insertion through large-scale training data and a range of model sizes. The detailed analysis below covers its training data, performance evaluation, usage, and potential room for improvement, aiming to reflect both its strengths and its limitations.
Training data and model design
DeepSeek Coder is trained on a total of 2 trillion tokens, consisting of 87% code and 13% natural language in English and Chinese. This dataset was built by collecting code from GitHub and applying the same filtering rules as StarCoder Data. Data quality is then improved by parsing file dependencies within each repository, concatenating related files, and applying repo-level MinHash deduplication; a further filtering pass removes low-quality code, such as files with syntax errors or poor readability.
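To make the deduplication step concrete, the toy sketch below illustrates repo-level MinHash in isolation: each repository's concatenated code is reduced to a signature of per-seed minimum hashes, and two repositories whose signatures agree in most positions are treated as near-duplicates. This is a generic illustration, not DeepSeek's actual pipeline; the shingle size, the number of hash functions, and the 0.85 similarity threshold are assumptions.
```python
# Toy illustration of repo-level MinHash deduplication.
# Not DeepSeek's actual pipeline: shingle size, number of hash functions,
# and the similarity threshold below are assumptions for illustration only.
import hashlib

NUM_HASHES = 128   # number of "permutations" in each signature
SHINGLE_SIZE = 5   # tokens per shingle

def shingles(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    """Split a repository's concatenated code into overlapping k-token shingles."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def minhash_signature(text: str) -> list[int]:
    """For each seed, keep the minimum hash value over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / NUM_HASHES

# Two "repositories" whose concatenated files overlap almost entirely.
repo_a = "\n".join(f"def f{i}(x): return x * {i}" for i in range(200))
repo_b = repo_a + "\ndef extra_helper(x): return x + 1"
if estimated_jaccard(minhash_signature(repo_a), minhash_signature(repo_b)) > 0.85:
    print("near-duplicate repositories: keep one copy, drop the other")
```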
Training proceeds in stages: the model is first pre-trained on 1.8 trillion tokens with a 4K window, using a mix of 87% code, 10% code-related language (such as GitHub Markdown and StackExchange), and 3% non-code-related Chinese; the window is then extended to 16K and pre-training continues on an additional 200 billion tokens, yielding the base models (DeepSeek-Coder-Base). Finally, fine-tuning on 2 billion tokens of instruction data produces the instruction-tuned models (DeepSeek-Coder-Instruct).
The model is available in multiple sizes (1.3B, 5.7B, 6.7B, and 33B parameters) to suit different hardware and requirements. The list of supported programming languages is extensive, covering more than 80 languages in total, from common ones such as Python, Java, and C++ to less common ones such as Ada and Agda.
Performance Evaluation
Benchmark | DeepSeek-Coder-Base-33B | Lead over CodeLlama-34B
---|---|---
HumanEval Python | Outperforms CodeLlama-34B | 7.9%
HumanEval Multilingual | Outperforms CodeLlama-34B | 9.3%
MBPP | Outperforms CodeLlama-34B | 10.8%
DS-1000 | Outperforms CodeLlama-34B | 5.9%
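The scores behind this table are pass@1 results (as noted in the project README quoted below): a problem counts as solved when a generated sample passes its unit tests. For background on the metric, the snippet below shows the standard unbiased pass@k estimator popularized by the HumanEval paper; it is reference code for the metric, not code from the DeepSeek Coder repository.
```python
# Standard unbiased pass@k estimator (Chen et al., 2021, "Evaluating Large
# Language Models Trained on Code"). Reference code for the metric, not taken
# from the DeepSeek Coder repository.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated for a problem, c: samples passing the tests, k: budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With a single sample per problem, pass@1 reduces to the plain pass rate:
print(pass_at_k(n=1, c=1, k=1))  # 1.0 -> solved
print(pass_at_k(n=1, c=0, k=1))  # 0.0 -> unsolved
```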
In addition, the model handles project-level code completion and insertion with a context window of up to 16K tokens, allowing it to work with larger code bases. Examples include code completion (such as generating a quick sort algorithm), code insertion (such as completing a loop body), and chat model inference (generating code from natural language instructions).
How to use
- Code completion: load the model through the Transformers library and prompt it with `#write a quick sort algorithm` to generate a Python quick sort implementation (see the sketch after this list).
- Code insertion: use the `<｜fim▁begin｜>`, `<｜fim▁hole｜>`, and `<｜fim▁end｜>` sentinels to fill in a blank inside existing code.
- Chat model inference: the instruction-tuned model accepts natural language instructions, such as "write a quick sort algorithm in Python", and generates the corresponding code.
- Repository-level code completion: handles multi-file dependencies, such as completing the training and evaluation logic in `main.py` by calling functions defined in `utils.py` and `model.py`.
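The sketch below ties the first three usage modes together using the Hugging Face Transformers API. The checkpoint IDs follow the published Hugging Face releases; the generation settings (dtype, device placement, max_new_tokens) are illustrative assumptions rather than the project's reference configuration.
```python
# Sketch of code completion, code insertion (fill-in-the-middle), and chat
# inference with DeepSeek Coder via Hugging Face Transformers. Checkpoint IDs
# follow the published releases; dtype/device/max_new_tokens are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()

# 1) Code completion: the base model continues a plain prompt.
prompt = "#write a quick sort algorithm"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# 2) Code insertion: sentinels mark the prefix, the hole to fill, and the suffix.
fim_prompt = """<｜fim▁begin｜>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<｜fim▁hole｜>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<｜fim▁end｜>"""
inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True))

# 3) Chat inference with the instruction-tuned model and its chat template.
chat_id = "deepseek-ai/deepseek-coder-6.7b-instruct"
chat_tokenizer = AutoTokenizer.from_pretrained(chat_id, trust_remote_code=True)
chat_model = AutoModelForCausalLM.from_pretrained(
    chat_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()
messages = [{"role": "user", "content": "write a quick sort algorithm in python."}]
chat_inputs = chat_tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(chat_model.device)
outputs = chat_model.generate(
    chat_inputs, max_new_tokens=256, eos_token_id=chat_tokenizer.eos_token_id
)
print(chat_tokenizer.decode(outputs[0][len(chat_inputs[0]):], skip_special_tokens=True))
```
Swapping in another published size (for example the 1.3B or 33B checkpoints) only changes the model ID; the calling code stays the same.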
Potential for improvement
- Comparison with closed-source models: direct comparisons with the latest closed-source models (such as GPT-4o or Claude 3 Sonnet, as of March 2025) are still lacking, so the performance gap is uncertain. Reports suggest that DeepSeek-Coder-V2 is comparable to these models on some tests, but the original DeepSeek Coder discussed here may still lag behind.
- Performance in real-world scenarios: benchmarks such as HumanEval and MBPP mainly measure code generation, so behavior in real coding environments (such as complex projects or multi-person collaboration) still needs validation. The evidence suggests it performs well on project-level tasks, but documented real-world use cases would be a valuable supplement.
- Natural language capabilities: since natural language data accounts for only 13% of training and is mainly English and Chinese, performance on instructions in other languages may be limited, which matters for global users.
- Efficiency and environmental impact: training models at this scale consumes substantial computing resources, and the environmental impact deserves attention; more efficient training methods could be explored in the future to reduce energy consumption.
License and Resources
The code repository is released under the MIT License, while use of the DeepSeek Coder models is governed by DeepSeek's Model License, which permits commercial use. Available resources include the project homepage, an online chat demo, model downloads on Hugging Face, and community channels on Discord and WeChat (see the links quoted below).
In conclusion
DeepSeek Coder is a capable, openly available family of code models: it leads open source models on the reported benchmarks and, after instruction tuning, surpasses GPT-3.5 Turbo on HumanEval. At the same time, comparisons with the newest closed-source models, robustness in real-world projects, and support for languages beyond English and Chinese remain open questions.
Key Quotes
[Homepage] | [🤖 Chat with DeepSeek Coder] | [🤗 Models Download] | [Discord] | [WeChat (微信)]
DeepSeek Coder is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on project-level code corpus by employing a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. For coding capabilities, DeepSeek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.
Massive Training Data: Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
Highly Flexible & Scalable: Offered in model sizes of 1B, 5.7B, 6.7B and 33B, enabling users to choose the setup most suitable for their requirements.
Superior Model Performance: State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
Advanced Code Completion Capabilities: A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.
['ada', 'agda', 'alloy', 'antlr', 'applescript', 'assembly', 'augeas', 'awk', 'batchfile', 'bluespec', 'c', 'c-sharp', 'clojure', 'cmake', 'coffeescript', 'common-lisp', 'cpp', 'css', 'cuda', 'dart', 'dockerfile', 'elixir', 'elm', 'emacs-lisp', 'erlang', 'f-sharp', 'fortran', 'glsl', 'go', 'groovy', 'haskell', 'html', 'idris', 'isabelle', 'java', 'java-server-pages', 'javascript', 'json', 'julia', 'jupyter-notebook', 'kotlin', 'lean', 'literate-agda', 'literate-coffeescript', 'literate-haskell', 'lua', 'makefile', 'maple', 'markdown', 'mathematica', 'matlab', 'ocaml', 'pascal', 'perl', 'php', 'powershell', 'prolog', 'protocol-buffer', 'python', 'r', 'racket', 'restructuredtext', 'rmarkdown', 'ruby', 'rust', 'sas', 'scala', 'scheme', 'shell', 'smalltalk', 'solidity', 'sparql', 'sql', 'stan', 'standard-ml', 'stata', 'systemverilog', 'tcl', 'tcsh', 'tex', 'thrift', 'typescript', 'verilog', 'vhdl', 'visual-basic', 'xslt', 'yacc', 'yaml', 'zig']
We evaluate DeepSeek Coder on various coding-related benchmarks. Only pass@1 results on HumanEval (Python and Multilingual), MBPP, and DS-1000 are reported here:
The result shows that DeepSeek-Coder-Base-33B significantly outperforms existing open-source code LLMs. Compared with CodeLlama-34B, it leads by 7.9%, 9.3%, 10.8% and 5.9% respectively on HumanEval Python, HumanEval Multilingual, MBPP and DS-1000. Surprisingly, our DeepSeek-Coder-Base-7B reaches the performance of CodeLlama-34B. The DeepSeek-Coder-Instruct-33B model after instruction tuning outperforms GPT35-turbo on HumanEval and achieves comparable results with GPT35-turbo on MBPP.
More evaluation details can be found in the Detailed Evaluation.
- Step 1: Collect code data from GitHub and apply the same filtering rules as StarCoder Data to filter data.
- Step 2: Parsing the dependencies of files within the same repository to rearrange the file positions based on their dependencies.
- Step 3: Concatenating dependent files to form a single example and employ repo-level minhash for deduplication.
- Step 4: Further filtering out low-quality code, such as codes with syntax errors or poor readability.
- Step 1: Initially pre-trained with a dataset consisting of 87% code, 10% code-related language (Github Markdown and StackExchange), and 3% non-code-related Chinese language. Models are pre-trained using 1.8T tokens and a 4K window size in this step.
- Step 2: Further Pre-training using an extended 16K window size on an additional 200B tokens, resulting in foundational models (DeepSeek-Coder-Base).
- Step 3: Instruction Fine-tuning on 2B tokens of instruction data, resulting in instruction-tuned models (DeepSeek-Coder-Instruct).