Key Takeaways
- Research shows that DeepSeek LLM is an open source language model with 67 billion parameters (a 7B variant is also available), trained on 2 trillion tokens covering English and Chinese.
- The evidence points to strong performance in reasoning, coding, math, and Chinese comprehension, with particular strengths in the HumanEval and GSM8K benchmarks.
- It seems possible that its chat version outperforms GPT-3.5 on some tasks, especially in Chinese processing.
- There are caveats, however: the training data may contain biases, and the model's size may limit its use on resource-constrained devices.
Introduction
What is DeepSeek LLM?
DeepSeek LLM is an open source language model developed by DeepSeek AI with 67 billion parameters, trained on 2 trillion tokens covering English and Chinese. It is available in base and chat versions, is designed to support research and commercial applications, and performs particularly well on reasoning, coding, and mathematical tasks.
Performance highlights
The study shows that the DeepSeek LLM 67B base version outperforms the Llama2 70B base version on multiple benchmarks, especially Chinese comprehension and coding tasks. The chat version achieves a pass@1 score of 73.8% on HumanEval and 84.1% on GSM8K (0-shot), showing strong math and coding capabilities. In addition, it scores 65 on the Hungarian National High School Exam, demonstrating excellent generalization.
Users can download DeepSeek LLM through Hugging Face and use the Transformers library for text completion and chat interaction. Installation only requires running pip install -r requirements.txt on Python 3.8 or above.
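As a rough illustration, a minimal text-completion call through the Transformers library might look like the following; the model id is the published 7B base checkpoint on Hugging Face, while the prompt and generation settings are arbitrary choices for this sketch:

```python
# Minimal sketch: text completion with the 7B base checkpoint via Transformers.
# Requires torch, transformers, and accelerate; a GPU is strongly recommended.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-llm-7b-base"  # the 67B and chat variants follow the same pattern

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory
    device_map="auto",           # place layers on available GPU(s) automatically
)

prompt = "The main difference between multi-head attention and grouped-query attention is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy completion of up to 100 new tokens.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```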
An unexpected detail
Although DeepSeek LLM performs well among open source models, its chat version even surpasses GPT-3.5 in Chinese processing, which is rare among multilingual models.
Comprehensive evaluation of DeepSeek LLM capabilities
Introduction and Background
Model details
- Number of parameters: 67 billion (a 7 billion parameter version is also available)
- Training data: 2 trillion tokens covering internet text, math, code, books, and self-collected data; crawling respects robots.txt, and personal-privacy and copyrighted content is removed
- Sequence length: 4096
- Architecture: Autoregressive Transformer Decoder model, similar to LLaMA, 7B model uses Multi-Head Attention (MHA), 67B model uses Grouped Query Attention (GQA)
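The difference between the two attention variants is easy to see in code: in GQA, several query heads share a single key/value head, which shrinks the key/value cache at inference time. A minimal sketch follows; the head counts are illustrative, not the model's actual configuration, and the causal mask is omitted for brevity:

```python
# Simplified sketch of Grouped Query Attention (GQA): n_kv_heads < n_q_heads,
# and each key/value head is shared by a group of query heads.
# Head counts are illustrative, not DeepSeek LLM's actual configuration.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2):
    # q: (batch, seq, n_q_heads * head_dim); k, v: (batch, seq, n_kv_heads * head_dim)
    b, t, _ = q.shape
    head_dim = q.shape[-1] // n_q_heads
    q = q.view(b, t, n_q_heads, head_dim).transpose(1, 2)    # (b, Hq, t, d)
    k = k.view(b, t, n_kv_heads, head_dim).transpose(1, 2)   # (b, Hkv, t, d)
    v = v.view(b, t, n_kv_heads, head_dim).transpose(1, 2)

    # Each KV head serves n_q_heads // n_kv_heads query heads.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)                    # (b, Hq, t, d)
    v = v.repeat_interleave(group, dim=1)

    attn = F.softmax((q @ k.transpose(-2, -1)) / head_dim**0.5, dim=-1)  # causal mask omitted
    out = attn @ v                                           # (b, Hq, t, d)
    return out.transpose(1, 2).reshape(b, t, n_q_heads * head_dim)

# MHA is the special case n_kv_heads == n_q_heads; GQA trades a small quality
# cost for a KV cache that is n_q_heads / n_kv_heads times smaller.
```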
Performance Evaluation
| Model | HellaSwag | TriviaQA | MMLU | GSM8K | HumanEval | BBH | C-Eval | CMMLU | ChineseQA |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 75.6 | 63.8 | 45.8 | 15.5 | 14.6 | 38.5 | 33.9 | 32.6 | 21.5 |
| LLaMA-2 70B | 84.0 | 79.5 | 69.0 | 58.4 | 28.7 | 62.9 | 51.4 | 53.1 | 50.2 |
| DeepSeek LLM 7B Base | 75.4 | 59.7 | 48.2 | 17.4 | 26.2 | 39.5 | 45.0 | 47.2 | 78.0 |
| DeepSeek LLM 67B Base | 84.0 | 78.9 | 71.3 | 63.4 | 42.7 | 68.7 | 66.1 | 70.8 | 87.6 |
| Model | TriviaQA | MMLU | GSM8K | HumanEval | BBH | C-Eval | CMMLU | ChineseQA |
|---|---|---|---|---|---|---|---|---|
| DeepSeek LLM 7B Chat | 57.9 | 49.4 | 62.6 | 48.2 | 42.3 | 47.0 | 49.7 | 75.0 |
| DeepSeek LLM 67B Chat | 81.5 | 71.1 | 84.1 | 73.8 | 71.7 | 65.2 | 67.8 | 85.1 |
- Never-seen-before exams: On the instruction-following evaluation dataset released by Google, DeepSeek LLM 67B Chat performs well under the prompt-level loose metric.
- LeetCode Weekly Contest: Questions from Weekly Contest 351-372 and Bi-Weekly Contest 108-117 (126 questions, each with more than 20 test cases) are used to evaluate coding ability. The specific questions and score charts will be released soon.
Advantages and limitations
- Competitive Performance: DeepSeek LLM 67B Base outperforms LLaMA-2 70B on multiple benchmarks, especially on Chinese and coding tasks.
- Multilingual capabilities: English and Chinese performance are balanced, with a ChineseQA score of 87.6%, suitable for global users.
- Open Source: The MIT License and model licenses that support commercial use promote transparency and community engagement.
- Coding and Math Skills: High scores on HumanEval and GSM8K show their usefulness in development and problem solving.
- Model size: 67 billion parameters require substantial computational resources, limiting deployment on resource-constrained devices.
- Potential bias: The training data may contain biases from the network data, potentially generating discriminatory or inaccurate responses.
- Hallucination problem: The model may generate outputs that appear reasonable but are in fact wrong, affecting reliability.
- Duplication: Duplicate content may appear when generating responses, reducing output diversity.
Usage and Accessibility
Key issues and limitations
- For the 7B model on 1 NVIDIA A100-PCIE-40GB GPU, peak memory usage across different batch sizes and sequence lengths ranges from 13.29 GB up to out-of-memory (OOM).
- For the 67B model on 8 NVIDIA A100-PCIE-40GB GPUs, peak memory ranges from 16.92 GB up to OOM. A rough estimate of where these figures come from is sketched below.
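These numbers are roughly what parameter counts alone would predict. A back-of-envelope sketch, assuming 2-byte (bf16) weights and a standard key/value cache; the layer and head dimensions used in the example call are illustrative, not exact model hyperparameters:

```python
# Back-of-envelope GPU memory estimate for inference in bf16/fp16 (2 bytes per
# parameter), ignoring activations and framework overhead. The KV-cache formula
# assumes a plain per-token key/value cache; actual usage will differ.
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param / 1024**3

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per / 1024**3

# 7B weights alone are ~13 GB in bf16, which matches the observed ~13.29 GB
# floor on a single 40 GB A100; larger batches and longer sequences grow the
# KV cache until the card runs out of memory (OOM).
print(f"7B weights : {weight_memory_gb(7e9):.1f} GB")
print(f"67B weights: {weight_memory_gb(67e9):.1f} GB  # needs multiple GPUs")
print(f"KV cache   : {kv_cache_gb(30, 32, 128, seq_len=4096, batch=8):.1f} GB  # illustrative dims")
```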
Critical perspectives and future directions
Introducing DeepSeek LLM, an advanced language model comprising 67 billion parameters. It has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. In order to foster research, we have made DeepSeek LLM 7B/67B Base and DeepSeek LLM 7B/67B Chat open source for the research community.
Superior General Capabilities: DeepSeek LLM 67B Base outperforms Llama2 70B Base in areas such as reasoning, coding, math, and Chinese comprehension.
Proficient in Coding and Math: DeepSeek LLM 67B Chat exhibits outstanding performance in coding (HumanEval Pass@1: 73.78) and mathematics (GSM8K 0-shot: 84.1, Math 0-shot: 32.6). It also demonstrates remarkable generalization abilities, as evidenced by its exceptional score of 65 on the Hungarian National High School Exam.
Mastery in Chinese Language: Based on our evaluation, DeepSeek LLM 67B Chat surpasses GPT-3.5 in Chinese.
We release the DeepSeek LLM 7B/67B, including both base and chat models, to the public. To support a broader and more diverse range of research within both academic and commercial communities, we are providing access to the intermediate checkpoints of the base model from its training process. Please note that the use of this model is subject to the terms outlined in License section. Commercial usage is permitted under these terms.
| Model | Sequence Length | Download |
|---|---|---|
| DeepSeek LLM 7B Base | 4096 | 🤗 HuggingFace |
| DeepSeek LLM 7B Chat | 4096 | 🤗 HuggingFace |
| DeepSeek LLM 67B Base | 4096 | 🤗 HuggingFace |
| DeepSeek LLM 67B Chat | 4096 | 🤗 HuggingFace |
We host the intermediate checkpoints of DeepSeek LLM 7B/67B on AWS S3 (Simple Storage Service). These files can be downloaded using the AWS Command Line Interface (CLI).
# using AWS CLI
# DeepSeek-LLM-7B-Base
aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-7B-Base <local_path> --recursive --request-payer requester
# DeepSeek-LLM-67B-Base
aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-67B-Base <local_path> --recursive --request-payer requester
We evaluate our models and some baseline models on a series of representative benchmarks, both in English and Chinese. More results can be found in the evaluation folder. In this part, the evaluation results we report are based on the internal, non-open-source hai-llm evaluation framework. Please note that there may be slight discrepancies when using the converted HuggingFace models.
| Model | HellaSwag (0-shot) | TriviaQA (5-shot) | MMLU (5-shot) | GSM8K (8-shot) | HumanEval (0-shot) | BBH (3-shot) | C-Eval (5-shot) | CMMLU (5-shot) | ChineseQA (5-shot) |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 75.6 | 63.8 | 45.8 | 15.5 | 14.6 | 38.5 | 33.9 | 32.6 | 21.5 |
| LLaMA-2 70B | 84.0 | 79.5 | 69.0 | 58.4 | 28.7 | 62.9 | 51.4 | 53.1 | 50.2 |
| DeepSeek LLM 7B Base | 75.4 | 59.7 | 48.2 | 17.4 | 26.2 | 39.5 | 45.0 | 47.2 | 78.0 |
| DeepSeek LLM 67B Base | 84.0 | 78.9 | 71.3 | 63.4 | 42.7 | 68.7 | 66.1 | 70.8 | 87.6 |
Note: ChineseQA is an in-house benchmark, inspired by TriviaQA.
To address data contamination and tuning for specific testsets, we have designed fresh problem sets to assess the capabilities of open-source LLM models. The evaluation results indicate that DeepSeek LLM 67B Chat performs exceptionally well on never-before-seen exams.
Hungarian National High-School Exam: In line with Grok-1, we have evaluated the model's mathematical capabilities using the Hungarian National High School Exam. This exam comprises 33 problems, and the model's scores are determined through human annotation. We follow the scoring metric in the solution.pdf to evaluate all models.
Remark: We have rectified an error from our initial evaluation. In this revised version, we have omitted the lowest scores for questions 16, 17, 18, as well as for the aforementioned image. Evaluation details are here.
Instruction Following Evaluation: On Nov 15th, 2023, Google released an instruction following evaluation dataset. They identified 25 types of verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We use the prompt-level loose metric to evaluate all models. Here, we used the first version released by Google for the evaluation. For the Google revised test set evaluation results, please refer to the number in our paper.
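Under this metric, a prompt counts as solved only if every verifiable instruction it contains is satisfied, after a few lenient transformations of the response. A simplified sketch of that aggregation, with placeholder checkers and variants rather than Google's actual implementation:

```python
# Simplified sketch of a prompt-level "loose" accuracy. Each prompt carries one
# or more verifiable instructions; it counts as correct only if ALL of them pass
# for at least one leniently transformed variant of the response.
# The variants and checkers below are placeholders, not the official ones.
from typing import Callable, List

def loose_variants(response: str) -> List[str]:
    lines = response.splitlines()
    variants = [response, response.replace("*", "")]       # raw and markdown-stripped
    if len(lines) > 1:
        variants.append("\n".join(lines[1:]))   # drop a leading preamble line
        variants.append("\n".join(lines[:-1]))  # drop a trailing sign-off line
    return variants

def prompt_level_loose_accuracy(
    responses: List[str],
    instruction_checkers: List[List[Callable[[str], bool]]],
) -> float:
    solved = 0
    for response, checkers in zip(responses, instruction_checkers):
        # Solved if some lenient variant satisfies every instruction in the prompt.
        if any(all(check(v) for check in checkers) for v in loose_variants(response)):
            solved += 1
    return solved / len(responses)

# Illustrative usage: one prompt with two verifiable instructions.
demo_response = "In summary, " + "word " * 60
demo_checkers = [[lambda r: len(r.split()) >= 50, lambda r: "in summary" in r.lower()]]
print(prompt_level_loose_accuracy([demo_response], demo_checkers))  # -> 1.0
```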
LeetCode Weekly Contest: To assess the coding proficiency of the model, we have utilized problems from the LeetCode Weekly Contest (Weekly Contest 351-372, Bi-Weekly Contest 108-117, from July 2023 to Nov 2023). We have obtained these problems by crawling data from LeetCode, which consists of 126 problems with over 20 test cases for each. The evaluation metric employed is akin to that of HumanEval. In this regard, if a model's outputs successfully pass all test cases, the model is considered to have effectively solved the problem. The model's coding capabilities are depicted in the Figure below, where the y-axis represents the pass@1 score on in-domain human evaluation testing, and the x-axis represents the pass@1 score on out-domain LeetCode Weekly Contest problems.
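A minimal sketch of this all-test-cases-must-pass scoring follows; the runner and data format are assumptions for illustration, not the actual harness, and a real harness must sandbox generated code:

```python
# Sketch of "a problem is solved only if every test case passes" scoring, akin
# to HumanEval pass@1. run_candidate() is a placeholder: a real harness must
# sandbox the generated program and enforce per-test time/memory limits.
import subprocess
import sys

def run_candidate(code: str, stdin_data: str, timeout: float = 5.0) -> str:
    proc = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin_data, capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout.strip()

def solved(code: str, test_cases: list) -> bool:
    # Solved only if every (input, expected_output) pair matches.
    try:
        return all(run_candidate(code, inp) == expected.strip() for inp, expected in test_cases)
    except Exception:  # runtime error, timeout, etc. count as a failure
        return False

def pass_at_1(completions: dict, tests: dict) -> float:
    # One completion per problem; pass@1 is the fraction of problems solved.
    results = [solved(completions[pid], tests[pid]) for pid in completions]
    return sum(results) / len(results)
```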
The specific questions and test cases will be released soon. Stay tuned!
Standard Benchmark
| Model | TriviaQA | MMLU | GSM8K | HumanEval | BBH | C-Eval | CMMLU | ChineseQA |
|---|---|---|---|---|---|---|---|---|
| DeepSeek LLM 7B Base | 59.7 | 48.2 | 17.4 | 26.2 | 39.5 | 45.0 | 47.2 | 78.0 |
| DeepSeek LLM 67B Base | 78.9 | 71.3 | 63.4 | 42.7 | 68.7 | 66.1 | 70.8 | 87.6 |
| DeepSeek LLM 7B Chat | 57.9 | 49.4 | 62.6 | 48.2 | 42.3 | 47.0 | 49.7 | 75.0 |
| DeepSeek LLM 67B Chat | 81.5 | 71.1 | 84.1 | 73.8 | 71.7 | 65.2 | 67.8 | 85.1 |
Note: We evaluate chat models with 0-shot for MMLU, GSM8K, C-Eval, and CMMLU. More evaluation results can be found here.
Revisit Multi-Choice Question Benchmarks
Based on our experimental observations, we have discovered that enhancing benchmark performance using multi-choice (MC) questions, such as MMLU, CMMLU, and C-Eval, is a relatively straightforward task. By incorporating multi-choice questions from Chinese exams, we have achieved exceptional results, as depicted in the table below:
| Model | MMLU | C-Eval | CMMLU |
|---|---|---|---|
| DeepSeek LLM 7B Chat | 49.4 | 47.0 | 49.7 |
| DeepSeek LLM 7B Chat + MC | 60.9 | 71.3 | 73.8 |
Note: +MC represents the addition of 20 million Chinese multiple-choice questions collected from the web. It is important to note that we conducted deduplication for the C-Eval validation set and CMMLU test set to prevent data contamination. This addition not only improves Chinese multiple-choice benchmarks but also enhances English benchmarks. However, we observed that it does not enhance the model's knowledge performance on other evaluations that do not utilize the multiple-choice style in the 7B setting. As a result, we made the decision to not incorporate MC data in the pre-training or fine-tuning process, as it would lead to overfitting on benchmarks.
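The deduplication procedure is not described in detail; a simple n-gram overlap filter in that spirit might look like this, where the 10-gram granularity and exact-match rule are assumptions:

```python
# Illustrative n-gram overlap filter for decontaminating training data against
# benchmark test sets. The 10-gram granularity and exact-match rule are
# assumptions; the actual deduplication procedure is not specified.
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 10) -> Set[Tuple[str, ...]]:
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_questions: Iterable[str], n: int = 10) -> Set[Tuple[str, ...]]:
    index: Set[Tuple[str, ...]] = set()
    for question in benchmark_questions:
        index |= ngrams(question, n)
    return index

def decontaminate(training_examples: List[str], benchmark_index: Set[Tuple[str, ...]], n: int = 10) -> List[str]:
    # Drop any training example sharing at least one n-gram with the benchmark.
    return [ex for ex in training_examples if not (ngrams(ex, n) & benchmark_index)]
```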



