Comprehensive evaluation of DeepSeek LLM capabilities

Key Takeaways

  • Research shows that DeepSeek LLM is an open-source language model with 67 billion parameters (a 7B version is also available), trained on 2 trillion tokens of English and Chinese data.
  • The evidence points to strong performance in reasoning, coding, math, and Chinese comprehension, with particular strengths in the HumanEval and GSM8K benchmarks.
  • It seems possible that its chat version outperforms GPT-3.5 on some tasks, especially in Chinese processing.
  • There are caveats, however: the training data may contain biases, and the model's size may limit its use on resource-constrained devices.

Introduction

What is DeepSeek LLM?

DeepSeek LLM is an open-source language model developed by DeepSeek AI, comprising 67 billion parameters and trained on 2 trillion tokens of English and Chinese data (a 7B version is also available). It is released in Base and Chat versions, designed to support research and commercial applications, and performs particularly well on reasoning, coding, and mathematical tasks.

Performance highlights

The study shows that the DeepSeek LLM 67B base version outperforms the Llama2 70B base version on multiple benchmarks, especially in Chinese comprehension and coding tasks. The chat version achieves a pass@1 score of 73.8% on HumanEval and 84.1% on the GSM8K 0-shot test, showing strong math and coding capabilities. In addition, it scores 65 on the Hungarian National High School Exam, demonstrating excellent generalization.

Usage

Users can download DeepSeek LLM through Hugging Face and use the Transformers library for text completion and chat interaction. Installation only requires running pip install -r requirements.txt with Python 3.8 or above.

An unexpected detail

DeepSeek LLM not only performs well among open-source models; its chat version even surpasses GPT-3.5 in Chinese processing, which is rare in multilingual models.

Comprehensive evaluation of DeepSeek LLM capabilities

DeepSeek LLM is an open-source language model developed by DeepSeek AI that aims to provide competitive performance in reasoning, coding, mathematics, and language understanding with its 67 billion parameters and 2 trillion tokens of training data. This evaluation, current as of 21:19 EST on March 2, 2025, aims to provide an in-depth analysis for both technical and non-technical readers, covering the model's background, performance, strengths, and limitations in a professional style that strives to be comprehensive and objective.

Introduction and Background


DeepSeek LLM was released in November 2023 and trained from scratch on a dataset of 2 trillion English and Chinese tokens, with the goal of advancing open-source language model research. It is offered in a base version (Base) and a chat version (Chat) suited to different scenarios: the base version serves as a foundation for research, while the chat version enhances interaction through supervised fine-tuning and direct preference optimization (DPO). The model is open source on GitHub; the code is under the MIT license, use of the model is subject to the model license, and commercial use is supported.

Model details


The key technical parameters of DeepSeek LLM are as follows:
  • Number of parameters: 67 billion (a 7 billion parameter version is also available)
  • Training data: 2 trillion tokens covering internet text, math, code, books, and self-collected data; collection respects robots.txt and excludes private and copyrighted content
  • Sequence length: 4096
  • Architecture: autoregressive Transformer decoder, similar to LLaMA; the 7B model uses Multi-Head Attention (MHA) while the 67B model uses Grouped-Query Attention (GQA), as sketched after this list
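To make the MHA/GQA distinction concrete, the following is a minimal PyTorch sketch of grouped-query attention, not DeepSeek's actual implementation; masking, caching, and projection layers are omitted, and all shapes and names are illustrative. With n_kv_heads equal to n_heads it reduces to standard MHA; with fewer KV heads, each key/value head is shared by a group of query heads, which shrinks the KV cache.

import torch

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    # q: (batch, seq, n_heads*head_dim); k, v: (batch, seq, n_kv_heads*head_dim)
    b, s, _ = q.shape
    d = q.shape[-1] // n_heads
    q = q.view(b, s, n_heads, d).transpose(1, 2)     # (b, n_heads, s, d)
    k = k.view(b, s, n_kv_heads, d).transpose(1, 2)  # (b, n_kv_heads, s, d)
    v = v.view(b, s, n_kv_heads, d).transpose(1, 2)
    # Each group of (n_heads // n_kv_heads) query heads shares one KV head.
    k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)
    v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return (torch.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, s, -1)

# n_kv_heads == n_heads recovers MHA; n_kv_heads < n_heads is GQA.
q = torch.randn(1, 8, 512)
k = v = torch.randn(1, 8, 128)  # 2 KV heads with head_dim 64
out = grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=2)  # (1, 8, 512)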
The training process uses the AdamW optimizer with a batch size of 2304 and a learning rate of 4.2e-4 for the 7B model, and a batch size of 4608 and a learning rate of 3.2e-4 for the 67B model. The learning rate schedule includes a 2000-step warm-up, followed by reductions to 31.6% and 10% of the maximum value at the 1.6 trillion and 1.8 trillion token marks, respectively.
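Written out, this schedule amounts to a simple step function. The sketch below encodes the quoted figures, with the 1.6T and 1.8T token boundaries expressed as fractions of a 2T-token run (0.8 and 0.9); it is a reconstruction from the text, not DeepSeek's training code.

def multi_step_lr(step, total_steps, max_lr=4.2e-4, warmup_steps=2000):
    # Linear warm-up over the first 2000 steps.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    frac = step / total_steps
    if frac < 0.8:        # up to 1.6T of 2T tokens: full learning rate
        return max_lr
    if frac < 0.9:        # 1.6T to 1.8T tokens: 31.6% of the maximum
        return max_lr * 0.316
    return max_lr * 0.1   # final stretch: 10% of the maximum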
Data processing is handled by a distributed batch-processing system called “cc_cleaner”, which builds on RefinedWeb and CCNet, applies heuristic rules and model-based filtering to remove low-quality content, and uses MinHash LSH for strict deduplication to ensure data uniqueness and integrity.
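As an illustration of the deduplication stage, here is a minimal near-duplicate filter built on MinHash LSH with the third-party datasketch package. The whitespace tokenization, 128 permutations, and 0.8 Jaccard threshold are assumptions for the sketch, not parameters disclosed by DeepSeek.

from datasketch import MinHash, MinHashLSH

def signature(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):   # crude shingling by whitespace tokens
        m.update(token.encode("utf-8"))
    return m

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",  # near-duplicate
    "an entirely different document about attention",
]
lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard similarity cutoff
kept = []
for i, doc in enumerate(corpus):
    sig = signature(doc)
    if not lsh.query(sig):             # keep only if no near-duplicate was kept
        lsh.insert(f"doc-{i}", sig)
        kept.append(doc)
print(len(kept))                       # 2: the near-duplicate is dropped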

Performance Evaluation


DeepSeek LLM was evaluated on multiple benchmarks covering reasoning, coding, math, and language understanding, with LLaMA-2 7B and 70B as baselines. Here are the key results:
Comparison between the base models and LLaMA-2:

| Model | HellaSwag | TriviaQA | MMLU | GSM8K | HumanEval | BBH | C-Eval | CMMLU | ChineseQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-2 7B | 75.6 | 63.8 | 45.8 | 15.5 | 14.6 | 38.5 | 33.9 | 32.6 | 21.5 |
| LLaMA-2 70B | 84.0 | 79.5 | 69.0 | 58.4 | 28.7 | 62.9 | 51.4 | 53.1 | 50.2 |
| DeepSeek LLM 7B Base | 75.4 | 59.7 | 48.2 | 17.4 | 26.2 | 39.5 | 45.0 | 47.2 | 78.0 |
| DeepSeek LLM 67B Base | 84.0 | 78.9 | 71.3 | 63.4 | 42.7 | 68.7 | 66.1 | 70.8 | 87.6 |

Note: The above results are based on the internal, non-open-source hai-llm evaluation framework and may differ slightly from the converted HuggingFace models.
As the table shows, DeepSeek LLM 67B Base outperforms LLaMA-2 70B on MMLU, GSM8K, HumanEval, BBH, C-Eval, CMMLU, and ChineseQA, especially on Chinese-related tasks (ChineseQA: 87.6% vs. 50.2%) and coding tasks (HumanEval: 42.7% vs. 28.7%).

Chat model performance:
The chat version is further optimized through supervised fine-tuning and DPO. The 0-shot evaluation results are as follows:

| Model | TriviaQA | MMLU | GSM8K | HumanEval | BBH | C-Eval | CMMLU | ChineseQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek LLM 7B Chat | 57.9 | 49.4 | 62.6 | 48.2 | 42.3 | 47.0 | 49.7 | 75.0 |
| DeepSeek LLM 67B Chat | 81.5 | 71.1 | 84.1 | 73.8 | 71.7 | 65.2 | 67.8 | 85.1 |
The chat version achieved 73.8% pass@1 on HumanEval and 84.1% on GSM8K, demonstrating strong math and coding skills. In addition, it scored 65 on the Hungarian National High School Exam, indicating strong mathematical reasoning skills.
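For context on these pass@1 numbers, HumanEval-style scores are typically computed with the unbiased pass@k estimator of Chen et al. (2021). A minimal sketch, assuming n generated samples per problem of which c pass all tests:

from math import comb

def pass_at_k(n, c, k):
    # Probability that at least one of k samples is correct, given
    # n samples per problem of which c pass all test cases.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (n = k = 1), pass@1 is just the solve rate.
print(pass_at_k(n=10, c=3, k=1))  # 0.3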

Other assessments:

  • Never-seen-before exams: On the instruction-following evaluation dataset released by Google, DeepSeek LLM 67B Chat performs well under the prompt-level loose metric.
  • LeetCode Weekly Contest: Questions from Weekly Contests 351-372 and Bi-Weekly Contests 108-117 (126 questions, each with more than 20 test cases) are used to evaluate coding ability. Specific score charts will be released soon.

Advantages and limitations


Advantages:

  1. Competitive performance: DeepSeek LLM 67B Base outperforms LLaMA-2 70B on multiple benchmarks, especially on Chinese and coding tasks.
  2. Multilingual capabilities: English and Chinese performance are balanced, with a ChineseQA score of 87.6%, suitable for global users.
  3. Open Source: The MIT License and model licenses that support commercial use promote transparency and community engagement.
  4. Coding and Math Skills: High scores on HumanEval and GSM8K show their usefulness in development and problem solving.
Limitations:

  1. Model size: the 67-billion-parameter model requires substantial computational resources, limiting its deployment on resource-constrained devices.
  2. Potential bias: The training data may contain biases from the network data, potentially generating discriminatory or inaccurate responses.
  3. Hallucination problem: The model may generate outputs that appear reasonable but are in fact wrong, affecting reliability.
  4. Duplication: Duplicate content may appear when generating responses, reducing output diversity.

Usage and Accessibility


DeepSeek LLM can be downloaded from Hugging Face and supports inference with the Transformers library. Installation requires Python 3.8 or above; run pip install -r requirements.txt.
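A minimal text-completion sketch with the Transformers library is shown below; the repository id deepseek-ai/deepseek-llm-7b-base is the expected Hugging Face name, and bfloat16 with device_map="auto" (which requires the accelerate package) is one reasonable loading choice, not the only one.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/deepseek-llm-7b-base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "An attention function can be described as"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))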

In addition, vLLM is supported for high-throughput inference, making it suitable for large-scale deployment.
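A corresponding vLLM sketch for batched generation follows; the sampling parameters are illustrative, and the repo id is the same assumed one as above.

from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/deepseek-llm-7b-base")  # assumed repo id
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

prompts = [
    "An attention function can be described as",
    "The capital of Hungary is",
]
for output in llm.generate(prompts, params):  # batched, high-throughput decode
    print(output.outputs[0].text)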

Key issues and limitations


Users often ask about model quantization. DeepSeek LLM uses the HuggingFace tokenizers library to implement a byte-level BPE algorithm, which cannot be directly converted to a SentencePiece tokenizer. However, the team is contributing open-source quantization methods to support GGUF and GPTQ.
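To inspect the tokenizer on its own (same assumed repo id as above), a quick illustrative check of the byte-level BPE behavior might look like this:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base")  # assumed repo id
# Byte-level BPE: any UTF-8 string round-trips through byte-piece tokens.
ids = tok.encode("DeepSeek LLM 深度求索")
print(ids)
print(tok.decode(ids))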
GPU memory usage:
  • For the 7B model on a single NVIDIA A100-PCIE-40GB GPU, peak memory usage ranges from 13.29 GB to out-of-memory (OOM), depending on batch size and sequence length.
  • For the 67B model on 8 NVIDIA A100-PCIE-40GB GPUs, peak memory usage ranges from 16.92 GB to OOM.

Critical perspectives and future directions


From a critical perspective, DeepSeek LLM is an important advance in open-source AI: its competitive performance and multilingual capabilities provide a powerful tool for research and development. However, compared with the latest closed-source models (such as GPT-4o as of March 2025), its performance may lag, and improvement would require scaling the model and refining the training data. Future directions include integrating it into educational platforms to assist learning and research, while ensuring safe and ethical use to prevent misuse.

Conclusion

DeepSeek LLM is an important contribution to the open source AI ecosystem. Its performance on benchmarks and multilingual capabilities make it a valuable tool for researchers and developers. Despite challenges such as model size and bias, its open source nature and community support have great potential and may further promote AI innovation in the future.

Key Quotes


1. Introduction

Introducing DeepSeek LLM, an advanced language model comprising 67 billion parameters. It has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. In order to foster research, we have made DeepSeek LLM 7B/67B Base and DeepSeek LLM 7B/67B Chat open source for the research community.

Results
  • Superior General Capabilities: DeepSeek LLM 67B Base outperforms Llama2 70B Base in areas such as reasoning, coding, math, and Chinese comprehension.

  • Proficient in Coding and Math: DeepSeek LLM 67B Chat exhibits outstanding performance in coding (HumanEval Pass@1: 73.78) and mathematics (GSM8K 0-shot: 84.1, Math 0-shot: 32.6). It also demonstrates remarkable generalization abilities, as evidenced by its exceptional score of 65 on the Hungarian National High School Exam.

  • Mastery in Chinese Language: Based on our evaluation, DeepSeek LLM 67B Chat surpasses GPT-3.5 in Chinese.

2. Model Downloads

We release the DeepSeek LLM 7B/67B, including both base and chat models, to the public. To support a broader and more diverse range of research within both academic and commercial communities, we are providing access to the intermediate checkpoints of the base model from its training process. Please note that the use of this model is subject to the terms outlined in the License section. Commercial usage is permitted under these terms.

Huggingface

| Model | Sequence Length | Download |
| --- | --- | --- |
| DeepSeek LLM 7B Base | 4096 | 🤗 HuggingFace |
| DeepSeek LLM 7B Chat | 4096 | 🤗 HuggingFace |
| DeepSeek LLM 67B Base | 4096 | 🤗 HuggingFace |
| DeepSeek LLM 67B Chat | 4096 | 🤗 HuggingFace |

Intermediate Checkpoints

We host the intermediate checkpoints of DeepSeek LLM 7B/67B on AWS S3 (Simple Storage Service). These files can be downloaded using the AWS Command Line Interface (CLI).

# using AWS CLI

# DeepSeek-LLM-7B-Base
aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-7B-Base <local_path> --recursive --request-payer

# DeepSeek-LLM-67B-Base
aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-67B-Base <local_path> --recursive --request-payer

3. Evaluation Results

Base Model

We evaluate our models and some baseline models on a series of representative benchmarks, both in English and Chinese. More results can be found in the evaluation folder. In this part, the evaluation results we report are based on the internal, non-open-source hai-llm evaluation framework. Please note that there may be slight discrepancies when using the converted HuggingFace models.

| Model | HellaSwag (0-shot) | TriviaQA (5-shot) | MMLU (5-shot) | GSM8K (8-shot) | HumanEval (0-shot) | BBH (3-shot) | C-Eval (5-shot) | CMMLU (5-shot) | ChineseQA (5-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-2 7B | 75.6 | 63.8 | 45.8 | 15.5 | 14.6 | 38.5 | 33.9 | 32.6 | 21.5 |
| LLaMA-2 70B | 84.0 | 79.5 | 69.0 | 58.4 | 28.7 | 62.9 | 51.4 | 53.1 | 50.2 |
| DeepSeek LLM 7B Base | 75.4 | 59.7 | 48.2 | 17.4 | 26.2 | 39.5 | 45.0 | 47.2 | 78.0 |
| DeepSeek LLM 67B Base | 84.0 | 78.9 | 71.3 | 63.4 | 42.7 | 68.7 | 66.1 | 70.8 | 87.6 |

Note: ChineseQA is an in-house benchmark, inspired by TriviaQA.

Chat Model

Never Seen Before Exam

To address data contamination and tuning for specific testsets, we have designed fresh problem sets to assess the capabilities of open-source LLM models. The evaluation results indicate that DeepSeek LLM 67B Chat performs exceptionally well on never-before-seen exams.


Hungarian National High-School Exam: In line with Grok-1, we have evaluated the model's mathematical capabilities using the Hungarian National High School Exam. This exam comprises 33 problems, and the model's scores are determined through human annotation. We follow the scoring metric in the solution.pdf to evaluate all models.

[Figure: Hungarian National High-School Exam scores]

Remark: We have rectified an error from our initial evaluation. In this revised version, we have omitted the lowest scores for questions 16, 17, 18, as well as for the aforementioned image. Evaluation details are here.


Instruction Following Evaluation: On Nov 15th, 2023, Google released an instruction following evaluation dataset. They identified 25 types of verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We use the prompt-level loose metric to evaluate all models. Here, we used the first version released by Google for the evaluation. For the Google revised test set evaluation results, please refer to the number in our paper.

[Figure: instruction-following evaluation results]

LeetCode Weekly Contest: To assess the coding proficiency of the model, we have utilized problems from the LeetCode Weekly Contest (Weekly Contest 351-372, Bi-Weekly Contest 108-117, from July 2023 to Nov 2023). We have obtained these problems by crawling data from LeetCode, which consists of 126 problems with over 20 test cases for each. The evaluation metric employed is akin to that of HumanEval. In this regard, if a model's outputs successfully pass all test cases, the model is considered to have effectively solved the problem. The model's coding capabilities are depicted in the Figure below, where the y-axis represents the pass@1 score on in-domain human evaluation testing, and the x-axis represents the pass@1 score on out-domain LeetCode Weekly Contest problems.

[Figure: in-domain HumanEval pass@1 (y-axis) vs. out-of-domain LeetCode Weekly Contest pass@1 (x-axis)]

The specific questions and test cases will be released soon. Stay tuned!
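A minimal sketch of the pass/fail rule described above, with hypothetical test data: a problem counts as solved only when the candidate passes every test case.

def solves(candidate, test_cases):
    # A problem is solved only if every test case passes.
    return all(candidate(*args) == expected for args, expected in test_cases)

# Hypothetical example: judging a simple addition helper.
cases = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
print(solves(lambda a, b: a + b, cases))  # True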


Standard Benchmark

| Model | TriviaQA | MMLU | GSM8K | HumanEval | BBH | C-Eval | CMMLU | ChineseQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek LLM 7B Base | 59.7 | 48.2 | 17.4 | 26.2 | 39.5 | 45.0 | 47.2 | 78.0 |
| DeepSeek LLM 67B Base | 78.9 | 71.3 | 63.4 | 42.7 | 68.7 | 66.1 | 70.8 | 87.6 |
| DeepSeek LLM 7B Chat | 57.9 | 49.4 | 62.6 | 48.2 | 42.3 | 47.0 | 49.7 | 75.0 |
| DeepSeek LLM 67B Chat | 81.5 | 71.1 | 84.1 | 73.8 | 71.7 | 65.2 | 67.8 | 85.1 |

Note: We evaluate chat models with 0-shot for MMLU, GSM8K, C-Eval, and CMMLU. More evaluation results can be found here.

Revisit Multi-Choice Question Benchmarks

Based on our experimental observations, we have discovered that enhancing benchmark performance using multi-choice (MC) questions, such as MMLU, CMMLU, and C-Eval, is a relatively straightforward task. By incorporating multi-choice questions from Chinese exams, we have achieved exceptional results, as depicted in the table below:

| Model | MMLU | C-Eval | CMMLU |
| --- | --- | --- | --- |
| DeepSeek LLM 7B Chat | 49.4 | 47.0 | 49.7 |
| DeepSeek LLM 7B Chat + MC | 60.9 | 71.3 | 73.8 |

Note: +MC represents the addition of 20 million Chinese multiple-choice questions collected from the web. It is important to note that we conducted deduplication for the C-Eval validation set and CMMLU test set to prevent data contamination. This addition not only improves Chinese multiple-choice benchmarks but also enhances English benchmarks. However, we observed that it does not enhance the model's knowledge performance on other evaluations that do not utilize the multiple-choice style in the 7B setting. As a result, we made the decision to not incorporate MC data in the pre-training or fine-tuning process, as it would lead to overfitting on benchmarks.
