DeepSeek Math: Leading Open-Source AI for Mathematical Reasoning

Key Points
  • Research suggests DeepSeek Math, an open-source AI model, performs well on mathematical reasoning, achieving 51.7% on the MATH benchmark, close to some top models from 2024.
  • It seems likely that its tool use feature enhances performance, with the RL version reaching up to 60% on MATH, surpassing other open-source models.
  • The evidence leans toward it being based on DeepSeek-Coder-v1.5 7B, trained on 500B math-related tokens, and available in base, instruct, and RL versions.
  • There is controversy around whether tool use reflects true understanding or reliance on external help, impacting its practical application.

Overview
DeepSeek Math is an open-source AI model designed for mathematical reasoning, showing promising results, especially for its size. It’s built on DeepSeek-Coder-v1.5 7B and trained on 500 billion math-related tokens, making it good at both math and coding tasks. You can use it in base, instruct, or RL versions, and it’s easy to get started with tools like Hugging Face’s Transformers.
Performance and Features
It achieves 51.7% on the MATH benchmark, which is competitive with 2024-era models such as Gemini-Ultra (around 53.2%) and the original GPT-4 (around 52.9%). The RL version can reach up to 60% on MATH with tool use, beating other open-source models, which is helpful for real-world math problems. Notably, while it does well on benchmarks, there is debate about whether its tool use reflects genuine mathematical understanding or reliance on external help, which could affect its use in education.
Usage and Accessibility
It’s user-friendly, with code under the MIT license and a model license that allows commercial use. You can download it from Hugging Face and use it for text completion or chat, like solving integrals step by step. Check out the official website DeepSeek Math for more details.


Comprehensive Evaluation of DeepSeek Math
DeepSeek Math is an open-source AI model specifically designed for mathematical reasoning, building on the foundation of DeepSeek-Coder-v1.5 7B and further trained on math-related data. This evaluation, conducted as of March 2, 2025, provides a detailed analysis of its performance, features, and potential, aiming to offer a thorough understanding for both technical and non-technical audiences. The assessment takes a critical, forward-looking perspective while maintaining a balanced view of the model's strengths and limitations.
Introduction and Background
DeepSeek Math is initialized from DeepSeek-Coder-v1.5 7B, a code-focused language model, and undergoes continued pre-training on 500 billion tokens, including math-related data sourced from Common Crawl, alongside natural language and code data. This process aims to enhance its mathematical reasoning capabilities, making it suitable for tasks ranging from problem-solving to formal theorem proving. The model is released in three variants: base, instruct, and RL (reinforcement learning), each catering to different use cases, and is available for public use under the MIT license for code and a separate model license that supports commercial applications.
Performance Evaluation
The model’s performance has been rigorously assessed across multiple benchmarks, with a focus on mathematical reasoning, tool use, and general capabilities. The key results are summarized below, based on the provided data and external benchmarks for context.
Mathematical Reasoning
  • On the competition-level MATH benchmark, DeepSeekMath-Base 7B achieves an impressive 51.7% accuracy without external toolkits or voting techniques. This score is notable for a 7B-parameter model, as it approaches the 2024 reported performance of Gemini-Ultra (53.2%) and the original GPT-4 (around 52.9%). However, it lags behind more recent closed-source models such as GPT-4o, which scores 76.6% on MATH.
  • The model excels in few-shot chain-of-thought prompting, outperforming existing open-source base models by more than 10% in absolute terms and surpassing Minerva 540B, a larger model, on this metric.
Tool Use and Enhanced Capabilities
  • DeepSeekMath-RL 7B, trained with the Group Relative Policy Optimization (GRPO) algorithm, demonstrates enhanced performance with tool use, approaching 60% accuracy on the MATH benchmark. This surpasses all existing open-source models, highlighting its ability to leverage external tools like code execution or calculators to solve problems. This feature is particularly valuable for real-world applications where AI systems need to interact with their environment.
  • The base model’s continued pre-training with DeepSeekCoder-Base-7B-v1.5 enables effective problem-solving by writing programs, which is a strength for tasks requiring symbolic manipulation.
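As a concrete illustration of this program-writing style, consider a toy MATH-style counting question (the problem below is our own example, not one from the paper's benchmarks): rather than reasoning purely in text, a code-grounded model can emit and run a short script.

```python
# Toy question: "How many positive integers below 100 are divisible
# by 3 or by 5?" A code-grounded model can answer by writing a program
# instead of carrying out the inclusion-exclusion count in prose.
count = sum(1 for n in range(1, 100) if n % 3 == 0 or n % 5 == 0)
print(count)  # inclusion-exclusion check: 33 + 19 - 6 = 46
```

The program makes the arithmetic exact and auditable, which is precisely where a coding foundation helps on symbolic-manipulation tasks.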
General Capabilities
  • Beyond mathematics, DeepSeekMath-Base 7B shows comparable performance in natural language understanding, reasoning, and coding, aligning with its predecessor, DeepSeek-Coder-Base-7B-v1.5. This versatility makes it suitable for a range of tasks, including text completion and chat-based interactions.
The evaluation results are summarized in the following tables for clarity:
| Benchmark | DeepSeekMath 7B Result | Comparison (Context) |
| --- | --- | --- |
| MATH (competition-level), base model | 51.7% | Approaches Gemini-Ultra (53.2%), lags GPT-4o (76.6%) |
| Few-shot chain-of-thought | >10% absolute above open-source base models | Surpasses Minerva 540B |
| Tool use (RL version) | Up to 60% on MATH | Surpasses all existing open-source models |
| Capability | Performance Notes |
| --- | --- |
| Natural language understanding | Comparable to DeepSeek-Coder-Base-7B-v1.5 |
| Reasoning | Strong, especially with step-by-step prompting |
| Coding | Effective, leveraging code generation capabilities |
Data Collection and Training
The training data for DeepSeek Math is meticulously curated through a multi-step process:
  1. Selection of OpenWebMath, a high-quality mathematical web text collection, as the initial seed corpus for training a FastText model.
  2. Use of the FastText model to retrieve mathematical web pages from the deduplicated Common Crawl database.
  3. Identification of math-related domains through statistical analysis, followed by manual annotation of URLs associated with mathematical content.
  4. Iterative expansion of the seed corpus by adding linked web pages, repeated for four iterations, resulting in 35.5 million mathematical web pages totaling 120 billion tokens.
This process ensures a robust dataset focused on mathematical content, enhancing the model’s ability to handle diverse problem types.
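The iterative expansion above can be sketched as a small loop. Everything in this sketch is a stand-in (the keyword-based classifier, the toy pages, and the link graph): the real pipeline trains a fastText classifier on OpenWebMath and scans the deduplicated Common Crawl at scale.

```python
# Toy sketch of the iterative seed-corpus expansion. The classifier,
# pages, and link graph below are stand-ins for the fastText model
# and Common Crawl data used in the actual pipeline.

def looks_mathematical(page_text):
    # Stand-in for the fastText relevance classifier.
    keywords = ("theorem", "integral", "equation", "proof")
    return any(k in page_text for k in keywords)

def expand_corpus(seed_urls, fetch, links, iterations=4):
    """Grow the corpus by following links from already-collected pages."""
    corpus = set(seed_urls)
    for _ in range(iterations):
        frontier = set()
        for url in corpus:
            for linked in links.get(url, []):
                if linked not in corpus and looks_mathematical(fetch(linked)):
                    frontier.add(linked)
        corpus |= frontier
    return corpus

# Toy web: page "b" is non-mathematical and is filtered out.
pages = {
    "a": "a theorem about primes",
    "b": "cooking recipes",
    "c": "an integral identity",
    "d": "proof of a lemma",
}
links = {"a": ["b", "c"], "c": ["d"]}
corpus = expand_corpus(["a"], fetch=pages.get, links=links)
```

The filtering step is what keeps each iteration from diluting the corpus with general web text, mirroring the FastText retrieval plus manual domain annotation described above.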
Usage and Accessibility
DeepSeek Math is user-friendly, with support for inference through Hugging Face’s Transformers library. Examples include:
  • Text Completion: For instance, inputting “The integral of x^2 from 0 to 2 is” generates a step-by-step solution, as shown in the quick start guide.
  • Chat Completion: Users can interact with the instruct or RL versions using a chat template, such as asking, “What is the integral of x^2 from 0 to 2? Please reason step by step, and put your final answer within \boxed{}.” The model responds with detailed reasoning, enhancing its educational utility.
Model downloads are available on Hugging Face, with a sequence length of 4096 for all variants (base, instruct, and RL). The quick start guide provides code snippets for both text and chat completion, making it accessible for developers and researchers.
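A minimal chat-completion sketch with Hugging Face's Transformers is shown below, assuming the `deepseek-ai/deepseek-math-7b-instruct` checkpoint from the hub. The prompt wording follows the quick start example; the generation settings are illustrative, not prescribed.

```python
MODEL_ID = "deepseek-ai/deepseek-math-7b-instruct"

def build_messages(question):
    """Wrap a math question in the step-by-step prompt from the quick start."""
    return [{
        "role": "user",
        "content": question + " Please reason step by step, "
                   "and put your final answer within \\boxed{}.",
    }]

def solve(question, max_new_tokens=512):
    """Load the instruct model and generate a step-by-step solution."""
    # Heavy imports kept inside the function so the prompt helper
    # stays usable without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    input_ids = tokenizer.apply_chat_template(
        build_messages(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(
        output[0][input_ids.shape[1]:], skip_special_tokens=True
    )

# solve("What is the integral of x^2 from 0 to 2?")  # downloads ~13 GB of weights
```

The `\boxed{}` instruction matters in practice: it gives downstream code a fixed marker from which to extract the final answer.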
Strengths and Limitations
Strengths:
  • Open-Source Nature: The MIT license for code and commercial-friendly model license foster transparency and community involvement, crucial for ethical AI development.
  • Competitive Performance: Achieving 51.7% on MATH with 7B parameters is a significant achievement, especially compared to larger open-source models.
  • Tool Use Capability: The RL version’s ability to reach 60% on MATH with tool use is a notable advancement, enhancing practical applicability.
  • Versatility: Its foundation in a code model ensures strong performance in coding-related math tasks, broadening its use cases.
Limitations:
  • Model Size: At 7B parameters, it is smaller than many closed-source models, potentially limiting its ability to handle highly complex problems.
  • Performance Gap: While competitive with 2024 models, it lags behind the latest closed-source models like GPT-4o (76.6% on MATH), indicating room for improvement.
  • Tool Use Controversy: There is debate about whether tool use reflects true mathematical understanding or reliance on external resources, which could impact its effectiveness in educational settings where intrinsic reasoning is key.
  • Training Data Concerns: The reliance on Common Crawl data may introduce biases or noise, potentially affecting the model’s precision in mathematical tasks.
Critical Perspective and Future Directions
From a critical standpoint, DeepSeek Math represents a significant step forward for open-source AI in mathematical reasoning. However, to remain relevant, the developers must address the performance gap with closed-source models by scaling up the model size, improving training data quality, and enhancing intrinsic reasoning capabilities. The tool use feature is a pragmatic approach, aligning with real-world AI applications, but further research is needed to ensure it complements rather than compensates for the model’s understanding.
Future directions could include integrating DeepSeek Math with educational platforms, assisting students and researchers in solving complex problems, and exploring its potential in formal theorem proving. Additionally, ensuring safety and ethical use is paramount, given its open-source nature, to prevent misuse in generating misinformation or solving problems that could have harmful implications.
Conclusion
DeepSeek Math is a valuable addition to the open-source AI ecosystem, particularly for its mathematical reasoning capabilities. Its competitive benchmark scores, user-friendly interface, and commercial-friendly licensing make it a promising tool for researchers and developers. However, challenges remain in closing the performance gap with closed-source models and addressing concerns about tool use and training data. It stands as a building block for future advancements in AI-driven mathematical intelligence, with the potential to evolve into a more robust and versatile system.

1. Introduction

DeepSeekMath is initialized from DeepSeek-Coder-v1.5 7B and continues pre-training for 500B tokens on math-related data sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits or voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. For research purposes, we release checkpoints of the base, instruct, and RL models to the public.


2. Evaluation Results

DeepSeekMath-Base 7B

We conduct a comprehensive assessment of the mathematical capabilities of DeepSeekMath-Base 7B, focusing on its ability to produce self-contained mathematical solutions without relying on external tools, solve math problems using tools, and conduct formal theorem proving. Beyond mathematics, we also provide a more general profile of the base model, including its performance in natural language understanding, reasoning, and programming skills.

  • Mathematical problem solving with step-by-step reasoning


  • Mathematical problem solving with tool use


  • Natural Language Understanding, Reasoning, and Code


The evaluation results from the tables above can be summarized as follows:

  • Superior Mathematical Reasoning: On the competition-level MATH dataset, DeepSeekMath-Base 7B outperforms existing open-source base models by more than 10% in absolute terms through few-shot chain-of-thought prompting, and also surpasses Minerva 540B.
  • Strong Tool Use Ability: Continuing pre-training with DeepSeekCoder-Base-7B-v1.5 enables DeepSeekMath-Base 7B to more effectively solve and prove mathematical problems by writing programs.
  • Comparable Reasoning and Coding Performance: DeepSeekMath-Base 7B achieves performance in reasoning and coding that is comparable to that of DeepSeekCoder-Base-7B-v1.5.

DeepSeekMath-Instruct and -RL 7B

DeepSeekMath-Instruct 7B is a mathematically instructed tuning model derived from DeepSeekMath-Base 7B, while DeepSeekMath-RL 7B is trained on the foundation of DeepSeekMath-Instruct 7B, utilizing our proposed Group Relative Policy Optimization (GRPO) algorithm.
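The group-relative part of GRPO can be sketched in a few lines: sample a group of completions per question, score them, and normalize each reward against the group's mean and standard deviation so no separate value network is needed. The rewards below are hypothetical, and the full algorithm additionally applies a clipped policy-gradient objective with a KL penalty against a reference policy; this sketch covers only the advantage computation.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Advantage of each sampled completion relative to its group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # A zero-variance group (all right or all wrong) carries no signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Four completions sampled for one question, scored 1.0 if the final
# answer was correct and 0.0 otherwise (hypothetical rewards):
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct completions receive positive advantages and incorrect ones negative, so the policy is pushed toward the better members of each group rather than toward an absolute reward scale.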

We evaluate mathematical performance both without and with tool use on 4 quantitative reasoning benchmarks in English and Chinese. DeepSeekMath-Instruct 7B demonstrates strong step-by-step reasoning, and DeepSeekMath-RL 7B approaches an accuracy of 60% on MATH with tool use, surpassing all existing open-source models.


3. Data Collection

  • Step 1: Select OpenWebMath, a collection of high-quality mathematical web texts, as our initial seed corpus for training a FastText model.
  • Step 2: Use the FastText model to retrieve mathematical web pages from the deduplicated Common Crawl database.
  • Step 3: Identify potential math-related domains through statistical analysis.
  • Step 4: Manually annotate URLs within these identified domains that are associated with mathematical content.
  • Step 5: Add web pages linked to these annotated URLs, but not yet collected, to the seed corpus, then return to Step 1. The process is repeated for four iterations.


After four iterations of data collection, we end up with 35.5M mathematical web pages, totaling 120B tokens.

4. Model Downloads

We release DeepSeekMath 7B, including the base, instruct, and RL models, to the public to support a broader and more diverse range of research within both academic and commercial communities. Please note that the use of this model is subject to the terms outlined in the License section. Commercial usage is permitted under these terms.

Huggingface

| Model | Sequence Length | Download |
| --- | --- | --- |
| DeepSeekMath-Base 7B | 4096 | 🤗 HuggingFace |
| DeepSeekMath-Instruct 7B | 4096 | 🤗 HuggingFace |
| DeepSeekMath-RL 7B | 4096 | 🤗 HuggingFace |
