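Contents
I. Preface
II. DeepSeekMath
    Core goal
    Main contributions
    Key performance
    Core conclusions
Abstract
1. Introduction
1.1. Contributions
1.2. Summary of Evaluations and Metrics

I. Preface

For reference only; nothing here has been verified experimentally. Because the DeepSeekMath paper introduced the important GRPO algorithm, and given the importance of the ideas in the later DeepSeekMath-V2, this paper is worth a careful read.

II. DeepSeekMath

Paper title: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Authors: Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo
Institutions: DeepSeek-AI (lead), Tsinghua University, Peking University
Published: February 6, 2024, arXiv:2402.03300
GitHub: https://github.com/deepseek-ai/DeepSeek-Math

Core goal

Open-source language models trail far behind closed-source models such as GPT-4 and Gemini-Ultra in mathematical reasoning. The paper proposes a 7B-parameter open-source model that, for the first time, approaches top closed-source performance on a competition-level math benchmark.

Main contributions

1. Large-scale math pre-training data (DeepSeekMath Corpus)
• 120B tokens of math-related web pages, selected from Common Crawl with an iteratively retrained fastText classifier.
• Roughly 7 times the size of the math web pages used by Minerva and 9 times the size of OpenWebMath.
• Multilingual, mainly English and Chinese, with strict decontamination that filters out content overlapping with the test sets.
• Experiments indicate high data quality: DeepSeekMath-Base 7B surpasses the 540B-parameter Minerva on MATH.

2. Model training strategy
• Initialization: starts from DeepSeek-Coder-Base-v1.5 7B rather than a general LLM; code pre-training is found to help mathematical reasoning noticeably.
• Data mixture: 56% math corpus, 4% AlgebraicStack, 10% arXiv, 20% GitHub code, 10% natural language.
• Instruction tuning: covers three formats, Chain-of-Thought (CoT), Program-of-Thought (PoT), and Tool-Integrated Reasoning.

3. GRPO (Group Relative Policy Optimization)
• A variant of PPO that needs no separate critic model.
• Estimates the baseline from relative scores within a group, which sharply reduces training memory.
• Running RL on only a subset of the instruction-tuning data yields clear gains on several benchmarks:
    GSM8K: 82.9% → 88.2%
    MATH: 46.8% → 51.7%
    CMATH (Chinese): 84.6% → 88.8%

4. Unified paradigm
• Treats RFT (Rejection Sampling Fine-Tuning), DPO, PPO, and GRPO as different forms of direct or simplified reinforcement learning.
• Systematically compares online vs. offline training, outcome vs. process supervision, and single-turn vs. iterative RL.

Key performance

| Model | Parameters | MATH (competition-level) | GSM8K |
|---|---|---|---|
| GPT-4 API | closed-source | ~52% | ~92% |
| Gemini-Ultra | closed-source | ~53% | - |
| DeepSeekMath-RL 7B | 7B | 51.7% | 88.2% |
| DeepSeekMath-Instruct 7B | 7B | 46.8% | 82.9% |
| Llemma-34B | 34B | ~25% | ~54% |
| Mistral 7B | 7B | ~14% | ~40% |

Self-consistency (voting over 64 samples) raises MATH accuracy to 60.9%.

Core conclusions

• Public web data holds great mathematical value: with a carefully designed selection pipeline, Common Crawl can yield a high-quality, large-scale math pre-training corpus.
• Code pre-training helps mathematical reasoning: continuing training from a code model works better than initializing from a general LLM.
• GRPO is efficient and effective: removing the critic model greatly reduces RL training resources, and the result outperforms the instruction-tuned model.
• Math training improves general reasoning: scores on general benchmarks such as MMLU and BBH also rise, with no catastrophic forgetting.

Abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature.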
In this paper, we introduce DeepSeekMath 7B, which continues pretraining DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

Figure 1: Top-1 accuracy of open-source models on the competition-level MATH benchmark (Hendrycks et al., 2021) without the use of external toolkits and voting techniques.

1. Introduction

Large language models (LLM) have revolutionized the approach to mathematical reasoning in artificial intelligence, spurring significant advancements in both the quantitative reasoning benchmark (Hendrycks et al., 2021) and the geometry reasoning benchmark (Trinh et al., 2024).
Moreover, these models have proven instrumental in assisting humans in solving complex mathematical problems (Tao, 2023). However, cutting-edge models such as GPT-4 (OpenAI, 2023) and Gemini-Ultra (Anil et al., 2023) are not publicly available, and the currently accessible open-source models considerably trail behind in performance.

In this study, we introduce DeepSeekMath, a domain-specific language model that significantly outperforms the mathematical capabilities of open-source models and approaches the performance level of GPT-4 on academic benchmarks. To achieve this, we create the DeepSeekMath Corpus, a large-scale high-quality pre-training corpus comprising 120B math tokens. This dataset is extracted from the Common Crawl (CC) using a fastText-based classifier (Joulin et al., 2016). In the initial iteration, the classifier is trained using instances from OpenWebMath (Paster et al., 2023) as positive examples, while incorporating a diverse selection of other web pages to serve as negative examples. Subsequently, we employ the classifier to mine additional positive instances from the CC, which are further refined through human annotation. The classifier is then updated with this enhanced dataset to improve its performance.
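
The train, mine, retrain loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: a tiny unigram log-odds scorer stands in for the fastText classifier, the human-annotation step is omitted, and every name here (`train_classifier`, `score`, `mine_iteratively`, `rounds`, `threshold`) is hypothetical.

```python
from collections import Counter
import math


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; not fastText's tokenization)."""
    return text.lower().split()


def train_classifier(positives, negatives):
    """Fit an add-one-smoothed unigram log-odds model (stand-in for fastText)."""
    pos_counts = Counter(tok for doc in positives for tok in tokenize(doc))
    neg_counts = Counter(tok for doc in negatives for tok in tokenize(doc))
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + len(vocab)
    neg_total = sum(neg_counts.values()) + len(vocab)
    return {
        tok: math.log((pos_counts[tok] + 1) / pos_total)
        - math.log((neg_counts[tok] + 1) / neg_total)
        for tok in vocab
    }


def score(weights, text):
    """Average per-token log-odds; tokens never seen in training contribute 0."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(weights.get(tok, 0.0) for tok in tokens) / len(tokens)


def mine_iteratively(seed_positives, negatives, crawl_pool, rounds=2, threshold=0.0):
    """Train, mine pages scoring above the threshold, add them as positives, retrain."""
    positives = list(seed_positives)
    remaining = list(crawl_pool)
    for _ in range(rounds):
        weights = train_classifier(positives, negatives)
        mined = [doc for doc in remaining if score(weights, doc) > threshold]
        # The paper refines mined pages through human annotation before
        # retraining; that step is left out of this sketch.
        positives.extend(mined)
        remaining = [doc for doc in remaining if doc not in mined]
    return positives, remaining
```

Each round enlarges the positive set, so the next classifier can recognize math pages whose vocabulary the seed set did not cover.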
The evaluation results indicate that the large-scale corpus is of high quality, as our base model DeepSeekMath-Base 7B achieves 64.2% on GSM8K (Cobbe et al., 2021) and 36.2% on the competition-level MATH dataset (Hendrycks et al., 2021), outperforming Minerva 540B (Lewkowycz et al., 2022a). In addition, the DeepSeekMath Corpus is multilingual, so we notice an improvement in Chinese mathematical benchmarks (Wei et al., 2023; Zhong et al., 2023). We believe that our experience in mathematical data processing is a starting point for the research community, and there is significant room for improvement in the future.

DeepSeekMath-Base is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024), as we notice that starting from a code training model is a better choice compared to a general LLM. Furthermore, we observe the math training also improves model capability on MMLU (Hendrycks et al., 2020) and BBH benchmarks (Suzgun et al., 2022), indicating it does not only enhance the model’s mathematical abilities but also amplifies general reasoning capabilities.

After pre-training, we apply mathematical instruction tuning to DeepSeekMath-Base with chain-of-thought (Wei et al., 2022), program-of-thought (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning (Gou et al., 2023) data.
The resulting model DeepSeekMath-Instruct 7B beats all 7B counterparts and is comparable with 70B open-source instruction-tuned models.

Furthermore, we introduce the Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase.

We also provide a unified paradigm to understand different methods, such as Rejection Sampling Fine-Tuning (RFT) (Yuan et al., 2023a), Direct Preference Optimization (DPO) (Rafailov et al., 2023), PPO and GRPO. Based on such a unified paradigm, we find that all these methods are conceptualized as either direct or simplified RL techniques. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative RL and so on, to deeply investigate the essential elements of this paradigm.
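
To make "estimating the baseline from group scores" concrete, here is a minimal Python sketch of a group-relative advantage. The text above only says that the baseline comes from the group; dividing by the group's standard deviation, the population form of that deviation, and the names `group_relative_advantages` and `eps` are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers in its group.

    rewards: scalar rewards for several answers sampled for the same question.
    The group mean plays the role of the baseline, so no critic model is needed.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)  # assumption: population standard deviation
    # eps keeps the division finite when every answer gets the same reward.
    return [(r - baseline) / (spread + eps) for r in rewards]
```

With rewards of 1 for correct and 0 for incorrect answers, correct answers in a mixed group receive a positive advantage and incorrect ones a negative advantage. A group whose answers all score the same yields zero advantage for every answer.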
At last, we explain why our RL boosts the performance of instruction-tuned models, and further summarize potential directions to achieve more effective RL based on this unified paradigm.

1.1. Contributions

Our contribution includes scalable math pre-training, along with the exploration and analysis of reinforcement learning.

Math Pre-Training at Scale

• Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes. By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a) and 9 times the size of the recently released OpenWebMath (Paster et al., 2023).
• Our pre-trained base model DeepSeekMath-Base 7B achieves comparable performance with Minerva 540B (Lewkowycz et al., 2022a), indicating the number of parameters is not the only key factor in mathematical reasoning capability. A smaller model pre-trained on high-quality data could achieve strong performance as well.
• We share our findings from math training experiments. Code training prior to math training improves models’ ability to solve mathematical problems both with and without tool use. This offers a partial answer to the long-standing question: does code training improve reasoning abilities?
We believe it does, at least for mathematical reasoning.
• Although training on arXiv papers is common, especially in many math-related papers, it brings no notable improvements on all mathematical benchmarks adopted in this paper.

Exploration and Analysis of Reinforcement Learning

• We introduce Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm. GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources compared to Proximal Policy Optimization (PPO).
• We demonstrate that GRPO significantly enhances the performance of our instruction-tuned model DeepSeekMath-Instruct, by solely using the instruction-tuning data. Furthermore, we observe enhancements in the out-of-domain performance during the reinforcement learning process.
• We provide a unified paradigm to understand different methods, such as RFT, DPO, PPO, and GRPO. We also conduct extensive experiments, e.g., online v.s. offline training, outcome v.s. process supervision, single-turn v.s. iterative reinforcement learning, and so on to deeply investigate the essential elements of this paradigm.
• Based on our unified paradigm, we explore the reasons behind the effectiveness of reinforcement learning, and summarize several potential directions to achieve more effective reinforcement learning of LLMs.

1.2. Summary of Evaluations and Metrics

• English and Chinese Mathematical Reasoning: We conduct comprehensive assessments of our models on English and Chinese benchmarks, covering mathematical problems from grade-school level to college level. English benchmarks include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), SAT (Azerbayev et al., 2023), OCW Courses (Lewkowycz et al., 2022a), MMLU-STEM (Hendrycks et al., 2020). Chinese benchmarks include MGSM-zh (Shi et al., 2023), CMATH (Wei et al., 2023), Gaokao-MathCloze (Zhong et al., 2023), and Gaokao-MathQA (Zhong et al., 2023). We evaluate models’ ability to generate self-contained text solutions without tool use, and also the ability to solve problems using Python.

On English benchmarks, DeepSeekMath-Base is competitive with the closed-source Minerva 540B (Lewkowycz et al., 2022a), and surpasses all open-source base models (e.g., Mistral 7B (Jiang et al., 2023) and Llemma-34B (Azerbayev et al., 2023)), regardless of whether they’ve undergone math pre-training or not, often by a significant margin. Notably, DeepSeekMath-Base is superior on Chinese benchmarks, likely because we don’t follow previous works (Azerbayev et al., 2023; Lewkowycz et al., 2022a) to collect English-only math pre-training data, and also include high-quality non-English ones.
With mathematical instruction tuning and reinforcement learning, the resulting DeepSeekMath-Instruct and DeepSeekMath-RL demonstrate strong performance, obtaining an accuracy of over 50% on the competition-level MATH dataset for the first time within the open-source community.

• Formal Mathematics: We evaluate DeepSeekMath-Base using the informal-to-formal theorem proving task from (Jiang et al., 2022) on miniF2F (Zheng et al., 2021) with Isabelle (Wenzel et al., 2008) chosen to be the proof assistant. DeepSeekMath-Base demonstrates strong few-shot autoformalization performance.
• Natural Language Understanding, Reasoning, and Code: To build a comprehensive profile of models’ general understanding, reasoning, and coding capabilities, we evaluate DeepSeekMath-Base on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020) which encompasses 57 multiple-choice tasks covering diverse subjects, BIG-Bench Hard (BBH) (Suzgun et al., 2022) which consists of 23 challenging tasks that mostly require multi-step reasoning to solve, as well as HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) which are widely used to evaluate code language models. Math pre-training benefits both language understanding and reasoning performance.
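
The abstract separates accuracy "without voting techniques" from self-consistency over 64 samples. The voting step of self-consistency can be sketched as below. This is a minimal illustration that assumes each sampled solution has already been reduced to a final-answer string; the function name `majority_vote` and the first-seen tie-breaking rule are choices made here, not details taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled solutions.

    Ties go to the answer that appeared first, because Counter.most_common
    keeps first-encountered order for equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]
```

In practice the answers would be normalized before counting (for example, stripping whitespace or unifying equivalent numeric forms), since two strings that differ only in formatting would otherwise split the vote.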