Under pressure from Chinese AI companies, OpenAI has had to disclose some of the secrets of its O-series reinforcement learning. Today (February 12), OpenAI released a research paper on applying reasoning models to competitive programming, "Competitive Programming with Large Reasoning Models", which presents the results of three OpenAI reasoning models, o1, o1-ioi, and o3, at the IOI (International Olympiad in Informatics) and on Codeforces (a world-renowned online programming competition platform).

The paper shows that at IOI 2024, o3 scored 395.64 points under the strict competition rules, enough for a gold medal, and that its performance on Codeforces is comparable to that of elite human competitors. The paper specifically notes that China's DeepSeek-R1 and Kimi k1.5 have independently shown that chain-of-thought (CoT) reasoning can significantly improve a model's overall performance on math problem solving and programming challenges. R1 and k1.5 are new reasoning models released on the same day, January 20, by DeepSeek and Kimi respectively.
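
The chain-of-thought idea itself is simple to illustrate: the model is prompted to write out intermediate reasoning steps before committing to a final answer. Below is a minimal sketch of CoT-style prompting, assuming access to an OpenAI-compatible chat API via the official Python SDK; the model name is a placeholder, and this is not the training recipe used by R1, k1.5, or the o-series.

```python
# Minimal sketch of chain-of-thought (CoT) prompting.
# Assumes the official openai Python SDK (>=1.0); the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name, not one of the models in the paper
    messages=[
        {
            "role": "user",
            "content": (
                "A train travels 120 km in 1.5 hours. "
                "What is its average speed in km/h? "
                "Think step by step before giving the final answer."
            ),
        }
    ],
)

# The reply contains the intermediate reasoning followed by the final answer.
print(response.choices[0].message.content)
```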

The paper compares general-purpose reasoning models with systems optimized for a specific domain in competitive programming, examining how reinforcement learning (RL) improves the performance of large language models on complex coding and reasoning tasks. The results show that scaling up both RL training compute and test-time compute significantly improves model performance, bringing it close to that of the world's top human players. These models are expected to unlock new AI applications in science, coding, mathematics, and other fields.
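
To make the "test-time computation" notion concrete, one simple form it can take is best-of-n selection: sample several candidate programs and spend extra compute checking them against a problem's public example tests before choosing one to submit. The sketch below is a hypothetical illustration of that idea in Python; the function names and the use of local solution files are assumptions for the example, not the actual pipeline described in the paper.

```python
# Hypothetical sketch of best-of-n test-time selection: run each candidate
# program on the public example tests and keep the one that passes the most.
import subprocess


def run_candidate(source_path: str, test_input: str, timeout: float = 2.0) -> str:
    """Run one candidate Python solution on a single test input and return its stdout."""
    result = subprocess.run(
        ["python3", source_path],
        input=test_input,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()


def score_candidate(source_path: str, tests: list[tuple[str, str]]) -> int:
    """Count how many (input, expected_output) example tests the candidate passes."""
    passed = 0
    for test_input, expected in tests:
        try:
            if run_candidate(source_path, test_input) == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            continue  # a timed-out run counts as a failed test
    return passed


def select_best(candidate_paths: list[str], tests: list[tuple[str, str]]) -> str:
    """Return the path of the candidate that passes the most example tests."""
    return max(candidate_paths, key=lambda path: score_candidate(path, tests))
```

Spending more compute here simply means sampling more candidates and running more checks before submitting, which is one intuition behind the test-time scaling results reported in the paper.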

Original paper address: https://arxiv.org/abs/2502.06807


Competitive Programming with Large Reasoning Models

We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2502.06807 [cs.LG] (or arXiv:2502.06807v2 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2502.06807

Submission history

From: Ahmed El-Kishky
[v1] Mon, 3 Feb 2025 23:00:15 UTC (493 KB)
[v2] Tue, 18 Feb 2025 22:21:40 UTC (493 KB)
