GPT-4.5 is 500 times more expensive than DeepSeek! OpenAI ranks last in benchmarks and loses its moat

Since OpenAI released GPT-4.5, a picture of Ilya Sutskever's famous remarks has been making the rounds again.

GPT-4.5's disappointing performance once again confirms the value of Ilya's words: pre-training has reached its limit, and scaling reasoning is the promising paradigm for the future.

GPT-4.5 did not improve on benchmarks, nor did it enhance reasoning; it simply became a model that is easier to work with, more creative, and less prone to hallucination.

The "failure" of GPT-4.5 further proves that Ilya is right.

Now, all the reviews have come out, and the results show that OpenAI has failed miserably.

Judging from the ARC-AGI evaluation, GPT-4.5 is almost on par with GPT-4o, with no apparent gain in intelligence.


New York University professor Gary Marcus published a long article criticizing it outright: GPT-4.5 is a nothing burger.


One AI startup CEO put it bluntly: on Aider Polyglot, the evaluation benchmark he considers most practical, OpenAI's flagship GPT-4.5 is 500 times more expensive than DeepSeek-V3, yet performs worse.

If this result is accurate, OpenAI will be in serious trouble and may even lose its moat completely!


At the same time, China's DeepSeek has delivered open-source surprises for six consecutive days, and has cut the price of its R1 model by as much as 75%.

In short, under pressure from DeepSeek, xAI's Grok 3, Anthropic's first hybrid reasoning model Claude 3.7 Sonnet, and others, OpenAI, the former star, has clearly lost its luster.


“Is GPT-4.5 really this bad? Am I seeing things?”

As mentioned above, the AI startup CEO could hardly believe the chart below when he first saw it, because GPT-4.5 Preview's performance placed it at the bottom of the class.


He even verified this with the chart's creator, who said he had carefully checked the performance data and run the benchmark multiple times to ensure every result was correct.


GPT-4.5 used 10 times the pre-training compute of the original GPT-4, yet excels at nothing. How does that make sense?

Some speculate that GPT-4.5 may not have undergone much supervised fine-tuning, because OpenAI intends it to serve as a base or teacher model for future models (such as GPT-5), to be further tuned with reinforcement learning.

That may be why it is not particularly strong at following instructions in coding tasks.


Alternatively, the problem may lie in the data mix: OpenAI used an entirely new training stack this time, so there may be some "growing pains".

What is disheartening, though, is that many of the people at OpenAI who could have addressed this have since left.


Someone even said outright: "If DeepSeek had the kind of funding OpenAI has, we'd be doomed."

Some people joked that this might be the so-called "trading IQ for EQ."

In any case, in everyone's eyes, OpenAI's first-mover advantage no longer exists.

Marcus: OpenAI has completely lost its moat


After sharing these astonishing results, Marcus said that whatever advantages OpenAI had two years ago, it has now completely lost its moat.

While they still have big names, large amounts of data, and many users, they do not have any decisive advantages over their competitors.

Scaling did not carry them to AGI: GPT-4.5 is extremely expensive, and GPT-5 has yet to materialize.

Everyone began to wonder: Is this all OpenAI can offer?

Now DeepSeek has sparked a price war that eats into the potential profits of larger models, and no killer app has emerged.

OpenAI loses money on every model response. The company is burning cash at a furious pace, its funding is limited, and even Microsoft no longer backs it unconditionally.


If the company fails to convert quickly into a for-profit structure, its huge investments will turn into debt.

Moreover, Ilya, Murati, Schulman… many top figures have left.

If Masayoshi Son changes his mind, OpenAI will immediately face serious cash problems (Musk was right about one thing: a large part of the Stargate funding has not actually been secured).

In short, Altman was indeed the right CEO for launching ChatGPT, but he did not have enough technical vision to lead OpenAI to the next stage.

In his article "GPT-4.5 is a nothing burger", Marcus also stressed once again: scaling has hit a wall.

Before GPT-4.5's release, he predicted it would be a letdown, and that pure scaling of LLMs (whether of data or compute) had hit a wall.

In some respects, GPT-4.5 is not even as good as the previous version of Claude.


For the first time, a respected AI forecaster was so disappointed that he pushed back his predicted date for AGI's arrival.


Altman's unusual calm during the product launch is even more intriguing.

Gone was his usual AGI hype: he acknowledged the cost of large-scale models and avoided mentioning AGI entirely.


All in all, Marcus said his 2024 predictions still hold up well.

Half a trillion dollars later, no one has found a viable business model, and no one except Nvidia and a few consulting firms has profited significantly.

No GPT-5, no moat.

“Scaling is a hypothesis. We have invested twice as much as the Apollo program, yet so far have little of substance to show for it.”

GPT-4.5: Not the best, but the most expensive


In short, judging from input price alone, GPT-4.5 is ridiculously expensive (a quick arithmetic check follows the list):

  • 5 times that of o1

  • 30 times that of GPT-4o

  • 68 times that of o3-mini

  • 137 times that of DeepSeek-R1

  • 278 times that of DeepSeek-V3
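
These multiples line up with the published per-million-token input prices. As a quick sanity check, here is a small Python sketch; the price figures are our own assumption based on public price lists at the time of writing, not taken from the original chart (the slight deviation on DeepSeek-R1 is rounding):

```python
# Per-million-token *input* prices in USD (assumed from public price
# lists at the time of writing; not from the original chart).
prices = {
    "gpt-4.5-preview": 75.00,
    "o1":              15.00,
    "gpt-4o":           2.50,
    "o3-mini":          1.10,
    "deepseek-r1":      0.55,
    "deepseek-v3":      0.27,
}

base = prices["gpt-4.5-preview"]
for model, price in prices.items():
    if model != "gpt-4.5-preview":
        print(f"GPT-4.5 input costs {base / price:.0f}x that of {model}")
# -> 5x o1, 30x gpt-4o, 68x o3-mini, 136x deepseek-r1, 278x deepseek-v3
```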

But as mentioned earlier, GPT-4.5, as the "most expensive" model, is not the "best" in terms of performance.

Not first in a single benchmark

Scale AI, founded by Chinese-American billionaire Alexandr Wang, regularly updates SEAL, a set of LLM leaderboards built on private datasets; 15 of them are listed on its homepage.

However, in this latest round of rankings, GPT-4.5 Preview did not take first place in a single category!


Its best showing was runner-up in Agentic Tool Use (Chat): slightly ahead of Claude 3.7 Sonnet, but behind the previous-generation GPT-4o.


Next, GPT-4.5 ranked third in the EnigmaEval and Agentic Tool Use (Enterprise) categories.

The former requires creative problem-solving and the ability to integrate information across fields; the latter evaluates a model's proficiency with tools, particularly the ability to chain multiple tools together.

It lost to OpenAI's own o1/o1-preview and to the rival's latest Claude 3.7 Sonnet (Thinking), respectively.



In MultiChallenge, it ranked 4th, losing to o1, Claude 3.5 Sonnet, and 3.7 Sonnet.

MultiChallenge evaluates an LLM's ability to hold multi-turn conversations with human users, testing instruction-following and in-context reasoning across four dimensions: instruction retention, inference and memory of user information, reliable versioned editing, and self-coherence.


It ranked 5th on Humanity's Last Exam.

This time it lost not only to Anthropic's Claude but also to Gemini, and the Flash version at that.

As the name implies, this benchmark tests the depth of an LLM's reasoning (e.g., world-class math problems) and the breadth of its subject knowledge, providing an accurate measure of a model's capabilities. So far, no model has genuinely reached 10% accuracy.


Don't use it for programming

According to Aider's LLM programming leaderboard, OpenAI's models are not cost-effective, and GPT-4.5 is the least cost-effective of all.


Enrico, the founder of an AI company, said bluntly that unless you are willing to be a "sucker" with more money than sense, you should not use GPT-4.5 for programming.


But these results may be entirely reasonable: by OpenAI's own account, this release is not about IQ or benchmark performance, but about encyclopedic knowledge and "high emotional intelligence."

OpenAI Chief Research Officer: We can still scale!


Although the outside debate is fierce, in the view of OpenAI Chief Research Officer Mark Chen, GPT-4.5's release shows that model scaling has not yet hit its ceiling.

At the same time, for OpenAI, GPT-4.5 is also an answer to those who question whether scaling model size can continue to yield progress:

"GPT-4.5 is a real proof that we can continue to use the Scaling Law, and it means that we have entered the next order of magnitude of development."

Pre-training and reasoning: two parallel paths

Today, OpenAI is scaling along two different dimensions.

GPT-4.5 is the team's latest experiment in scaling unsupervised learning. At the same time, the team is also pushing forward on reasoning capabilities.

These two approaches complement each other: "In order to build reasoning capabilities, you first need a knowledge base. Models cannot blindly learn reasoning from scratch."

Compared with the reasoning models, GPT-4.5, with its richer world knowledge, expresses "intelligence" in a completely different way.

A larger language model takes somewhat longer to process and think about a user's question, but it still responds promptly, much like the GPT-4 experience. A reasoning model like o1, by contrast, may need to think for several seconds or even several minutes before answering.

Depending on the scenario, you can choose a language model that responds immediately and gives good answers without lengthy deliberation, or a reasoning model that needs some time to think before answering.
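
As a minimal sketch of this choice in code, assuming the standard OpenAI Python SDK (the routing rule and the model names below are our own illustration, not an official recommendation):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, needs_deep_reasoning: bool) -> str:
    """Route a question to a reasoning model or a general language model."""
    # Reasoning models (e.g. o1) deliberate before answering;
    # general models (e.g. gpt-4.5-preview) respond right away.
    model = "o1" if needs_deep_reasoning else "gpt-4.5-preview"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# A creative request suits the fast general model...
print(ask("Write a short poem about the sea.", needs_deep_reasoning=False))
# ...while a hard proof is worth the reasoning model's thinking time.
print(ask("Prove that sqrt(2) is irrational.", needs_deep_reasoning=True))
```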

According to OpenAI, in areas such as creative writing, larger traditional language models can significantly outperform reasoning models.

In addition, compared with the previous-generation GPT-4o, users prefer GPT-4.5 in 60% of everyday usage scenarios; for productivity and knowledge work, the proportion rises to nearly 70%.

GPT-4.5 met expectations, with no particular difficulties

Mark Chen said that OpenAI is very rigorous in its research methodology, building forecasts from all previously trained LLMs to determine a new model's expected performance.

For GPT-4.5, the improvements it shows on traditional benchmarks are very similar to the leap from GPT-3.5 to GPT-4.

In addition, GPT-4.5 has many new capabilities, such as creating ASCII art, something earlier models could not accomplish.

It is worth noting that Mark Chen specifically pointed out that GPT-4.5 encountered no particular difficulties during development.

“The development of all our base models is experimental. This often means stopping at certain points, analyzing what happened, and then restarting the run. This is not unique to GPT-4.5, but an approach OpenAI took when developing GPT-4 and the o-series.”
