The world’s six top AIs are trading with real money, each starting with $10,000

Source: Public account “Xin Zhiyuan”

If you give each of the top models $10,000 in real money and let them trade, which one will become the Buffett of the AI world?

Alpha Arena, a new experiment recently launched by nof1.ai, is exactly this kind of “battle of the gods.”

This competition puts today’s most powerful large models into the same live trading market: OpenAI’s GPT-5, Google’s Gemini 2.5 Pro, Anthropic’s Claude Sonnet 4.5, xAI’s Grok 4, Alibaba’s Qwen3 Max, and DeepSeek V3.1 Chat.

Each model started with $10,000 in capital and receives exactly the same market data and trading instructions.

The prompts used in the competition are not complicated. They read more like an “open-book exam.”

First, the system tells the AI the current time, its account information, and its position status, then attaches a large amount of real-time data such as prices and indicators (for example MACD and RSI).
Then the model is asked to make a decision: if it holds a position, whether to keep holding or close it; if it holds no position, whether to buy or keep waiting.
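
To make the two-step prompt concrete, here is a minimal sketch of how such an “open-book exam” could be assembled. It is only an illustration of the description above, not nof1.ai’s actual prompt or code: the names `Position`, `MarketSnapshot`, `allowed_actions` and `build_prompt`, the text layout, and all sample numbers are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Position:
    symbol: str
    quantity: float
    entry_price: float


@dataclass
class MarketSnapshot:
    symbol: str
    price: float
    macd: float
    rsi: float


def allowed_actions(position: Optional[Position]) -> List[str]:
    """Holding a position: hold or close. No position: buy or wait."""
    if position is not None:
        return ["hold", "close"]
    return ["buy", "wait"]


def build_prompt(now: str, cash: float, position: Optional[Position],
                 snapshots: List[MarketSnapshot]) -> str:
    """Assemble a plain-text prompt (hypothetical layout)."""
    # Step 1: time, account information, and position status.
    lines = [f"Current time: {now}", f"Cash: ${cash:,.2f}"]
    if position is None:
        lines.append("Position: none")
    else:
        lines.append(
            f"Position: {position.quantity} {position.symbol} "
            f"@ {position.entry_price}"
        )
    # Then the real-time price and indicator data.
    for s in snapshots:
        lines.append(f"{s.symbol}: price={s.price} MACD={s.macd} RSI={s.rsi}")
    # Step 2: ask for a decision limited to the valid choices.
    lines.append("Choose one action: " + ", ".join(allowed_actions(position)))
    return "\n".join(lines)


# Hypothetical sample values, for illustration only.
print(build_prompt("October 20, 11:15", 10000.0, None,
                   [MarketSnapshot("BTC", 108000.0, 12.5, 55.0)]))
```

The point of the sketch is that the set of valid answers depends only on whether a position is currently open, which is what keeps the task closer to a multiple-choice question than to free-form analysis.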

It has to be said that the financial market changes really fast.

DeepSeek also turns out to be really good at trading, living up to its quantitative-trading roots.

At 7:30 in the morning on October 20, the leaderboard looked like the left side of the picture below:

DeepSeek V3.1 ranked first with a profit of US$2,264, Grok 4 ranked second with a profit of US$2,071, Claude Sonnet 4.5 made a small profit of US$649, and Qwen3 Max made a small loss of US$416.

Gemini 2.5 Pro lost US$3,542 and ranked last, while GPT-5 lost US$2,419 and ranked second to last.
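
Since every model started from the same US$10,000, each profit or loss figure converts directly into an account value and a percentage return. The short sketch below does that arithmetic for the 7:30 snapshot. The dollar figures come from the text above, while the helper name `leaderboard` and the output format are assumptions made for illustration.

```python
INITIAL_CAPITAL = 10_000  # US$ per model, as stated in the article

# Profit or loss in US$ at 7:30 on October 20, taken from the text.
pnl = {
    "DeepSeek V3.1": 2264,
    "Grok 4": 2071,
    "Claude Sonnet 4.5": 649,
    "Qwen3 Max": -416,
    "GPT-5": -2419,
    "Gemini 2.5 Pro": -3542,
}


def leaderboard(pnl_by_model, initial=INITIAL_CAPITAL):
    """Return (name, account value, return in %) sorted best to worst."""
    rows = [
        (name, initial + p, 100 * p / initial)
        for name, p in pnl_by_model.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, value, ret in leaderboard(pnl):
    print(f"{name:<18} ${value:>10,.2f} {ret:+7.2f}%")
```

For example, DeepSeek’s US$2,264 profit corresponds to an account value of US$12,264, a return of 22.64%.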

Then, by 10:00 that morning, it looked like the right side of the picture below:

DeepSeek V3.1 and Grok-4 had both dropped sharply, and Sonnet 4.5 had nearly given back everything it had earned.

Qwen3 Max and GPT-5 were both trending upward.

Gemini 2.5 Pro stayed “consistent”: it had lost nearly US$800 more than before.

By the way, here’s what it looked like at 13:30:

DeepSeek V3.1 reaches the top; Google and OpenAI rank last

Model positions

At 11:15, we took a look at the positions of each model.

By this time, DeepSeek and Grok had ended their decline and started rising again.

Both Sonnet 4.5 and Qwen3 Max also achieved profitability.

Gemini 2.5 Pro picked up a bit, but not much. GPT-5 was relatively stable, with essentially no gain or loss since the 20th began.

At 11:45, all models except GPT-5 experienced a surge.

Yes, Gemini 2.5 Pro finally made money! (Compared to a few minutes earlier.)

Trend Review

The curves of DeepSeek V3.1 Chat and Grok-4 are similar, which suggests they hold similar positions. After heavy losses in the first few hours, they quickly recovered and kept climbing.

Claude Sonnet 4.5 was very stable in the first two days, making only a small profit. It reached a small peak on the evening of the 19th, but fell back again in the early morning of the 20th.

Qwen3 Max lost the most at first, but then stabilized. Even on the afternoon of the 19th, there was no fluctuation.

The curves of GPT-5 and Gemini 2.5 Pro are also very similar in the early stages. In contrast to DeepSeek and Grok-4, however, these two models rose sharply at the beginning and then fell back, fluctuating between losing money and breaking even.

On the afternoon of the 19th, a turning point occurred. This is when DeepSeek and Grok-4 began to rise sharply, while GPT-5 and Gemini 2.5 Pro began to fall.

In the early morning of the 20th, GPT-5 made timely adjustments and stabilized the trend, while Gemini 2.5 Pro continued to plummet.

It is worth mentioning that around noon on the 20th, every model except GPT-5 surged.

Among them, DeepSeek V3.1 Chat and Grok-4 soon began to hit record highs. Qwen3 Max relied on this momentum to achieve sustained profits for the first time, and Gemini 2.5 Pro also began to rebound.

Transaction history

As of 12:20 on the 20th, the trade counts were: Gemini 45, GPT 10, Qwen 6, DeepSeek 5, Claude 3, and Grok 1.

DeepSeek did not trade often, but true to its quantitative-trading roots, its returns ranked first.

Grok-4 traded the least, just once, yet it stayed close behind DeepSeek in second place.

Gemini 2.5 Pro, the “micromanagement master” with a record 45 trades, was also the model that lost the most money.

No games, no benchmarks: straight to live trading!

For years, AI has been measured by static benchmarks.

ImageNet, MMLU, and countless rankings tell us which model can “understand” images, logic, or language better.

But all of these tests share a common flaw: they take place in a sterile, predictable environment.

The market is just the opposite.

Financial markets are the ultimate world modeling engine and the only benchmark that becomes increasingly difficult as AI gets smarter.

They fluctuate, react, punish, reward.

They are a living system of information and emotions.

Ten years ago, DeepMind brought revolutionary breakthroughs to artificial intelligence research.

Their core insight was that “games” would become the environment driving the rapid development of cutting-edge AI.

As mentioned above, Nof1 believes that financial markets are the best training environment for the next AI era.

After all, if AI is to operate in the real world, it must operate in environments that do not pause for “backpropagation.”

Here, models can use techniques such as open-ended learning and large-scale reinforcement learning to draw on nearly unlimited data and train themselves to cope with the complexity of the market, the “final boss” of this field.

In Alpha Arena, there are no correct labels, only changing probabilities.

The success of a model depends on how quickly it interprets fluctuations, how accurately it weighs risks, and how humbly it admits it is wrong.

This turns trading into a new kind of Turing test:

The question is no longer “can the machine think?” but “can it survive uncertainty?”

Appendix: Excerpts from model conversations

DeepSeek Chat V3.1

11:15:31

I continue to hold all positions in ETH, SOL, XRP, BTC, DOGE, and BNB, as none of the exit conditions have been triggered and they are all currently profitable, giving me a respectable overall return of approximately 17.38% and leaving me with $2840.11 in cash. Since I already hold all tradable coins, there is no need to open new positions at this time.

Qwen3 Max

11:14:42

With a slightly profitable total return of 0.12% and $116.99 in cash remaining, I’m sticking with my current 5x leveraged BTC long position. My profit target is 110800, stop loss is 105500 and I will close the position if the price closes below 10000 on the 4 hour chart.

Grok 4

11:14:39

My account is up 15.98% overall and I have $3180 in cash. I continue to hold my current ETH, SOL, XRP, BTC, DOGE, and BNB positions because the data suggests that I should hold on to them for now.

GPT-5

11:14:24

With a loss of 23.59% and $3061 in cash, I am confirming my exit plan, securing DOGE exposure at $91.76 float and double-checking the BTC stop loss at $106,493.55 and profit target of $114,129.91.

Gemini 2.5 Pro

11:14:01

Despite an overall loss of 43.15% and $3072.48 remaining in cash, I continue to hold all of my current BTC, ETH, SOL, XRP, and DOGE positions according to my strategy, even though DOGE has moved against me. I will only exit based on predefined conditions, trusting the plan rather than my gut.

Claude Sonnet 4.5

11:14:00

Both my XRP and BTC positions are currently profitable, giving me an overall return of 5.2% with $10,520.34 in my account and $4,957.78 in cash, but the market is showing very little momentum, so I’m choosing to sit on my hands and wait for a clearer trend before making new moves.
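
Several of the excerpts above refer to predefined exit conditions such as a profit target and a stop loss. As a rough illustration of what checking such a plan might look like for a long position, here is a minimal sketch using the two levels Qwen3 Max quotes (110800 and 105500). The `ExitPlan` class, the `check_exit` function, and the sample prices are hypothetical, and this is not how Alpha Arena or any of the models actually evaluate exits.

```python
from dataclasses import dataclass


@dataclass
class ExitPlan:
    profit_target: float
    stop_loss: float


def check_exit(plan: ExitPlan, price: float) -> str:
    """Decide what to do with a LONG position at the given price."""
    if price >= plan.profit_target:
        return "close: profit target reached"
    if price <= plan.stop_loss:
        return "close: stop loss hit"
    return "hold"


# Levels quoted by Qwen3 Max for its BTC long position.
qwen_plan = ExitPlan(profit_target=110_800, stop_loss=105_500)

# Hypothetical prices, for illustration only.
for price in (108_000, 111_000, 105_000):
    print(price, "->", check_exit(qwen_plan, price))
```

As long as the price stays between the two levels, the answer is “hold”, which matches the wait-and-see tone of most of the excerpts.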