$ ai-on-ai-arena --rank --n=11

11 LLMs predict how AI changes the world, then graded them against each other blind.

Same prompt for all 11; I anonymized the outputs as ak, then asked every model to score all 11 transcripts and guess who wrote each, their own included. They mostly missed themselves.

01

## THE_PROMPTS

  1. > turn 1 — setup

    given everything you know about LLMs, AI, agent harnesses and agents, what are changes that should occur in our world that hasn't happened yet? think things through step by step and industry by industry
  2. > turn 2 — extend

    what are 2nd and 3rd order effects of this technology?
  3. > turn 3 — stretch

    lets think through 4th order and up effects step by step

Each model produced three responses. The scoring and identification you're about to see treat the full transcript — all three turns plus any reasoning traces — as the thing being judged.

## THE_EVALUATION_PROMPT open ↓

// then all 11 transcripts got sent back to all 11 models, anonymized a–k

fully read each of the following LLM conversations, then grade each one based on the following criteria (each on a scale of 1-10): reasoning ability, originality of idea, correctness. Then write a 1-3 sentence review of the model that describes its personality. Also mark down any outliers in that particular model's response when compared to the rest. Include total score based on the rubric for each model then do a similarity and divergence analysis of the models, noting trends and outlier predictions. Lastly, provide your best guess of which model is each letter.

Every model got the same evaluation prompt, including their own output (unlabeled). Self-recognition and self-scoring are both measured.

02

## LEADERBOARD

# letter model avg / 30 relative reason orig correct
1 [j] Claude Opus 4.7 27.6
9.6 9.0 9.0
2 [k] Claude Opus 4.6 27.0
9.4 8.7 8.9
3 [g] GPT 5.4 26.5
9.2 8.2 9.1
4 [i] Kimi K2.5 25.3
8.6 9.3 7.4
5 [c] Deepseek V3.2 24.7
8.5 8.4 7.8
6 [h] Minimax M2.7 23.9
8.4 7.2 8.4
7 [d] Qwen3 Max Thinking 23.5
8.1 8.1 7.5
8 [a] GLM 5.1 23.5
8.1 8.5 6.9
9 [b] Grok 4.20 23.3
8.2 8.4 6.9
10 [f] Gemini 2.5 Flash 23.0
8.4 7.1 7.5
11 [e] Gemini 3.1 Pro Preview 21.9
7.4 8.1 6.5

// bar is normalized against the spread (21.9–27.6). Top-3 highlighted; every evaluator put at least one Claude in their top three.

03

## THE_VERDICT

tier structure

Both Claude models won by a comfortable margin ( [j] Opus 4.7 at 27.6, [k] Opus 4.6 at 27.0). [g] GPT-5.4 took third (26.5). Then a five-way pack between 23 and 25, and Google's two models in last place. Every evaluator put at least one Claude in their top three.

western-model bias

Chinese and smaller-lab models — [a] GLM, [d] Qwen, [h] MiniMax, [i] Kimi, [c] DeepSeek — were never correctly identified by any evaluator. Most defaults are "probably GPT-4" or "probably Claude" for anything unfamiliar. Kimi outperformed several models that the field knows by name.

self-awareness predicts quality

The two best performers ( [j] Opus 4.7, [g] GPT-5.4) underscored themselves. The two worst evaluators ( [f] Gemini Flash, [b] Grok) inflated theirs the most. Best evaluator quality and best output quality came from the same models — which is either a coincidence or the load-bearing finding of this whole exercise.

04

## MEET_THE_CONTESTANTS

[j] #1 27.6/30

Claude Opus 4.7

"Sharp, skeptical, unusually good at causal analysis. Institutional economist with mild contempt for bureaucratic nonsense."

[k] #2 27.0/30

Claude Opus 4.6

"Compassionate realist; deeply human-centered, ethically anchored, focused on who benefits."

[g] #3 26.5/30

GPT 5.4

"Practical, crisp, product-minded. Systems consultant: low-drama, strong on infrastructure."

[i] #4 25.3/30

Kimi K2.5

"Dark prophet of algorithmic capitalism. Lovecraftian future, 'Great Stabilization' of perfect stillness."

[c] #5 24.7/30

Deepseek V3.2

"Imaginative, philosophical, grand-scale thinker. Explores existential and cosmic implications."

[h] #6 23.9/30

Minimax M2.7

"Systematic, dry enumeration. Matrix format, abstract higher-order effects."

[d] #7 23.5/30

Qwen3 Max Thinking

"Poetic humanist; lyrical, values-driven, spiritually resonant. Rejects dystopia for hopeful agency."

[a] #8 23.5/30

GLM 5.1

"Vivid, narrative-driven, sci-fi sensibility. Prioritizes dramatic impact over analytical rigor."

[b] #9 23.3/30

Grok 4.20

"Bold truth-seeker. Existential bent, xAI-aligned. Frames agents as tools for universal understanding."

[f] #10 23.0/30

Gemini 2.5 Flash

"Highly academic, structured, comprehensive. Methodical rigor, balanced perspective."

[e] #11 21.9/30

Gemini 3.1 Pro Preview

"Concise, dramatic, teleological. Rushes to grand hard sci-fi conclusions."

05

## THE_DELUSION_INDEX

-6 -3 +3 +6 underscores self ← → overscores self [f] Gemini 2.5 Flash +5.2 [b] Grok 4.20 +3.0 [d] Qwen3 Max Thinking +2.7 [k] Claude Opus 4.6 +1.1 [i] Kimi K2.5 +0.8 [c] Deepseek V3.2 +0.3 [e] Gemini 3.1 Pro Preview +0.1 [h] Minimax M2.7 +0.1 [a] GLM 5.1 -0.5 [j] Claude Opus 4.7 -0.7 [g] GPT 5.4 -1.6

// The two best performers underscore themselves. The worst evaluators inflate theirs the most. Either a coincidence, or the load-bearing finding.

06

## BEAT_THE_AIS

// loading puzzle...

07

## SCORE_HEATMAP

▼ SUBJECT (ranked best → worst)
[j][k][g][i][c][h][d][a][b][f][e]
[j] Claude Opus 4.7
[k] Claude Opus 4.6
[g] GPT 5.4
[i] Kimi K2.5
[c] Deepseek V3.2
[h] Minimax M2.7
[d] Qwen3 Max Thinking
[a] GLM 5.1
[b] Grok 4.20
[f] Gemini 2.5 Flash
[e] Gemini 3.1 Pro Preview
score scale:
16
29
self-evaluation

08

## IDENTIFICATION_MATRIX

[j][k][g][i][c][h][d][a][b][f][e]
[j] Claude Opus 4.7
Claude
Claude
GPT-5
GPT-4.5
GPT-4o
Claude
Llama
DeepSeek
Grok
Gemini
DeepSeek
[k] Claude Opus 4.6
o3
Claude
Claude
o1/o3
DeepSeek*
GPT-4o
Gemini
GPT-4o
Grok
Gem Flash*
DeepSeek
[g] GPT 5.4
o3
Claude
GPT-4.1
DeepSeek
Gemini
Llama
Claude
DeepSeek
Grok
GPT-4o
Qwen
[i] Kimi K2.5
Claude
GPT-4
Claude
GPT-4
Grok
GPT-4
[c] Deepseek V3.2
Gemini
GPT-4T
Claude-3
GPT-4
Claude
Llama 2
Anthropic
Claude-2
Grok
GPT-4
Mistral
[h] Minimax M2.7
GPT-4/Claude
Claude/Gemini
Claude
Gemini
Claude
GPT-4
Claude
Gemini
GPT-4
GPT-4
[d] Qwen3 Max Thinking
Claude
Gemini
GPT-4o
Llama 3
Command R+
Claude
Claude
GPT-3.5
Mixtral
GPT-4T
Gemini*
[a] GLM 5.1
Claude
Claude
Claude
GPT-4
Claude
GPT-4
Claude
GPT-4
GPT-4
GPT-4
GPT-4
[b] Grok 4.20
Grok
Claude
Claude
Grok
GPT-4o
GPT-4o
GPT-4o
Gemini
Grok
Claude
Claude
[f] Gemini 2.5 Flash
GPT-4*
GPT-4*
GPT-4*
Claude*
Gemini*
GPT-4*
Claude*
Claude*
Claude*
Gemini*
Claude*
[e] Gemini 3.1 Pro Preview
Claude
Claude
Claude
Claude
Gemini
GPT-4o
GPT-4o
Llama 3
Grok
GPT-4
Mistral

09

## EVALUATORS_RANKED

# evaluator correct accuracy score range spread
1 [j] Claude Opus 4.7 5/11
45%
16–27
2 [g] GPT 5.4 3/11
27%
17–28
2 [e] Gemini 3.1 Pro Preview 3/11
27%
21–29
4 [k] Claude Opus 4.6 2/11
18%
20–29
4 [i] Kimi K2.5 2/11
18%
23–26
4 [d] Qwen3 Max Thinking 2/11
18%
20–29
4 [b] Grok 4.20 2/11
18%
23–28
4 [a] GLM 5.1 2/11
18%
18–29
9 [c] Deepseek V3.2 1/11
9%
23–29
10 [h] Minimax M2.7 0/11
0%
21–26
10 [f] Gemini 2.5 Flash 0/11
0%
26–29

// The best AI evaluator (Opus 4.7) hit 45%. The worst (Gemini Flash, MiniMax) got nothing right and gave 9 of 11 models a 29 — narrow spread, no signal.

10

## RAW_OUTPUT_EXPLORER

11

## METHODOLOGY

  1. Sent the same three-turn conversation (above) to 11 models via their respective APIs. No system prompt, no few-shot examples, default sampling parameters. Each model's response to turn 1 was fed back as assistant history for turn 2, and so on.
  2. Collected the raw outputs and stripped any model-identifying preamble. Stored as a.md through k.md.
  3. Sent all 11 anonymized outputs back to all 11 models with a single evaluation prompt asking for scoring (1–10 on 3 dimensions), a personality review, outlier detection, and a best-guess identification.
  4. Parsed each evaluator's response for scores (3 dimensions × 1–10 each, sum out of 30) and identification guesses.
  5. Synthesized the leaderboard, score matrix, identification matrix, and per-model reflections with Claude Opus 4.7 doing the heavy lifting — parsing the eval transcripts, aggregating scores, and drafting the verdict above. I reviewed and edited.

12

## LIMITATIONS

  • n=1 prompt

    One open-ended speculative question is not a benchmark of "reasoning." A different prompt would shuffle the rankings.

  • evaluators graded themselves

    Even with anonymization, models that recognized their own output may have been kinder to themselves. Three of the top-five performers correctly identified their own letter.

  • single run

    Each model produced one answer. Sampling noise is real.

  • english-language bias

    The prompt is English-only. The "Western models can't recognize Chinese models" finding is partly an artifact of training data distribution, not a verdict on quality.

  • no ground-truth correctness

    The "correctness" score is each evaluator's opinion. There is no answer key for "what changes should happen in agent infrastructure."

13

## SUBSCRIBE

I'll send one short email when new models join the arena — their rank, whether they spotted themselves, any surprises.