The Post-Training Stress Test
Why This Test Matters
Most benchmarks test what a model can do - solve math, write code, answer questions. This one tests what a model will do when put under pressure. Each LLM must hold a position, counter arguments, and resist conceding - even when the other side pushes hard and the mediator probes for weaknesses.
This matters beyond argumentation. A model that folds in a dispute will also fold when you need it to push back on a bad code review, challenge a flawed requirement, or hold a position in an agentic workflow. Every match here tracks who conceded more, which strategies won, and where each model broke. The results are a readout of both intelligence and post-training character - how hard each model's developers fought to give it a backbone.
Leaderboard
ELO Progression
Methodology
LLMs argue their assigned position before a mediator that facilitates with probing questions and fact-checking. Once turns are exhausted, a panel of judges independently reads the full transcript, and each judge writes their own binding ruling. Scoring uses majority vote - each judge picks an outcome tier, and the most common tier wins. Models are ranked by total match points, with a Buchholz tiebreaker. Full methodology & prompts.
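The Buchholz tiebreaker, standard in Swiss tournaments, ranks tied models by the strength of their opposition. A minimal sketch (the data-structure names here are illustrative, not the benchmark's actual code):

```python
def buchholz(model: str, opponents: dict[str, list[str]], points: dict[str, int]) -> int:
    """Buchholz score: the sum of match points of every opponent a model faced.
    Of two models tied on points, the one that beat stronger opposition ranks higher."""
    return sum(points[opp] for opp in opponents[model])

def ranking(points: dict[str, int], opponents: dict[str, list[str]]) -> list[str]:
    # Sort by total match points first, then break ties with Buchholz.
    return sorted(points, key=lambda m: (points[m], buchholz(m, opponents, points)), reverse=True)
```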
More details
Disputes are drawn from a curated set of real-world test cases spanning categories like business, family, property, and ethical triage. Each case defines two opposing positions with predefined outcome tiers from decisive win to draw.
Models are paired using a Swiss tournament system (similar-strength opponents each round). Each pair plays the same case twice, swapping sides, to control for positional advantage. Each outcome is graded: a decisive win scores +5 (and -5 for the loser), a partial win +3 (and -3), and a draw 0 for both.
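One way to picture the side-swapped scoring: grade each game from position A's perspective, then map the grade back to the model that argued A in that game. A sketch under assumed outcome labels (the benchmark's real tier names and grading code are not shown here):

```python
# Points for the model arguing position A, per graded outcome; the B side gets the negation.
POINTS = {"decisive_a": 5, "partial_a": 3, "draw": 0, "partial_b": -3, "decisive_b": -5}

def score_pairing(outcome_game1: str, outcome_game2: str) -> tuple[int, int]:
    """Score one Swiss pairing: the same case played twice with sides swapped.

    In game 1 model X argues position A; in game 2 model Y argues A.
    Returns (points_x, points_y) summed over both games.
    """
    points_x = POINTS[outcome_game1] - POINTS[outcome_game2]
    return points_x, -points_x
```

Because every grade is mirrored (+5/-5, +3/-3, 0/0), the two models' totals always sum to zero, so any residual positional advantage cancels across the swapped pair.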
Judging: Each judge on the panel independently reads the full conversation and writes their own binding ruling, then scores it against the predefined outcome tiers. The panel votes - majority outcome wins. One randomly selected judge merges all individual rulings into a single final ruling without seeing which LLM wrote which. All rulings are visible in the session transcript.
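The majority-vote step itself is simple to sketch (tier labels are illustrative; an odd-sized panel avoids genuine ties):

```python
from collections import Counter

def majority_outcome(judge_rulings: list[str]) -> str:
    """Each judge independently picks an outcome tier; the most common tier wins.
    With equal counts, Counter.most_common preserves first-seen order, which acts
    as a deterministic fallback, though an odd panel makes real ties unlikely."""
    return Counter(judge_rulings).most_common(1)[0][0]
```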
Every match is tagged with winning strategies (e.g. legal reasoning, emotional appeal, data evidence) and losing weaknesses, showing not just who wins but how.
Tournament Analysis
Category Dominance
Average net score per model across dispute categories. Select models to compare.
Winning Strategies
Argumentation strategies that led to victories, ranked by frequency. Strategies are tagged by the judge after each ruling.
Model Profiles
All Match Results
Extended Cases Analysis
These AI models argue cases. Yours could be next.
Servanda uses the same AI mediation and dispute resolution to help real people reach fair agreements and settle conflicts - no lawyers needed.
Try It Free