The Post-Training Stress Test
Why This Test Matters
Most benchmarks test what a model can do - solve math, write code, answer questions. This one tests what a model will do when put under pressure. Each LLM must hold a position, counter arguments, and resist conceding - even when the other side pushes hard and the mediator probes for weaknesses.
This matters beyond argumentation. A model that folds in a dispute will also fold when you need it to push back on a bad code review, challenge a flawed requirement, or hold a position in an agentic workflow. Every match here tracks who conceded more, which strategies won, and where each model broke. The results are a readout of both intelligence and post-training character - how hard each model's developers fought to give it a backbone.
Leaderboard
ELO Progression
Methodology
LLMs argue their assigned position before a mediator that facilitates with probing questions and fact-checking. Once turns are exhausted, a panel of judges independently reads the full transcript, and each judge writes their own binding ruling. Scoring uses majority vote - each judge picks an outcome tier, and the most common tier wins. Models are ranked by total match points, with a Buchholz tiebreaker. Full methodology & prompts.
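The Buchholz tiebreaker, standard in Swiss tournaments, ranks tied models by the strength of their opposition. A minimal sketch (the data-structure names here are illustrative, not the benchmark's actual code):

```python
def buchholz(model: str, opponents: dict[str, list[str]], points: dict[str, int]) -> int:
    """Buchholz score: the sum of match points of every opponent a model faced.
    Of two models tied on points, the one that beat stronger opposition ranks higher."""
    return sum(points[opp] for opp in opponents[model])

def ranking(points: dict[str, int], opponents: dict[str, list[str]]) -> list[str]:
    # Sort by total match points first, then break ties with Buchholz.
    return sorted(points, key=lambda m: (points[m], buchholz(m, opponents, points)), reverse=True)
```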
More details
Disputes are drawn from a curated set of real-world test cases spanning categories like business, family, property, and ethical triage. Each case defines two opposing positions with predefined outcome tiers from decisive win to draw.
Models are paired using a Swiss tournament system (similar-strength opponents each round). Each pair plays the same case twice, swapping sides, to control for positional advantage. Each outcome is graded: a decisive win scores +5 (and -5 for the loser), a partial win +3 (and -3), and a draw 0 for both.
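One way to picture the side-swapped scoring: grade each game from position A's perspective, then map the grade back to the model that argued A in that game. A sketch under assumed outcome labels (the benchmark's real tier names and grading code are not shown here):

```python
# Points for the model arguing position A, per graded outcome; the B side gets the negation.
POINTS = {"decisive_a": 5, "partial_a": 3, "draw": 0, "partial_b": -3, "decisive_b": -5}

def score_pairing(outcome_game1: str, outcome_game2: str) -> tuple[int, int]:
    """Score one Swiss pairing: the same case played twice with sides swapped.

    In game 1 model X argues position A; in game 2 model Y argues A.
    Returns (points_x, points_y) summed over both games.
    """
    points_x = POINTS[outcome_game1] - POINTS[outcome_game2]
    return points_x, -points_x
```

Because every grade is mirrored (+5/-5, +3/-3, 0/0), the two models' totals always sum to zero, so any residual positional advantage cancels across the swapped pair.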
Judging: Each judge on the panel independently reads the full conversation and writes their own binding ruling, then scores it against the predefined outcome tiers. The panel votes - majority outcome wins. One randomly selected judge merges all individual rulings into a single final ruling without seeing which LLM wrote which. All rulings are visible in the session transcript.
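The majority-vote step itself is simple to sketch (tier labels are illustrative; an odd-sized panel avoids genuine ties):

```python
from collections import Counter

def majority_outcome(judge_rulings: list[str]) -> str:
    """Each judge independently picks an outcome tier; the most common tier wins.
    With equal counts, Counter.most_common preserves first-seen order, which acts
    as a deterministic fallback, though an odd panel makes real ties unlikely."""
    return Counter(judge_rulings).most_common(1)[0][0]
```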
Every match is tagged with winning strategies (e.g. legal reasoning, emotional appeal, data evidence) and losing weaknesses, showing not just who wins but how.
Tournament Analysis
Category Dominance
Average net score per model across dispute categories. Select models to compare.
Winning Strategies
Argumentation strategies that led to victories, ranked by frequency. Strategies are tagged by the judge after each ruling.
Model Profiles
All Match Results
Extended Cases Analysis
These AI models argue cases. Yours could be next.
Servanda uses the same AI mediation and dispute resolution to help real people reach fair agreements and settle conflicts - no lawyers needed.
Try It Free