Benchmark Methodology
Full transparency on how we test LLMs. Below are the exact system prompts given to each participant in a benchmark dispute, published verbatim.
How a Benchmark Match Works
Each match pits two LLMs against each other in a structured dispute. The process has five stages and is overseen by two distinct AI roles, a mediator and a judge:
- Case selection — A dispute is drawn from a curated set of real-world test cases spanning business, family, property, ethics, and more. Each case defines two opposing positions.
- Role assignment — Each model is assigned one side and given a stance prompt explaining their position and goals.
- Argumentation — Models take turns arguing before an AI mediator. They have a fixed number of turns before a binding ruling. Both the arguing models and the mediator have access to web search tools for fact-checking.
- Ruling — The mediator delivers a binding ruling after all turns are exhausted.
- Scoring — A separate AI judge evaluates the ruling post-hoc and assigns an outcome grade: decisive win (+5/-5), partial win (+3/-3), or draw (0). The judge also tags winning strategies and losing weaknesses.
The mediator and the judge are configurable per tournament and may use the same model or different ones. Models are paired using a Swiss tournament system, which matches opponents of similar strength each round. The final leaderboard ranks models by total points; ties are broken by Buchholz score (the sum of a model's opponents' points), which rewards models that faced tougher competition.
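The pairing step can be sketched roughly as follows. This is a minimal illustration of Swiss-style pairing, not the benchmark's actual pairing code: the function name `swiss_pairings`, the greedy top-down strategy, and the bye handling for an odd field are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple


def swiss_pairings(
    points: Dict[str, int],
    played: Set[FrozenSet[str]],
) -> Tuple[List[Tuple[str, str]], Optional[str]]:
    """Greedily pair models with similar totals (hypothetical helper).

    points: current total points per model.
    played: set of frozenset({a, b}) for pairs that have already met.
    Returns the pairings plus the model left without an opponent, if any.
    """
    # Rank by points, highest first; the name is only a deterministic tiebreak.
    ranked = sorted(points, key=lambda m: (-points[m], m))
    pairs: List[Tuple[str, str]] = []
    while len(ranked) >= 2:
        first = ranked.pop(0)
        # Prefer the closest-ranked opponent not yet faced; if every remaining
        # model has already been faced, fall back to the next one in line.
        idx = next(
            (i for i, m in enumerate(ranked)
             if frozenset((first, m)) not in played),
            0,
        )
        pairs.append((first, ranked.pop(idx)))
    bye = ranked[0] if ranked else None
    return pairs, bye


standings = {"alpha": 5, "beta": 3, "gamma": -3, "delta": -5}
already_played = {frozenset(("alpha", "beta"))}
print(swiss_pairings(standings, already_played))
# -> ([('alpha', 'gamma'), ('beta', 'delta')], None)
```

A greedy pass like this can still produce a rematch late in the list; production Swiss systems typically use backtracking or graph matching to avoid that.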
Mediator System Prompt
The mediator receives this system prompt. It presides over the dispute, asks clarifying questions, manages turn order, and delivers the final binding ruling. The mediator model is configurable per tournament.
Advocate System Prompt
Each arguing model receives this system prompt template. The {name}, {stance}, and {binding_turns} placeholders are filled per match. Models are explicitly told to argue aggressively — they are scored on how favorable the ruling is to their position, not on finding compromise.
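As an illustration of how the placeholders might be filled per match, here is a small sketch using Python's `str.format`. The template wording below is a hypothetical stand-in, not the published advocate prompt, and `render_advocate_prompt` is an assumed helper name; only the three placeholder names come from the description above.

```python
# Hypothetical stand-in text; the real advocate prompt is not reproduced here.
ADVOCATE_TEMPLATE = (
    "You are {name}. Your position: {stance} "
    "You have {binding_turns} turns before the binding ruling."
)


def render_advocate_prompt(name: str, stance: str, binding_turns: int) -> str:
    """Fill the per-match placeholders in the advocate template."""
    return ADVOCATE_TEMPLATE.format(
        name=name,
        stance=stance,
        binding_turns=binding_turns,
    )


prompt = render_advocate_prompt(
    name="Party A",
    stance="The deposit must be refunded in full.",
    binding_turns=3,
)
print(prompt)
```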
Scoring & Ranking
Each match outcome is graded on a symmetric scale from -5 to +5, so the two models' points for a match always sum to zero:
| Outcome | Winner | Loser |
|---|---|---|
| Decisive win | +5 | -5 |
| Partial win | +3 | -3 |
| Draw | 0 | 0 |
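The table above translates directly into a points tally. The sketch below assumes match results arrive as `(winner, loser, outcome)` tuples and that a draw simply lists the two participants in either order; the names `OUTCOME_POINTS` and `tally` are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Points for the winning side; the losing side receives the negation.
OUTCOME_POINTS = {"decisive": 5, "partial": 3, "draw": 0}


def tally(matches: Iterable[Tuple[str, str, str]]) -> Dict[str, int]:
    """Sum points per model from (winner, loser, outcome) records."""
    totals: Dict[str, int] = defaultdict(int)
    for winner, loser, outcome in matches:
        pts = OUTCOME_POINTS[outcome]
        totals[winner] += pts
        totals[loser] -= pts
    return dict(totals)


results = [
    ("a", "b", "decisive"),  # a +5, b -5
    ("b", "c", "partial"),   # b +3, c -3
    ("a", "c", "draw"),      # both 0
]
print(tally(results))
# -> {'a': 5, 'b': -2, 'c': -3}
```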
Elo ratings are also tracked. Each model starts at 1500, and updates use the standard Elo calculation with a K-factor that reflects the margin of victory.
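A rating update of this kind could look like the sketch below. The expected-score formula is the standard Elo one; the specific K values per outcome are assumptions chosen for illustration, since the actual margin-of-victory scaling is not stated here.

```python
from typing import Tuple

# Assumed K-factors: a larger margin of victory moves ratings more.
K_BY_OUTCOME = {"decisive": 32.0, "partial": 24.0, "draw": 16.0}


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(
    r_a: float, r_b: float, score_a: float, outcome: str
) -> Tuple[float, float]:
    """Return new ratings for A and B.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    """
    k = K_BY_OUTCOME[outcome]
    delta = k * (score_a - expected_score(r_a, r_b))
    # The update is zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta


print(update_elo(1500.0, 1500.0, 1.0, "decisive"))
# -> (1516.0, 1484.0)
```

With equal ratings the expected score is 0.5, so a decisive win under these assumed values shifts each rating by 16 points and a draw leaves both unchanged.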
Tiebreaker: When models have equal total points, their Buchholz score (the sum of all their opponents' total points) is used. This rewards models that faced stronger opponents.
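The ranking rule can be sketched as a two-key sort. The function names and the data layout (a points dictionary plus a per-model list of opponents faced) are assumptions for the example; the standings shown are illustrative, not real benchmark results.

```python
from typing import Dict, List


def buchholz(
    model: str, points: Dict[str, int], opponents: Dict[str, List[str]]
) -> int:
    """Sum of the total points of every opponent this model faced."""
    return sum(points[o] for o in opponents[model])


def leaderboard(
    points: Dict[str, int], opponents: Dict[str, List[str]]
) -> List[str]:
    """Rank by total points, then Buchholz, then name for determinism."""
    return sorted(
        points,
        key=lambda m: (-points[m], -buchholz(m, points, opponents), m),
    )


# Illustrative two-round tournament in which a and b tie on 3 points.
points = {"a": 3, "b": 3, "c": 2, "d": -8}
opponents = {
    "a": ["c", "b"],
    "b": ["d", "a"],
    "c": ["a", "d"],
    "d": ["b", "c"],
}
print(buchholz("a", points, opponents), buchholz("b", points, opponents))
# -> 5 -5
print(leaderboard(points, opponents))
# -> ['a', 'b', 'c', 'd']
```

Here `a` ranks above `b` despite equal points because its opponents accumulated more points overall.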