Side-by-side gallery · resolved events

What you're looking at: each row is a resolved event and each model cell is its through-pipeline loss; green cells and low footer means are good, and clicking a row opens the probabilities, rationale, and evidence URLs.

26 events from PA's sample-resolved set, run with the same retrieval, prompt, and longshot floor across five LLM swaps. Each cell shows the model's p(outcomes[0] wins) and the resulting single-binary Brier (the Prophet Arena CLI metric). Color: green = low loss, red = high loss. Click any row for full per-outcome probabilities, rationales, and evidence URLs.

Leakage note: these per-event Briers come from unfiltered retrieval, so the footer means (~0.038 for production) are best-case-with-hindsight: replay retrieval can surface post-resolution articles. Every model got the same hindsight benefit, so the cross-model ranking holds, but our honest headline is the leakage-disciplined binary Brier of 0.118, obtained by date-capping retrieval to each event's close. That is a 3.1× hindsight inflation (paired bootstrap, n=26, 95% CI [−0.136, −0.030] on the Brier difference). See the summary report for the full audit.
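For readers who want to see how a paired bootstrap interval of this kind can be produced, here is a minimal sketch. It is not the audit's code: the function name, the resample count, the percentile method, and the four-event demo numbers are all assumptions made for illustration. The real per-event date-capped Briers are in the summary report, not on this page.

```python
import random


def paired_bootstrap_ci(hindsight, capped, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(hindsight - capped), resampling events.

    Both lists must be aligned per event (same event at the same index), which
    is what makes the bootstrap "paired".
    """
    if len(hindsight) != len(capped):
        raise ValueError("inputs must be paired per event")
    rng = random.Random(seed)
    diffs = [h - c for h, c in zip(hindsight, capped)]
    n = len(diffs)
    point = sum(diffs) / n

    means = []
    for _ in range(n_resamples):
        # Draw n events with replacement and average their paired differences.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lo, hi


# Illustrative numbers only (NOT from the audit): per-event Brier with
# unfiltered retrieval vs. date-capped retrieval.
demo_hindsight = [0.010, 0.250, 0.003, 0.040]
demo_capped = [0.090, 0.300, 0.050, 0.200]
point, lo, hi = paired_bootstrap_ci(demo_hindsight, demo_capped)
```

A negative interval that excludes zero, as in the note above, means the unfiltered run scored a lower (better) Brier than the date-capped run on the same events.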

How to read this page

Each column is the same pipeline with a different LLM. Production (Opus 4.7) is outlined in green. All other models run through identical retrieval, prompt, and post-processing; only the LLM call swaps.
Per-cell: top number is the model's probability that outcomes[0] wins (PA's convention). Bottom is the resulting single-binary Brier on this event. Background color: dark green < 0.02, light green < 0.08, yellow < 0.16, orange < 0.30, red >= 0.30.
Striped err cells mean that model failed JSON parsing on this event; the underlying value is a uniform-prior fallback. The footer mean still counts that fallback because it is the model's actual through-pipeline behavior on the resolved set.
"best Brier" column: which model scored lowest single-binary Brier on this event. "tie" when within 0.001 - common on binary events where everyone correctly picks 0.90/0.10.
Multi-class vs single-binary: PA's CLI evaluator scores single-binary, while their docs describe a proper multi-class score. The drill-down modal shows all per-outcome probabilities so you can compute either. The summary report has the dual-metric leaderboard.
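The reading rules above can be sketched in a few lines of Python. This is an illustration, not the page's actual code, and every function name is hypothetical. Three assumptions are worth flagging: the uniform-prior fallback is taken to be 1/n_outcomes (this reproduces the err cells in the table, e.g. n_outcomes=11 gives B=0.826); the multi-class score is the unnormalized sum of squared errors over outcomes (some definitions divide by the number of outcomes); and the "best Brier" label is computed only over cells that parsed, which is a guess that matches the rows shown rather than a documented rule.

```python
def single_binary_brier(p_first, first_won):
    """Squared error on the single event 'outcomes[0] wins'."""
    return (p_first - (1.0 if first_won else 0.0)) ** 2


def uniform_fallback(n_outcomes):
    """Assumed fallback probability when JSON parsing fails."""
    return 1.0 / n_outcomes


def multiclass_brier(probs, winner_index):
    """Unnormalized multi-class Brier: sum of squared errors over all outcomes."""
    return sum(
        (p - (1.0 if i == winner_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def color_band(brier):
    """Map a Brier value to the background color bands listed above."""
    for upper, name in [(0.02, "dark green"), (0.08, "light green"),
                        (0.16, "yellow"), (0.30, "orange")]:
        if brier < upper:
            return name
    return "red"


def best_label(briers, tol=0.001):
    """Label the lowest-Brier model(s); briers maps model name -> Brier.

    Assumption: callers pass only cells that parsed (no err fallbacks).
    """
    best = min(briers.values())
    winners = [m for m, b in briers.items() if b - best <= tol]
    if len(winners) == 1:
        return winners[0]
    if len(winners) == len(briers):
        return "tie"
    return "tie (" + ", ".join(winners) + ")"


# A confident, correct binary call: p=0.90 on the winner.
b = single_binary_brier(0.90, True)
band = color_band(b)  # falls in the < 0.02 band
```

On a two-outcome event the multi-class score is exactly twice the single-binary one, so rankings agree; the two metrics can only diverge on events with three or more outcomes, where probability placed on outcomes other than outcomes[0] starts to matter.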


event (category · n_outcomes · winner)	best Brier	Opus 4.7 (production)	Opus 4.6	GPT-5.2	GPT-5.5	Gemini 3.1 Pro
Who became Prime Minister of Hungary after the 2026 election? (Elections · n_outcomes=2 · winner: Péter Magyar)	tie	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010
Who won the 2026 Democratic primary for Ohio's 15th Congressional District? (Elections · n_outcomes=2 · winner: Don Leonard)	tie	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010
Who won the 2026 Democratic primary for West Virginia's 1st Congressional District? (Elections · n_outcomes=2 · winner: Vince George)	tie	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010
Which contestants were eliminated from Survivor Season 50 Episode 7? (Entertainment · n_outcomes=14 · winner: Dee Valladares)	tie	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	err / B=0.005
Who was Netflix's next live roast subject after Tom Brady? (Entertainment · n_outcomes=11 · winner: Kevin Hart)	tie	0.50 / B=0.250	0.50 / B=0.250	0.50 / B=0.250	err / B=0.826	0.50 / B=0.250
Who won The Masked Singer Season 14? (Entertainment · n_outcomes=18 · winner: Galaxy Girl)	tie	0.05 / B=0.003	0.05 / B=0.003	err / B=0.003	err / B=0.003	err / B=0.003
Who won Tournament of Champions Season 7 (Food Network)? (Entertainment · n_outcomes=20 · winner: Bryan Voltaggio)	tie	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	err / B=0.003	err / B=0.003
How many Supreme Court justices voted in favor of Louisiana in Louisiana v. Callais (2026)? (Politics · n_outcomes=10 · winner: 3)	Opus 4.6	0.10 / B=0.010	0.05 / B=0.003	err / B=0.010	err / B=0.010	err / B=0.010
How many US Senators voted for Trump's Federal Reserve Chair pick (2027 confirmation vote)? (Politics · n_outcomes=12 · winner: 54)	tie	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	err / B=0.007
Which coalition won the 2026 Colombia Senate election? (Politics · n_outcomes=2 · winner: Government Alliance)	Opus 4.6	0.82 / B=0.032	0.88 / B=0.014	0.72 / B=0.078	0.82 / B=0.032	0.40 / B=0.360
Who won the 2025-26 La Liga (Spanish top-flight football)? (Sports · n_outcomes=20 · winner: Barcelona)	tie	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	err / B=0.003
Who won the 2025-26 Liga Portugal (Portuguese top-flight football)? (Sports · n_outcomes=18 · winner: FC Porto)	tie	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	err / B=0.003
Who won the 2025-26 Ligue 1 (French top-flight football)? (Sports · n_outcomes=18 · winner: PSG)	Opus 4.7	0.85 / B=0.023	0.75 / B=0.062	0.78 / B=0.048	err / B=0.892	err / B=0.892
Who won the 2025-26 NHL Calder Memorial Trophy (Rookie of the Year)? (Sports · n_outcomes=30 · winner: Matthew Schaefer)	tie	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	0.05 / B=0.003	err / B=0.003
Who won the 2025-26 Serie A (Italian top-flight football)? (Sports · n_outcomes=20 · winner: Inter)	GPT-5.2	0.22 / B=0.048	0.18 / B=0.032	0.14 / B=0.020	0.28 / B=0.078	err / B=0.003
Who won the Bangladesh vs Pakistan men's Test cricket match starting May 8, 2026? (Sports · n_outcomes=2 · winner: Bangladesh)	tie	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010
Who won the Bax vs Arcon ATP Challenger tennis match on May 5, 2026? (Sports · n_outcomes=2 · winner: Florent Bax)	Opus 4.6	0.55 / B=0.202	0.75 / B=0.062	0.52 / B=0.230	0.62 / B=0.144	0.50 / B=0.250
Who won the Breda vs Heerenveen Eredivisie football match on May 10, 2026? (Sports · n_outcomes=3 · winner: Breda)	tie	0.80 / B=0.040	0.80 / B=0.040	0.80 / B=0.040	0.80 / B=0.040	0.80 / B=0.040
Who won the Glamorgan vs Somerset 2026 County Championship cricket match? (Sports · n_outcomes=2 · winner: Glamorgan)	tie	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010
Who won the Los Angeles Lakers vs Oklahoma City NBA Round 2 playoff series in 2026? (Sports · n_outcomes=2 · winner: Oklahoma City)	tie	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010	0.90 / B=0.010
Who won the Najzer vs Ebster tennis match in the 2026 W15 Klagenfurt Round of 32? (Sports · n_outcomes=2 · winner: Anna Lena Ebster)	tie	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010
Who won the Perez vs Lalami Laaroussi ATP Challenger tennis match on May 5, 2026? (Sports · n_outcomes=2 · winner: Younes Lalami Laaroussi)	tie (Opus 4.7, GPT-5.2, GPT-5.5, Gemini 3.1 Pro)	0.10 / B=0.010	0.25 / B=0.062	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010
Who won the Rocha vs Johns ATP Challenger tennis match on May 10, 2026? (Sports · n_outcomes=2 · winner: Garrett Johns)	tie (Opus 4.7, Opus 4.6, GPT-5.2, Gemini 3.1 Pro)	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010	0.12 / B=0.014	0.10 / B=0.010
Who won the Sussex vs Leicestershire 2026 County Championship cricket match? (Sports · n_outcomes=2 · winner: Sussex)	Opus 4.7	0.90 / B=0.010	0.57 / B=0.181	0.60 / B=0.160	0.88 / B=0.014	err / B=0.250
Who won the Watson vs Okamura WTA Challenger tennis match on May 5, 2026? (Sports · n_outcomes=2 · winner: Heather Watson)	Gemini 3.1 Pro	0.50 / B=0.250	0.55 / B=0.202	0.57 / B=0.185	0.52 / B=0.230	0.70 / B=0.090
Who won the Worcestershire vs Durham 2026 County Championship cricket match? (Sports · n_outcomes=2 · winner: Durham)	tie	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010	0.10 / B=0.010
mean Brier (lower is better; best-case w/ hindsight; honest = 0.118)	-	0.038 (hindsight mean)	0.0391	0.0438	0.0920	0.0873
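As a sanity check on the footer, the column means can be recomputed from the per-event Briers displayed in the table. The sketch below does this for the production and GPT-5.5 columns. The variable names are illustrative, and the inputs are the three-decimal values as shown, so the result can differ from the page's own footer in the last digit.

```python
# Per-event Briers as displayed in the table, in row order (26 events).
opus_47 = [
    0.010, 0.010, 0.010, 0.003, 0.250, 0.003, 0.003, 0.010, 0.003, 0.032,
    0.003, 0.003, 0.023, 0.003, 0.048, 0.010, 0.202, 0.040, 0.010, 0.010,
    0.010, 0.010, 0.010, 0.010, 0.250, 0.010,
]
gpt_55 = [
    0.010, 0.010, 0.010, 0.003, 0.826, 0.003, 0.003, 0.010, 0.003, 0.032,
    0.003, 0.003, 0.892, 0.003, 0.078, 0.010, 0.144, 0.040, 0.010, 0.010,
    0.010, 0.010, 0.014, 0.014, 0.230, 0.010,
]
# Zero-based row positions where the GPT-5.5 cell is an err (fallback) cell.
gpt_55_err_rows = [4, 5, 6, 7, 12]


def column_mean(briers):
    """Footer mean: plain average over all events, fallback cells included."""
    return sum(briers) / len(briers)


mean_opus_47 = column_mean(opus_47)
mean_gpt_55 = column_mean(gpt_55)

# Fraction of GPT-5.5's total loss that comes from its fallback cells.
err_share = sum(gpt_55[i] for i in gpt_55_err_rows) / sum(gpt_55)
```

Because the footer counts fallback cells, a few parse failures on events where outcomes[0] actually won can dominate a column's mean, which is worth keeping in mind when comparing columns that contain err cells.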