26 events from PA's sample-resolved set, run with the same retrieval, prompt, and longshot floor across five swapped-in LLMs. Each cell shows the model's p(outcomes[0] wins) and the resulting single-binary Brier score, B (the Prophet Arena CLI metric). Color: green = low loss, red = high loss. Click any row for full per-outcome probabilities, rationales, and evidence URLs.
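As a minimal sketch of how a cell's B value relates to its probability, assuming the single-binary Brier is the squared error between p(outcomes[0] wins) and a 0/1 actual: the function name `binary_brier` is hypothetical, not part of the Prophet Arena CLI, and the actuals in the usage lines are inferred from the B values shown in the table.

```python
def binary_brier(p_first: float, first_won: bool) -> float:
    """Squared error between p(outcomes[0] wins) and the 0/1 actual.

    Hypothetical helper for illustration; assumes the single-binary
    Brier is (p - actual)^2 with actual = 1 if outcomes[0] won, else 0.
    """
    if not 0.0 <= p_first <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    actual = 1.0 if first_won else 0.0
    return (p_first - actual) ** 2


# Cells from the table, with the actual inferred from the displayed B:
print(f"{binary_brier(0.90, True):.3f}")   # a 0.90 cell where outcomes[0] won
print(f"{binary_brier(0.40, True):.3f}")   # the 0.40 cell in the Colombia Senate row
print(f"{binary_brier(0.10, False):.3f}")  # a 0.10 cell where outcomes[0] lost
```

Displayed probabilities appear to be rounded to two decimals, so a recomputed B can differ from the cell's B in the third decimal.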
Leakage note: these per-event Briers come from unfiltered retrieval, so the footer means (~0.038 for production) are best-case-with-hindsight: replay retrieval can surface post-resolution articles. Every model got the same hindsight benefit, so the cross-model ranking holds, but our honest headline is the leakage-disciplined binary Brier of 0.118, obtained by date-capping retrieval to each event's close. That is a 3.1× hindsight inflation (paired bootstrap, n=26, 95% CI on the paired Brier difference [−0.136, −0.030]). See the summary report for the full audit.
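The paired comparison can be sketched roughly as follows. This is an illustration under stated assumptions, not the audit's actual code: it assumes a percentile bootstrap over per-event differences, with the sign convention (unfiltered minus date-capped) inferred from the negative interval. The function name `paired_bootstrap_ci` is hypothetical, and the four score pairs are made-up toy values, not the 26 real events.

```python
import random


def paired_bootstrap_ci(unfiltered, capped, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (unfiltered - capped).

    Events are resampled with replacement as whole pairs, so each event's
    two scores always stay together.
    """
    if len(unfiltered) != len(capped) or not unfiltered:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [u - c for u, c in zip(unfiltered, capped)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Toy per-event Briers (illustrative only, NOT the real per-event data).
toy_unfiltered = [0.010, 0.003, 0.250, 0.032]
toy_capped = [0.040, 0.090, 0.250, 0.160]

mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(toy_unfiltered, toy_capped)
print(f"mean diff {mean_diff:.3f}, 95% CI [{ci_lo:.3f}, {ci_hi:.3f}]")
```

An interval that sits entirely below zero, like the one reported above, means the unfiltered scores are reliably lower (better) than the date-capped ones.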
err cells mean that the model failed JSON parsing on this event; the underlying value is a uniform-prior fallback. The footer mean still counts that fallback because it is the model's actual through-pipeline behavior on the resolved set.
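A rough sketch of the fallback described above, assuming the model is asked for a JSON object mapping each outcome name to a probability. The function name `parse_forecast` and the expected JSON shape are assumptions for illustration; the pipeline's real parser may differ.

```python
import json


def parse_forecast(raw: str, outcomes: list[str]) -> tuple[dict[str, float], bool]:
    """Return (probabilities, is_err).

    On any parse failure, fall back to a uniform prior over the outcomes
    and flag the cell as err. Hypothetical helper, not the pipeline's code.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    uniform = {o: 1.0 / len(outcomes) for o in outcomes}
    try:
        data = json.loads(raw)
        probs = {o: float(data[o]) for o in outcomes}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return uniform, True
    return probs, False


print(parse_forecast('{"A": 0.9, "B": 0.1}', ["A", "B"]))
print(parse_forecast("Sure! Here is my forecast...", ["A", "B"]))
```

With two outcomes the fallback is 0.5 each, which scores B=0.250 whichever side wins; with many outcomes the uniform probability is small, so an err cell scores well when outcomes[0] lost and badly when it won.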
| event | best Brier | Opus 4.7 production | Opus 4.6 | GPT-5.2 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|---|---|
| Who became Prime Minister of Hungary after the 2026 election? | tie | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) |
| Who won the 2026 Democratic primary for Ohio's 15th Congressional District? | tie | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) |
| Who won the 2026 Democratic primary for West Virginia's 1st Congressional District? | tie | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) |
| Which contestants were eliminated from Survivor Season 50 Episode 7? | tie | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | err (B=0.005) |
| Who was Netflix's next live roast subject after Tom Brady? | tie | 0.50 (B=0.250) | 0.50 (B=0.250) | 0.50 (B=0.250) | err (B=0.826) | 0.50 (B=0.250) |
| Who won The Masked Singer Season 14? | tie | 0.05 (B=0.003) | 0.05 (B=0.003) | err (B=0.003) | err (B=0.003) | err (B=0.003) |
| Who won Tournament of Champions Season 7 (Food Network)? | tie | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | err (B=0.003) | err (B=0.003) |
| How many Supreme Court justices voted in favor of Louisiana in Louisiana v. Callais (2026)? | Opus 4.6 | 0.10 (B=0.010) | 0.05 (B=0.003) | err (B=0.010) | err (B=0.010) | err (B=0.010) |
| How many US Senators voted for Trump's Federal Reserve Chair pick (2027 confirmation vote)? | tie | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | err (B=0.007) |
| Which coalition won the 2026 Colombia Senate election? | Opus 4.6 | 0.82 (B=0.032) | 0.88 (B=0.014) | 0.72 (B=0.078) | 0.82 (B=0.032) | 0.40 (B=0.360) |
| Who won the 2025-26 La Liga (Spanish top-flight football)? | tie | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | err (B=0.003) |
| Who won the 2025-26 Liga Portugal (Portuguese top-flight football)? | tie | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | err (B=0.003) |
| Who won the 2025-26 Ligue 1 (French top-flight football)? | Opus 4.7 | 0.85 (B=0.023) | 0.75 (B=0.062) | 0.78 (B=0.048) | err (B=0.892) | err (B=0.892) |
| Who won the 2025-26 NHL Calder Memorial Trophy (Rookie of the Year)? | tie | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | 0.05 (B=0.003) | err (B=0.003) |
| Who won the 2025-26 Serie A (Italian top-flight football)? | GPT-5.2 | 0.22 (B=0.048) | 0.18 (B=0.032) | 0.14 (B=0.020) | 0.28 (B=0.078) | err (B=0.003) |
| Who won the Bangladesh vs Pakistan men's Test cricket match starting May 8, 2026? | tie | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) |
| Who won the Bax vs Arcon ATP Challenger tennis match on May 5, 2026? | Opus 4.6 | 0.55 (B=0.202) | 0.75 (B=0.062) | 0.52 (B=0.230) | 0.62 (B=0.144) | 0.50 (B=0.250) |
| Who won the Breda vs Heerenveen Eredivisie football match on May 10, 2026? | tie | 0.80 (B=0.040) | 0.80 (B=0.040) | 0.80 (B=0.040) | 0.80 (B=0.040) | 0.80 (B=0.040) |
| Who won the Glamorgan vs Somerset 2026 County Championship cricket match? | tie | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) |
| Who won the Los Angeles Lakers vs Oklahoma City NBA Round 2 playoff series in 2026? | tie | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) | 0.90 (B=0.010) |
| Who won the Najzer vs Ebster tennis match in the 2026 W15 Klagenfurt Round of 32? | tie | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) |
| Who won the Perez vs Lalami Laaroussi ATP Challenger tennis match on May 5, 2026? | tie (Opus 4.7, GPT-5.2, GPT-5.5, Gemini 3.1 Pro) | 0.10 (B=0.010) | 0.25 (B=0.062) | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) |
| Who won the Rocha vs Johns ATP Challenger tennis match on May 10, 2026? | tie (Opus 4.7, Opus 4.6, GPT-5.2, Gemini 3.1 Pro) | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.12 (B=0.014) | 0.10 (B=0.010) |
| Who won the Sussex vs Leicestershire 2026 County Championship cricket match? | Opus 4.7 | 0.90 (B=0.010) | 0.57 (B=0.181) | 0.60 (B=0.160) | 0.88 (B=0.014) | err (B=0.250) |
| Who won the Watson vs Okamura WTA Challenger tennis match on May 5, 2026? | Gemini 3.1 Pro | 0.50 (B=0.250) | 0.55 (B=0.202) | 0.57 (B=0.185) | 0.52 (B=0.230) | 0.70 (B=0.090) |
| Who won the Worcestershire vs Durham 2026 County Championship cricket match? | tie | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) | 0.10 (B=0.010) |
| mean Brier (lower is better; best-case w/ hindsight; honest = 0.118) | - | 0.038 (hindsight mean) | 0.0391 | 0.0438 | 0.0920 | 0.0873 |