Side-by-side gallery · resolved events

What you're looking at: each row is a resolved event, and each model cell shows that model's loss through the full pipeline. Green cells and low mean losses in the footer are good. Click a row to open its probabilities, rationale, and evidence URLs.

26 events from PA's sample-resolved set, with the same retrieval, prompt, and longshot floor throughout; only the LLM changes (five swaps). Each cell shows the model's p(outcomes[0] wins) and the resulting single-binary Brier score (the Prophet Arena CLI metric). Color: green = low loss, red = high loss. Click any row for full per-outcome probabilities, rationales, and evidence URLs.
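As a rough illustration of how a cell value and a footer mean could be computed, here is a minimal sketch. It assumes the common single-outcome definition of the Brier score, (p − y)², where y is 1 if outcomes[0] won and 0 otherwise; the function names `binary_brier` and `mean_brier` are hypothetical and are not taken from the Prophet Arena CLI.

```python
from typing import Iterable, Tuple


def binary_brier(p_first: float, first_won: bool) -> float:
    """Squared error between p(outcomes[0] wins) and the 0/1 result.

    Assumption: single-outcome Brier, (p - y)^2, so 0 is perfect and 1 is worst.
    """
    if not 0.0 <= p_first <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    outcome = 1.0 if first_won else 0.0
    return (p_first - outcome) ** 2


def mean_brier(rows: Iterable[Tuple[float, bool]]) -> float:
    """Average per-event Brier over (probability, first_won) pairs."""
    scores = [binary_brier(p, won) for p, won in rows]
    if not scores:
        raise ValueError("need at least one event")
    return sum(scores) / len(scores)
```

Under this definition, a confident correct forecast (p = 0.9, outcomes[0] wins) scores about 0.01, while the same forecast on a losing outcome scores about 0.81, which is why a few confident misses dominate a footer mean.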

Leakage note: these per-event Briers come from unfiltered retrieval, so the footer means (~0.038 for production) are best-case-with-hindsight, because replay retrieval can surface post-resolution articles. Every model got the same hindsight benefit, so the cross-model ranking holds, but our honest headline is the leakage-disciplined binary Brier of 0.118, obtained by date-capping retrieval to each event's close. That is a 3.1× hindsight inflation (0.118 / 0.038; paired bootstrap, n=26, 95% CI [−0.136, −0.030]). See the summary report for the full audit.
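For readers unfamiliar with the paired bootstrap named in the leakage note, the sketch below shows one common variant: a percentile bootstrap on the mean of per-event differences between two runs scored on the same events. It is an assumption that the audit used this exact variant; the function name, the resample count, and the seed are hypothetical, and no numbers from the report are reproduced here.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI low, CI high) for a[i] - b[i].

    Events are resampled with replacement as pairs, so each event's two
    scores stay together in every resample.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(diffs) / n, means[lo_idx], means[hi_idx]
```

Passing the per-event unfiltered Briers as `a` and the date-capped Briers as `b` would give a negative interval whenever the unfiltered run scores consistently lower, which matches the sign of the interval quoted above.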

How to read this page