Forecast diagnostics - brave_fresh (honest)
26 resolved events (14 binary, 12 multi) - hand-run snapshot - numbers from evaluation/brier.py
Largest systematic residual: ECE=0.226 | mean(pred-actual)=+0.011 -> little net bias on average (per-bin miscalibration remains, see ECE) | worst category: Elections (binary Brier 0.171, n=3)
ECE (calibration error)
0.226
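For reference, a minimal sketch of how an ECE figure like the one above is commonly computed. It assumes equal-width confidence bins and binary 0/1 outcomes; the function name, the bin count, and the binning scheme are illustrative assumptions, not necessarily what evaluation/brier.py does.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (n_b / N) * |mean forecast - observed rate|.

    probs    -- forecast probabilities in [0, 1]
    outcomes -- 0/1 resolutions, same length as probs
    n_bins   -- number of equal-width bins (assumed; the dashboard's choice is unknown)
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("need at least one forecast")

    # Group (forecast, outcome) pairs by bin; p == 1.0 falls into the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry no weight
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - freq)
    return ece


# Hypothetical toy data, not the dashboard's events.
example_ece = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
```

With only 26 resolved events, most bins hold a handful of forecasts, so an ECE computed this way is noisy and sensitive to the bin count.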
Reliability diagram - dot on the line = calibrated; above = under-confident; below = over-confident
Murphy decomposition (binary): reliability 0.1112 (lower=better) - resolution 0.1760 (higher=better) - uncertainty 0.2296
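The three terms satisfy Brier ≈ reliability - resolution + uncertainty. A minimal sketch of the binned decomposition follows, assuming equal-width bins and 0/1 outcomes; the function name and bin count are illustrative, and evaluation/brier.py may bin differently.

```python
def murphy_decomposition(probs, outcomes, n_bins=10):
    """Return (reliability, resolution, uncertainty) for binary forecasts.

    reliability = (1/N) * sum_k n_k * (mean forecast in bin k - observed rate in bin k)^2
    resolution  = (1/N) * sum_k n_k * (observed rate in bin k - base rate)^2
    uncertainty = base rate * (1 - base rate)
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")

    n = len(probs)
    base_rate = sum(outcomes) / n

    # Group (forecast, outcome) pairs by equal-width bin (binning is an assumption).
    bins = {}
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins.setdefault(idx, []).append((p, y))

    reliability = 0.0
    resolution = 0.0
    for members in bins.values():
        n_k = len(members)
        mean_p = sum(p for p, _ in members) / n_k
        freq = sum(y for _, y in members) / n_k
        reliability += n_k * (mean_p - freq) ** 2
        resolution += n_k * (freq - base_rate) ** 2

    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical toy data: the identity is exact here because forecasts within a bin are identical.
rel, res, unc = murphy_decomposition([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
toy_brier = rel - res + unc
```

Applied to the reported terms, 0.1112 - 0.1760 + 0.2296 = 0.1648, slightly below the binary Brier of 0.170 in the table. A gap of that size is what a binned decomposition would produce when forecasts vary within a bin, though that explanation is an assumption here. The uncertainty term also equals 45/196, which is what a base rate of 9/14 over the 14 binary events would give.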
| confidence | n | Brier | winner p |
|---|---|---|---|
| 0.5-0.6 | 8 | 0.261 | 0.449 |
| 0.6-0.7 | 1 | 0.470 | 0.315 |
| 0.7-0.8 | 0 | | |
| 0.8-0.9 | 7 | 0.025 | 0.846 |
| 0.9-1.0 | 3 | 0.007 | 0.617 |
By category
| category | n | binary Brier | multi Brier | mean conf |
|---|---|---|---|---|
| Elections | 3 | 0.171 | 0.341 | 0.802 |
| Sports | 16 | 0.141 | 0.428 | 0.607 |
| Entertainment | 4 | 0.064 | 0.987 | 0.382 |
| Politics | 3 | 0.014 | 0.436 | 0.527 |
Binary vs multi
| kind | n | binary Brier | multi Brier | winner p |
|---|---|---|---|---|
| binary | 14 | 0.170 | 0.340 | 0.652 |
| multi | 12 | 0.057 | 0.698 | 0.327 |
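In the binary row the multi Brier (0.340) is twice the binary Brier (0.170). That is the expected relationship when the multi-outcome score sums squared errors over both outcomes of a two-way event. A minimal sketch under that assumption follows; the function names and the reading of "winner p" as the probability assigned to the outcome that actually resolved are illustrative, not confirmed by evaluation/brier.py.

```python
def binary_brier(prob_yes, resolved_yes):
    """Squared error on the 'yes' probability only; ranges over [0, 1]."""
    outcome = 1.0 if resolved_yes else 0.0
    return (prob_yes - outcome) ** 2


def multi_brier(probs, winner_index):
    """Sum of squared errors over all outcomes; ranges over [0, 2].

    probs        -- one probability per outcome, expected to sum to 1
    winner_index -- index of the outcome that actually resolved
    """
    if not 0 <= winner_index < len(probs):
        raise IndexError("winner_index out of range")
    return sum(
        (p - (1.0 if i == winner_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def winner_probability(probs, winner_index):
    """Probability the forecast assigned to the realized outcome (assumed meaning of 'winner p')."""
    return probs[winner_index]


# Hypothetical two-outcome forecast: 70% on the side that won.
b = binary_brier(0.7, True)      # (0.7 - 1)^2
m = multi_brier([0.7, 0.3], 0)   # (0.7 - 1)^2 + (0.3 - 0)^2, twice the binary score
```

Because the two scores live on different scales, the binary and multi columns are best compared within a column, not against each other.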