The Oracles / ForecastingPath
Live: healthy (a1f899b0)  |  No real PA traffic yet

Project overview

Team CanadaHacks · Prophet Hacks forecasting track · snapshot 2026-05-19 · hand-maintained

Honest backtest Brier
0.118
leakage-free retrieval, date-capped at each event's close. Best estimate of live skill.
Hindsight Brier
0.038
unfiltered retrieval. Best-case-with-hindsight. Not the headline.
Leakage inflation
3.1x
paired CI [-0.136, -0.030], Pr(≤0)=1.0, n=26. Confirms Subset-1200.
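
For readers new to the metric, here is a minimal sketch of how a binary Brier score and the leakage inflation ratio can be computed. The function names and the toy data are illustrative assumptions, not code from forecast_track.py; only the two headline figures (0.118 and 0.038) come from this page.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def inflation_ratio(honest: float, hindsight: float) -> float:
    """How many times larger the honest Brier is than the hindsight Brier."""
    return honest / hindsight


# Toy forecasts and resolutions (hypothetical, not the project's data).
toy_probs = [0.9, 0.2, 0.6, 0.1]
toy_outcomes = [1, 0, 1, 0]
toy_brier = brier_score(toy_probs, toy_outcomes)

# The two headline figures quoted on this page.
ratio = inflation_ratio(0.118, 0.038)
print(f"toy Brier = {toy_brier:.4f}, inflation = {ratio:.1f}x")
```

Lower Brier is better, so a hindsight score of 0.038 looks roughly three times stronger than the honest 0.118, which is why the honest figure is the headline.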

Code map

forecast_agent_server.py LIVE

FastAPI endpoint on Railway: POST /predict, health routes, dashboards, observatory, traces. The deployed surface.

forecast_track.py ENGINE

All forecaster variants, Brave retrieval, source ranking, longshot guard, event-semantics router. Production calls predict_multi_outcome_retrieval.

forecaster.py · agent.py

Skeleton forecast helpers and the trading-track agent. Not on the live forecast path; kept for reference.

support modules

risk.py, market_filter.py, logger.py, chat_completions_adapter.py - sizing, filtering, logging, OpenAI-compatible shim.

Forecaster variants (forecast_track.py)

variant | what it is | status
multi_outcome_retrieval | Brave evidence + Opus 4.7 + market-anchor prompt + longshot guard + top-K classifier | PRODUCTION
uniform_prior | deterministic 1/N, no model call | baseline
opus_47 / 46 / gpt55 / gpt52 / single_llm | single-model forecasters across vendors | ablation
ensemble_logit / leaderboard | logit-mean across models / leaderboard ensemble | ablation
multi_outcome / sc3 | per-outcome forecast / self-consistency k=3 | ablation
hybrid_routed | binary to GPT-5.5, multi to multi_outcome | research
multi_outcome_retrieval_sae | empirical-Bayes shrinkage over (domain x horizon x price) | negative*
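
Two of the simpler variants in the table can be sketched in a few lines. This is an illustration of the general techniques (a deterministic 1/N prior and a logit-mean ensemble), assuming hypothetical function names and a clipping epsilon; it is not the code in forecast_track.py.

```python
import math
from typing import Dict, List


def uniform_prior(outcomes: List[str]) -> Dict[str, float]:
    """Deterministic 1/N baseline: equal probability per outcome, no model call."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return {name: 1.0 / len(outcomes) for name in outcomes}


def _logit(p: float, eps: float = 1e-6) -> float:
    # Clip away from 0 and 1 so the log stays finite (eps is an assumption).
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit_mean(probs: List[float]) -> float:
    """Average several model probabilities in log-odds space, then map back."""
    if not probs:
        raise ValueError("need at least one probability")
    return _sigmoid(sum(_logit(p) for p in probs) / len(probs))


prior = uniform_prior(["A", "B", "C", "D"])
combined = logit_mean([0.2, 0.8])  # symmetric log-odds cancel out
print(prior["A"], round(combined, 6))
```

Averaging in logit space treats a move from 0.98 to 0.99 as larger than a move from 0.50 to 0.51, so it behaves differently from a plain arithmetic mean near the extremes.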

Experiment ledger - measure first, merge second

experiment | verdict
Search-provider bake-off (leakage axis) - today | retrieval leakage = 3.1x inflation; honest 0.118
Top-K event-semantics classifier | shipped a1f899b0
Opus 4.7 + market anchor; 0.10 longshot floor cap | shipped
Multi-vendor swap (Gemini / GPT-5.5 / GPT-5.2) | keep Opus 4.7
Retrieval count sweep 3 / 5 / 8 | failed gate
Verification prompt / self-critique / ensemble-of-6 | regressed
Abstain-to-market (price from snippets) | no-op (0/26 had price)
SAE shrinkage | 0.1157 - re-open vs honest*
Subset-1200 scale validation | 0.1224 honest, n=1200
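
The leakage finding in the first row depends on date-capping retrieval at each event's close. A minimal sketch of such a filter follows, assuming a hypothetical Snippet record with a publication timestamp. Treating undated snippets as unsafe and dropping them is an assumption here, as is the inclusive comparison at the close time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Snippet:
    """Hypothetical retrieval result; the field names are illustrative."""
    url: str
    published_at: Optional[datetime]  # None when the provider gives no date


def date_capped(snippets: List[Snippet], event_close: datetime) -> List[Snippet]:
    """Keep snippets published at or before the event close.

    Undated snippets are dropped because they cannot be shown to be
    leakage-free (an assumption of this sketch).
    """
    return [
        s for s in snippets
        if s.published_at is not None and s.published_at <= event_close
    ]


close = datetime(2026, 5, 1, tzinfo=timezone.utc)
results = [
    Snippet("https://example.com/before", datetime(2026, 4, 20, tzinfo=timezone.utc)),
    Snippet("https://example.com/after", datetime(2026, 5, 3, tzinfo=timezone.utc)),
    Snippet("https://example.com/undated", None),
]
kept = date_capped(results, close)
print([s.url for s in kept])
```

All timestamps should be timezone-aware in the same way; Python raises a TypeError when comparing naive and aware datetimes.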

Open lanes

Analysis Claude

  • Re-baseline all surfaces to honest 0.118
  • Re-run SAE / abstain vs the leakage-free baseline (may flip the negative)
  • Real multi-prompt sweep with bootstrap CIs

Frontend / infra Codex

  • Honest-vs-hindsight panel + per-provider leakage %
  • Daily auto-resolve + score loop to grow resolved-n
  • Provider sweep (Tavily / Exa / Serper) once keys land

Non-negotiables

  • A production change needs |delta| > 0.01 single-binary Brier and a 95% paired-bootstrap CI excluding zero
  • The shadow loop is research-only
  • Mid-window deploys get a DECISIONS.md entry (SHA + UTC)
  • Coordinate before touching forecast_track.py / forecast_agent_server.py / data/resolved.json
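
The production gate above (|delta| > 0.01 and a 95% paired-bootstrap CI excluding zero) can be sketched as follows. This assumes per-event Brier losses for a baseline and a candidate scored on the same events, and uses a simple percentile bootstrap. The function names, resample count, and seed are illustrative assumptions rather than the team's actual procedure.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    base: List[float],
    cand: List[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean delta, CI low, CI high) for candidate minus baseline loss."""
    if len(base) != len(cand) or not base:
        raise ValueError("need paired, non-empty per-event losses")
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(base, cand)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample events with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lo, hi


def passes_gate(base: List[float], cand: List[float], min_delta: float = 0.01) -> bool:
    """True when |delta| exceeds min_delta and the 95% CI excludes zero."""
    delta, lo, hi = paired_bootstrap_ci(base, cand)
    return abs(delta) > min_delta and (lo > 0 or hi < 0)


baseline = [0.2] * 26
print(passes_gate(baseline, [0.1] * 26))    # large, consistent improvement
print(passes_gate(baseline, [0.195] * 26))  # consistent but below 0.01
```

The gate as written checks magnitude only, so a caller would still need to confirm the sign of delta (negative means lower Brier for the candidate) before shipping.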

Where to read more

State

docs/HANDOFF.md - cold-start state + work lanes

Decisions

docs/DECISIONS.md - append-only log (newest at bottom)

Findings

docs/FINDINGS.md · RESEARCH_QUEUE.md · FORWARD_RESEARCH_PLAN.md

* SAE / abstain were rejected against the hindsight-inflated 0.038; against the honest 0.118 they are roughly competitive, so they are flagged for re-evaluation.

This page is a hand-maintained snapshot; numbers trace to DECISIONS.md and data/predictions/.

The Oracles · ForecastingPath · honest backtest, leakage-audited.