Project overview

Team CanadaHacks · Prophet Hacks forecasting track · snapshot 2026-05-19 · hand-maintained

Honest backtest Brier

0.118

leakage-free retrieval, date-capped at each event's close. Best estimate of live skill.
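
The date cap can be sketched as a filter over retrieved snippets. This is a minimal illustration, not the code in forecast_track.py: the `Snippet` type, its `published` field, and the `date_cap` helper are hypothetical names, and dropping undated snippets is an assumed conservative choice.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Snippet:
    """Hypothetical retrieval result; field names are illustrative only."""
    url: str
    text: str
    published: Optional[datetime]  # None when the provider reports no date


def date_cap(snippets: List[Snippet], close_time: datetime) -> List[Snippet]:
    """Keep only snippets published at or before the event's close.

    Undated snippets are dropped because they cannot be shown to be
    leakage-free (an assumption of this sketch).
    """
    return [
        s for s in snippets
        if s.published is not None and s.published <= close_time
    ]


close = datetime(2026, 5, 1, tzinfo=timezone.utc)
snippets = [
    Snippet("a", "preview", datetime(2026, 4, 28, tzinfo=timezone.utc)),
    Snippet("b", "result recap", datetime(2026, 5, 3, tzinfo=timezone.utc)),
    Snippet("c", "undated", None),
]
kept = date_cap(snippets, close)
print([s.url for s in kept])
```

Only the snippet dated before the close survives; the post-close recap is exactly the kind of evidence that would inflate a hindsight score.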

Hindsight Brier

0.038

unfiltered retrieval. Best-case-with-hindsight. Not the headline.

Leakage inflation

3.1x

paired CI on the Brier difference [-0.136, -0.030], Pr(≤0)=1.0, n=26. Confirms Subset-1200.
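
As a rough sketch of how figures like these can be produced: the inflation ratio is the honest Brier divided by the hindsight Brier, and the paired interval comes from resampling per-event Brier differences. The sign convention (hindsight minus honest), the percentile method, and reading Pr(≤0) as the share of bootstrap means at or below zero are all assumptions here. The function names are hypothetical and the toy inputs are synthetic, not project data.

```python
import random


def brier(p, y):
    """Single-binary Brier score: squared error between probability and outcome."""
    return (p - y) ** 2


def paired_bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean of paired per-event differences.

    Also returns the fraction of bootstrap means that are <= 0.
    """
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    frac_le_zero = sum(m <= 0 for m in means) / n_boot
    return lo, hi, frac_le_zero


# Headline ratio from the two aggregate scores on this page.
inflation = 0.118 / 0.038
print(round(inflation, 1))

# Synthetic paired forecasts for the same events (illustration only).
outcomes = [1, 0, 1, 0]
honest = [0.6, 0.4, 0.7, 0.3]
hindsight = [0.9, 0.1, 0.95, 0.05]
diffs = [brier(h, y) - brier(p, y) for h, p, y in zip(hindsight, honest, outcomes)]
lo, hi, frac = paired_bootstrap_ci(diffs)
print(lo, hi, frac)
```

Resampling the differences rather than the two score lists separately keeps each event paired with itself, which is what makes the interval a paired one.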

Code map

`forecast_agent_server.py` LIVE

FastAPI endpoint on Railway: POST /predict, health routes, dashboards, observatory, traces. The deployed surface.

`forecast_track.py` ENGINE

All forecaster variants, Brave retrieval, source ranking, longshot guard, event-semantics router. Production calls predict_multi_outcome_retrieval.

`forecaster.py` · `agent.py`

Skeleton forecast helpers and the trading-track agent. Not on the live forecast path; kept for reference.

support modules

risk.py, market_filter.py, logger.py, chat_completions_adapter.py - sizing, filtering, logging, OpenAI-compatible shim.

Forecaster variants (forecast_track.py)

variant	what it is	status
multi_outcome_retrieval	Brave evidence + Opus 4.7 + market-anchor prompt + longshot guard + top-K classifier	PRODUCTION
uniform_prior	deterministic 1/N, no model call	baseline
opus_47 / 46 / gpt55 / gpt52 / single_llm	single-model forecasters across vendors	ablation
ensemble_logit / leaderboard	logit-mean across models / leaderboard ensemble	ablation
multi_outcome / sc3	per-outcome forecast / self-consistency k=3	ablation
hybrid_routed	binary to GPT-5.5, multi to multi_outcome	research
multi_outcome_retrieval_sae	empirical-Bayes shrinkage over (domain x horizon x price)	negative*
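
Two of the simpler variants in the table can be sketched in a few lines. This is an illustration under assumptions, not the signatures in forecast_track.py: `uniform_prior` and `logit_mean` are hypothetical helper names, and the clamping epsilon is an arbitrary choice.

```python
import math


def uniform_prior(outcomes):
    """Deterministic 1/N over the listed outcomes; no model call."""
    n = len(outcomes)
    return {o: 1.0 / n for o in outcomes}


def logit(p, eps=1e-6):
    """Log-odds of p, clamped away from 0 and 1 to avoid infinities."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def logit_mean(probs):
    """Average binary probabilities in log-odds space, then map back."""
    mean = sum(logit(p) for p in probs) / len(probs)
    return 1 / (1 + math.exp(-mean))


print(uniform_prior(["A", "B", "C", "D"]))
print(logit_mean([0.2, 0.8]))
print(logit_mean([0.6, 0.9, 0.95]))
```

Averaging in log-odds space gives confident models more pull than a plain arithmetic mean would, which is one common reason to prefer it for ensembles.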

Experiment ledger - measure first, merge second

experiment	verdict
Search-provider bake-off (leakage axis) - 2026-05-19	retrieval leakage = 3.1x inflation; honest 0.118
Top-K event-semantics classifier	shipped a1f899b0
Opus 4.7 + market anchor; 0.10 longshot floor cap	shipped
Multi-vendor swap (Gemini / GPT-5.5 / GPT-5.2)	keep Opus 4.7
Retrieval count sweep 3 / 5 / 8	failed gate
Verification prompt / self-critique / ensemble-of-6	regressed
Abstain-to-market (price from snippets)	no-op (0/26 had price)
SAE shrinkage	0.1157 - re-open vs honest*
Subset-1200 scale validation	0.1224 honest, n=1200

Open lanes

Analysis · Claude

Re-baseline all surfaces to honest 0.118
Re-run SAE / abstain vs the leakage-free baseline (may flip the negative)
Real multi-prompt sweep with bootstrap CIs

Frontend / infra · Codex

Honest-vs-hindsight panel + per-provider leakage %
Daily auto-resolve + score loop to grow resolved-n
Provider sweep (Tavily / Exa / Serper) once keys land

Non-negotiables

Production change needs |delta| > 0.01 single-binary Brier and a 95% paired-bootstrap CI excluding zero
Shadow loop is research-only
Mid-window deploys get a DECISIONS.md entry (SHA + UTC)
Coordinate before touching forecast_track.py / forecast_agent_server.py / data/resolved.json
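
The production-change gate can be written as a small predicate. This is a sketch of the two stated thresholds only; `passes_merge_gate` and its parameter names are hypothetical, and the example numbers are made up. Because the rule is stated on |delta|, the check is direction-agnostic, so whether the change is an improvement still has to be confirmed separately.

```python
def passes_merge_gate(delta, ci_low, ci_high, min_abs_delta=0.01):
    """Return True when a change clears both stated thresholds.

    delta: candidate minus baseline single-binary Brier (negative = better).
    ci_low, ci_high: 95% paired-bootstrap CI for that same delta.
    """
    big_enough = abs(delta) > min_abs_delta
    ci_excludes_zero = ci_low > 0 or ci_high < 0
    return big_enough and ci_excludes_zero


# Hypothetical candidates (illustrative values, not project results).
print(passes_merge_gate(-0.020, -0.035, -0.006))  # large effect, CI excludes zero
print(passes_merge_gate(-0.004, -0.009, -0.001))  # CI excludes zero, effect too small
print(passes_merge_gate(-0.020, -0.050, 0.010))   # large effect, CI straddles zero
```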

Where to read more

State

docs/HANDOFF.md - cold-start state + work lanes

Decisions

docs/DECISIONS.md - append-only log (newest at bottom)

Findings

docs/FINDINGS.md · RESEARCH_QUEUE.md · FORWARD_RESEARCH_PLAN.md

* SAE / abstain were rejected against the hindsight-inflated 0.038; against the honest 0.118 they are roughly competitive, so they are flagged for re-evaluation.

This page is a hand-maintained snapshot; numbers trace to DECISIONS.md and data/predictions/.

The Oracles · ForecastingPath · honest backtest, leakage-audited.