ForecastingPath Observatory
Private operations

Observatory

Commit, prediction store, first-call state, and review links in one place. Use this view for screenshots after the first PA call lands.

Live state

Live commit: 69e9bff9
Production variant: multi_outcome_retrieval
Predictions in memory: 50
Persisted records: 385
Last PA call: Nugegoda Sports Welfare Club vs Baduraliya CC
Last total latency: 2557 ms
Server uptime: 19d 5h

First-call watch

PA activity observed. When a call lands, inspect event shape, outcome list, total latency, parser path, warnings, and returned probabilities before changing any production routing.

Failure mode | Status | Operator response
Brave degraded | monitor via full check | Verify fallback path, do not change model first.
Metric drift | known caveat | Keep single-binary and proper multi-class labeled separately.
Commit drift | covered | Compare this page's commit with origin/main before deploy claims.
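
The commit-drift check above amounts to a prefix comparison between the short hash shown on this page and the full hash of origin/main. The following is a minimal sketch, not the console's actual check: the function name `commit_matches` and the minimum length of 7 are assumptions, and the full hash is expected to come from the operator (for example from `git rev-parse origin/main`).

```python
def commit_matches(page_short_hash: str, origin_main_full_hash: str) -> bool:
    """Return True if the page's short hash is a prefix of origin/main's full hash.

    Hypothetical helper: the 7-character minimum is an assumed guard so that
    an empty or truncated page value never matches by accident.
    """
    short = page_short_hash.strip().lower()
    full = origin_main_full_hash.strip().lower()
    if len(short) < 7:
        return False
    return full.startswith(short)


# Placeholder full hash built from the page's short hash; not a real commit.
example_full = "69e9bff9" + "0" * 32
in_sync = commit_matches("69e9bff9", example_full)
```

A mismatch only says the page and origin/main differ; it does not say which side is stale.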

Recent persisted predictions

Market | Event | Probabilities | Latency | Parse path | Warnings
KXT20MATCH-26MAY310430BADNUG, 2026-06-01T20:24:58 | Nugegoda Sports Welfare Club vs Baduraliya CC (Sports) | Nugegoda Sports Welfare Club: 0.50, Baduraliya CC: 0.50 | 2557 ms | direct |
KXT20MATCH-26MAY310015ACESIN, 2026-06-01T20:11:03 | Sinhalese Sports Club vs Ace Capital CC (Sports) | Ace Capital CC: 0.55, Sinhalese Sports Club: 0.45 | 3493 ms | direct |
KXHIGHLAX-26JUN01, 2026-06-01T20:01:15 | Highest temperature in LA on Jun 1, 2026? (Climate and Weather) | 76° or above: 0.15, 74° to 75°: 0.17, 70° to 71°: 0.23 | 3438 ms | direct |
KXHIGHNY-26JUN01, 2026-06-01T19:46:03 | Highest temperature in NYC on Jun 1, 2026? (Climate and Weather) | 69° or below: 0.13, 72° to 73°: 0.18, 70° to 71°: 0.16 | 4528 ms | direct |
KXAAAGASD-26JUN02, 2026-06-01T19:36:20 | US gas prices on Jun 2, 2026 (Economics) | Above 4.295: 0.12, Above 4.305: 0.11, Above 4.290: 0.12 | 4134 ms | direct |
KXSPOTIFYGLOBALD-26JUN01, 2026-06-01T19:30:46 | Top Global Song on Spotify on Jun 1, 2026? (Entertainment) | SWIM: 0.06, Beat It: 0.06, hate that i made you love me: 0.13 | 6052 ms | direct |
KXSPOTIFYD-26JUN01, 2026-06-01T19:25:28 | Top USA Song on Spotify on Jun 1, 2026? (Entertainment) | Choosin' Texas: 0.05, hate that i made you love me: 0.12, Janice STFU: 0.06 | 4050 ms | direct |
KXTRUEV-26JUN01, 2026-06-01T19:18:39 | EV commodity prices on Jun 1, 2026 (Economics) | Above 1257.28: 0.50, Above 1247.28: 0.55, Above 1277.28: 0.42 | 4498 ms | direct |
KXSILVERD-26JUN0117, 2026-06-01T18:58:40 | Silver price on June 01, 2026 at 5:00 PM EDT? (Commodities) | above $75: 0.82, above $75.50: 0.70, above $76: 0.45 | 6770 ms | direct |
KXNATGASD-26JUN0117, 2026-06-01T18:38:42 | Natural gas price on June 01, 2026 at 5:00 PM EDT? (Commodities) | above $3.340: 0.05, above $3.370: 0.05, above $3.365: 0.05 | 10064 ms | | llm call failed: unparseable multi-outcome model output: {"probabilities": {"above $3.340": 0.25, "above $3.370": 0.22, "above $3.365": 0.23, "above $3.320": 0.28, "above $3.180

Experiment board

Question | Status | Current answer | Next action
Why not GPT-5.5? | tested | GPT-5.5 was tried. All-event single-binary Brier is 0.0920 vs Opus 4.7 at 0.0378; its binary-only subset is strong, but that is not the full endpoint metric. | Keep as ablation, not production.
Why current model? | measured | Best verifiable PA CLI score on current sample; Opus 4.6 is close and slightly better under proper multi-class, so this is not a settled universal claim. | Re-evaluate after live PA calls.
Prompt ablation | in progress elsewhere | Production, no-anchor, no-scale, minimal, and meta-role variants should be tracked here once finalized. | Import final JSON, cost, and caveats.
Retrieval count | not settled | Current count is based on prior plateau guidance, not yet independently optimized on our data. | Run 0/3/5/8 with same model and prompt.
Scoring method | corrected | PA CLI single-binary and proper multi-class must both be labeled. Do not mix rows across metrics. | Keep public claims conservative.
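
The reason the two metrics must not be mixed is that they score the same forecast on different scales. The sketch below uses the textbook Brier definitions; whether the PA CLI computes its single-binary score exactly this way is an assumption, and the function names and the two-outcome example are illustrative rather than taken from the backtest.

```python
from typing import Mapping


def single_binary_brier(prob: float, occurred: bool) -> float:
    """Brier score for one outcome treated as a yes/no question."""
    return (prob - (1.0 if occurred else 0.0)) ** 2


def multiclass_brier(probs: Mapping[str, float], winner: str) -> float:
    """Multi-class Brier score: squared error summed over every outcome."""
    if winner not in probs:
        raise ValueError(f"winner {winner!r} is not among the forecast outcomes")
    return sum(
        (p - (1.0 if name == winner else 0.0)) ** 2
        for name, p in probs.items()
    )


# Illustrative forecast with a hypothetical resolution (outcome "A" wins).
forecast = {"A": 0.55, "B": 0.45}
binary_score = single_binary_brier(forecast["A"], occurred=True)
multi_score = multiclass_brier(forecast, winner="A")
```

For this two-outcome forecast the multi-class score is twice the single-binary score, so rows from the two metrics are not comparable without a label.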

Questions to expect

  • What is on the public root? Brand, status, links to the report PDF and dashboard. The exact prompt, model rationale, and ablation tables sit on this console.
  • Is the 26-event backtest enough? No. It is small, sports-heavy, and contains some post-resolution retrieval leakage (audited at 38.5%). Useful for direction and regression catching, not as proof of live performance.
  • What to check on the first PA call? Event schema, outcome count, total latency, parser path, warnings, and whether the trace matches what the dashboard demo predicted it would.
  • What is the no-touch list? Production variant, forecast model, longshot-floor formula, Railway env vars. Anything else can change after measurement.
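
The first-call checklist above can be turned into a small triage helper. This is a sketch under stated assumptions: the record keys (`probabilities`, `latency_ms`, `parse_path`, `warnings`) and the 8000 ms latency budget are hypothetical and are not the stored schema or an agreed threshold.

```python
from typing import Any, Dict, List


def first_call_issues(record: Dict[str, Any], max_latency_ms: int = 8000) -> List[str]:
    """List the checklist items that need a closer look for one prediction record."""
    issues: List[str] = []
    probs = record.get("probabilities") or {}
    if not probs:
        issues.append("no outcomes returned")
    for name, p in probs.items():
        if not 0.0 <= p <= 1.0:
            issues.append(f"probability out of range for {name!r}: {p}")
    # max_latency_ms is an assumed budget, adjustable by the caller.
    if record.get("latency_ms", 0) > max_latency_ms:
        issues.append("latency above budget")
    if record.get("parse_path") != "direct":
        issues.append("parse path is not 'direct'")
    if record.get("warnings"):
        issues.append("warnings present")
    return issues


# Values copied from two rows of the recent-predictions table.
clean = {
    "probabilities": {"Nugegoda Sports Welfare Club": 0.50, "Baduraliya CC": 0.50},
    "latency_ms": 2557,
    "parse_path": "direct",
    "warnings": "",
}
failed = {
    "probabilities": {"above $3.340": 0.05, "above $3.370": 0.05, "above $3.365": 0.05},
    "latency_ms": 10064,
    "parse_path": "",
    "warnings": "llm call failed: unparseable multi-outcome model output",
}
clean_issues = first_call_issues(clean)
failed_issues = first_call_issues(failed)
```

The helper deliberately does not require probabilities to sum to 1, since threshold markets such as "above $75" list outcomes that are not mutually exclusive.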

Public surface

Exact model names: private
Prompt/retrieval details: private
Raw predictions and traces: private
High-level system status: public
Health endpoint: public