Private operations
Observatory
Commit, prediction store, first-call state, and review links in one place. Use this view for screenshots after the first PA call lands.
Live state
Live commit: 69e9bff9
Production variant: multi_outcome_retrieval
Predictions in memory: 50
Persisted records: 385
Last PA call: Nugegoda Sports Welfare Club vs Baduraliya CC
Last total latency: 2557 ms
Server uptime: 19d 5h
First-call watch
PA activity observed. When a call lands, inspect event shape, outcome list, total latency, parser path, warnings, and returned probabilities before changing any production routing.
| Failure mode | Status | Operator response |
|---|---|---|
| Brave degraded | monitor via full check | Verify the fallback path before changing the model. |
| Metric drift | known caveat | Keep single-binary and proper multi-class labeled separately. |
| Commit drift | covered | Compare this page's commit with origin/main before deploy claims. |
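The commit-drift row above can be sketched as a small check. This is a minimal sketch, not the console's actual tooling: the function names `same_commit` and `origin_main_sha` are hypothetical, and it assumes the page shows an abbreviated hash (such as `69e9bff9`) that should be a prefix of the full hash at `origin/main`.

```python
import subprocess


def same_commit(page_sha: str, ref_sha: str) -> bool:
    """Return True when the shorter hash is a prefix of the longer one."""
    a, b = page_sha.strip().lower(), ref_sha.strip().lower()
    if not a or not b:
        # An empty string is a prefix of everything, so reject it explicitly.
        return False
    return a.startswith(b) or b.startswith(a)


def origin_main_sha(repo_dir: str = ".") -> str:
    """Fetch origin/main and return its full commit hash (hypothetical helper)."""
    subprocess.run(["git", "fetch", "origin", "main"], cwd=repo_dir, check=True)
    out = subprocess.run(
        ["git", "rev-parse", "origin/main"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return out.stdout.strip()
```

Calling `same_commit("69e9bff9", origin_main_sha())` from inside a checkout would then say whether the page's commit still matches `origin/main` before any deploy claim is made.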
Recent persisted predictions
| Market | Event | Probabilities | Latency | Parse path | Warnings |
|---|---|---|---|---|---|
| KXT20MATCH-26MAY310430BADNUG (2026-06-01T20:24:58) | Nugegoda Sports Welfare Club vs Baduraliya CC (Sports) | Nugegoda Sports Welfare Club: 0.50, Baduraliya CC: 0.50 | 2557 ms | direct | — |
| KXT20MATCH-26MAY310015ACESIN (2026-06-01T20:11:03) | Sinhalese Sports Club vs Ace Capital CC (Sports) | Ace Capital CC: 0.55, Sinhalese Sports Club: 0.45 | 3493 ms | direct | — |
| KXHIGHLAX-26JUN01 (2026-06-01T20:01:15) | Highest temperature in LA on Jun 1, 2026? (Climate and Weather) | 76° or above: 0.15, 74° to 75°: 0.17, 70° to 71°: 0.23 | 3438 ms | direct | — |
| KXHIGHNY-26JUN01 (2026-06-01T19:46:03) | Highest temperature in NYC on Jun 1, 2026? (Climate and Weather) | 69° or below: 0.13, 72° to 73°: 0.18, 70° to 71°: 0.16 | 4528 ms | direct | — |
| KXAAAGASD-26JUN02 (2026-06-01T19:36:20) | US gas prices on Jun 2, 2026 (Economics) | Above 4.295: 0.12, Above 4.305: 0.11, Above 4.290: 0.12 | 4134 ms | direct | — |
| KXSPOTIFYGLOBALD-26JUN01 (2026-06-01T19:30:46) | Top Global Song on Spotify on Jun 1, 2026? (Entertainment) | SWIM: 0.06, Beat It: 0.06, hate that i made you love me: 0.13 | 6052 ms | direct | — |
| KXSPOTIFYD-26JUN01 (2026-06-01T19:25:28) | Top USA Song on Spotify on Jun 1, 2026? (Entertainment) | Choosin' Texas: 0.05, hate that i made you love me: 0.12, Janice STFU: 0.06 | 4050 ms | direct | — |
| KXTRUEV-26JUN01 (2026-06-01T19:18:39) | EV commodity prices on Jun 1, 2026 (Economics) | Above 1257.28: 0.50, Above 1247.28: 0.55, Above 1277.28: 0.42 | 4498 ms | direct | — |
| KXSILVERD-26JUN0117 (2026-06-01T18:58:40) | Silver price on June 01, 2026 at 5:00 PM EDT? (Commodities) | above $75: 0.82, above $75.50: 0.70, above $76: 0.45 | 6770 ms | direct | — |
| KXNATGASD-26JUN0117 (2026-06-01T18:38:42) | Natural gas price on June 01, 2026 at 5:00 PM EDT? (Commodities) | above $3.340: 0.05, above $3.370: 0.05, above $3.365: 0.05 | 10064 ms | — | llm call failed: unparseable multi-outcome model output: {"probabilities": {"above $3.340": 0.25, "above $3.370": 0.22, "above $3.365": 0.23, "above $3.320": 0.28, "above $3.180 |
Experiment board
| Question | Status | Current answer | Next action |
|---|---|---|---|
| Why not GPT-5.5? | tested | GPT-5.5 was tried. All-event single-binary Brier is 0.0920 vs Opus 4.7 at 0.0378; its binary-only subset is strong, but that is not the full endpoint metric. | Keep as ablation, not production. |
| Why current model? | measured | Best verifiable PA CLI score on current sample; Opus 4.6 is close and slightly better under proper multi-class, so this is not a settled universal claim. | Re-evaluate after live PA calls. |
| Prompt ablation | in progress elsewhere | Production, no-anchor, no-scale, minimal, and meta-role variants should be tracked here once finalized. | Import final JSON, cost, and caveats. |
| Retrieval count | not settled | Current count is based on prior plateau guidance, not yet independently optimized on our data. | Run 0/3/5/8 with same model and prompt. |
| Scoring method | corrected | PA CLI single-binary and proper multi-class must both be labeled. Do not mix rows across metrics. | Keep public claims conservative. |
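To illustrate why the two metrics in the scoring row cannot share a table, here is a minimal sketch of both. It assumes "single-binary" means the squared error on one outcome's probability against its 0/1 result, and "proper multi-class" means the sum of squared errors over every outcome of the event; the PA CLI's exact definitions are not given here, so treat both functions and the example event as hypothetical.

```python
def binary_brier(prob: float, happened: bool) -> float:
    """Squared error for a single yes/no outcome."""
    return (prob - (1.0 if happened else 0.0)) ** 2


def multiclass_brier(probs: dict[str, float], winner: str) -> float:
    """Sum of squared errors across all outcomes of one event."""
    return sum(
        (p - (1.0 if name == winner else 0.0)) ** 2
        for name, p in probs.items()
    )


# Hypothetical three-outcome event in which outcome "A" resolved yes.
example = {"A": 0.5, "B": 0.3, "C": 0.2}
single = binary_brier(example["A"], True)   # only outcome A is scored
multi = multiclass_brier(example, "A")      # all three outcomes are scored
print(f"single-binary={single:.4f} multi-class={multi:.4f}")
```

The same forecast yields a larger number under the multi-class form because every outcome contributes an error term, so a row scored one way is not comparable with a row scored the other way.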
Questions to expect
- What is on the public root? Brand, status, links to the report PDF and dashboard. The exact prompt, model rationale, and ablation tables sit on this console.
- Is the 26-event backtest enough? No. It is small, sports-heavy, and contains some post-resolution retrieval leakage (audited at 38.5%). Useful for direction and regression catching, not as proof of live performance.
- What to check on the first PA call? Event schema, outcome count, total latency, parser path, warnings, and whether the trace matches what the dashboard demo predicted.
- What is the no-touch list? Production variant, forecast model, longshot-floor formula, Railway env vars. Anything else can change after measurement.
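The first-call checklist above could be turned into a quick triage helper. This is a sketch under stated assumptions: the record field names (`probabilities`, `latency_ms`, `parse_path`, `warnings`) and the 8000 ms latency threshold are hypothetical and not taken from the production schema. The two sample records reuse values shown in the recent-predictions table.

```python
from typing import Any


def first_call_findings(record: dict[str, Any], max_latency_ms: int = 8000) -> list[str]:
    """Return human-readable findings for one prediction record; empty means clean."""
    findings: list[str] = []
    probs = record.get("probabilities") or {}
    if len(probs) < 2:
        findings.append("fewer than two outcomes returned")
    for name, p in probs.items():
        if not 0.0 <= p <= 1.0:
            findings.append(f"probability out of range for {name!r}: {p}")
    latency = record.get("latency_ms", 0)
    if latency > max_latency_ms:  # threshold is an assumed value
        findings.append(f"latency {latency} ms exceeds {max_latency_ms} ms")
    if record.get("parse_path") != "direct":
        findings.append(f"parse path is {record.get('parse_path')!r}, expected 'direct'")
    for warning in record.get("warnings", []):
        findings.append(f"warning: {warning}")
    return findings


clean = {
    "probabilities": {"Nugegoda Sports Welfare Club": 0.50, "Baduraliya CC": 0.50},
    "latency_ms": 2557,
    "parse_path": "direct",
    "warnings": [],
}
failed = {
    "probabilities": {"above $3.340": 0.05, "above $3.370": 0.05, "above $3.365": 0.05},
    "latency_ms": 10064,
    "parse_path": None,
    "warnings": ["llm call failed: unparseable multi-outcome model output"],
}
for label, rec in (("clean", clean), ("failed", failed)):
    print(label, first_call_findings(rec))
```

The sketch deliberately omits a sum-to-one check, since threshold markets such as "above $75" and "above $75.50" are not mutually exclusive and the table appears to list only a subset of outcomes per event.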
Public surface
Exact model names: private
Prompt/retrieval details: private
Raw predictions and traces: private
High-level system status: public
Health endpoint: public