Judge review brief
Operator one-pager. Live system state, the evidence the agent is using, current caveats, and the first-PA-call checklist on a single screen.
Current proof
- Live commit: 69e9bff9
- Variant: multi_outcome_retrieval
- Prediction records: 50 memory / 385 disk
- PA status: PA activity observed
- Public surface: landing + report PDF
Demo script
- Open public root: brand, status, and link to the report.
- Open dashboard: live commit, variant, first-call status, and operator links.
- Run the pipeline demo and point to stage updates plus returned JSON.
- Open summary for the scored evidence and metric caveats.
- Open galleries to show per-event behavior instead of only aggregate claims.
First Prophet Arena call
- Check /predictions count and disk persistence.
- Inspect outcome count, parser path, warnings, and total latency.
- Compare event shape with the assumptions in the dashboard demo.
- Do not change the production variant without measured evidence.
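
The checklist above can be sketched as a small review helper. This is a minimal sketch, not the project's actual tooling: the record fields (`outcomes`, `parser_path`, `warnings`, `latency_s`), the `fallback` parser label, and the latency budget are hypothetical names chosen for illustration, and the real /predictions payload may look different.

```python
from typing import Any

# Hypothetical latency budget in seconds; tune it to the real endpoint.
LATENCY_BUDGET_S = 30.0


def review_first_call(record: dict[str, Any], memory_count: int, disk_count: int) -> list[str]:
    """Return human-readable findings for one prediction record.

    All field names here are assumptions, not the real schema.
    An empty list means nothing on the checklist was flagged.
    """
    findings: list[str] = []

    # Persistence: every record held in memory should also exist on disk.
    if disk_count < memory_count:
        findings.append(f"disk has {disk_count} records but memory has {memory_count}")

    # Outcome count: a prediction needs at least two outcomes to be meaningful.
    outcomes = record.get("outcomes", [])
    if len(outcomes) < 2:
        findings.append(f"unexpected outcome count: {len(outcomes)}")

    # Parser path: flag records that did not go through the primary parser.
    if record.get("parser_path") == "fallback":
        findings.append("parser used the fallback path")

    # Warnings: surface each one verbatim.
    for warning in record.get("warnings", []):
        findings.append(f"warning: {warning}")

    # Total latency: missing or over-budget values are both worth a look.
    latency = record.get("latency_s")
    if latency is None or latency > LATENCY_BUDGET_S:
        findings.append(f"latency out of budget: {latency}")

    return findings
```

Returning findings as a list, rather than raising on the first problem, matches the intent of the checklist: inspect everything about the first call before deciding whether anything needs to change.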
Likely questions
| Question | Answer to give | Evidence link |
|---|---|---|
| Why not GPT-5.5? | It was tested. The binary subset looked strong, but full endpoint behavior and schema compliance matter; the current report keeps it as an ablation rather than production. | summary report |
| Is the backtest enough? | No. It is small and partially vulnerable to resolved-event retrieval leakage. We treat it as regression evidence, not final proof of live performance. | comparison grid |
| What makes the system autonomous? | The endpoint accepts event JSON, retrieves evidence, produces structured probabilities, logs traces, streams operator state, and persists prediction records without manual scoring work. | dashboard |
| What is the biggest current risk? | First live PA payload shape and scoring behavior are still the decisive unknowns. The first-call checklist exists to inspect that before changing model or prompt. | observatory |
| Do we reveal too much publicly? | The public root is sparse. Exact model names, ablations, galleries, traces, and research reports are dashboard-auth gated. | public root |
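
The "structured probabilities" and schema-compliance answers in the table can be made concrete with a check like the one below. It is a minimal sketch under stated assumptions: the response shape (a mapping from outcome name to probability) is hypothetical, and the sum-to-one rule is optional because it only applies if the outcomes are mutually exclusive, which this brief does not state.

```python
import math


def check_probabilities(
    outcomes: list[str],
    probabilities: dict[str, float],
    require_sum_to_one: bool = False,
    tol: float = 1e-6,
) -> list[str]:
    """Return schema problems for a multi-outcome prediction.

    An empty list means the (assumed) schema is satisfied.
    """
    problems: list[str] = []

    # Every listed outcome needs a probability, and nothing unlisted may appear.
    missing = [o for o in outcomes if o not in probabilities]
    extra = [o for o in probabilities if o not in outcomes]
    if missing:
        problems.append(f"missing outcomes: {missing}")
    if extra:
        problems.append(f"unexpected outcomes: {extra}")

    # Each value must be a finite number in [0, 1].
    valid: list[float] = []
    for name, p in probabilities.items():
        if isinstance(p, (int, float)) and math.isfinite(p) and 0.0 <= p <= 1.0:
            valid.append(float(p))
        else:
            problems.append(f"probability for {name!r} is not in [0, 1]: {p!r}")

    # Optional: only meaningful when outcomes are mutually exclusive.
    if require_sum_to_one and abs(sum(valid) - 1.0) > tol:
        problems.append(f"probabilities sum to {sum(valid):.6f}, expected 1")

    return problems
```

Running a check of this kind on every response is one way to turn "schema compliance" from a claim into logged evidence alongside the prediction records.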