Judge review brief

Operator one-pager. Live system state, the evidence the agent is using, current caveats, and the first Prophet Arena (PA) call checklist on a single screen.

Current proof

Live commit: 69e9bff9
Variant: multi_outcome_retrieval
Prediction records: 50 memory / 385 disk
PA status: PA activity observed
Public surface: landing + report PDF

Demo script

  1. Open public root; brand, status, link to the report.
  2. Open dashboard; live commit, variant, first-call status, and operator links.
  3. Run the pipeline demo and point to stage updates plus returned JSON.
  4. Open summary for the scored evidence and metric caveats.
  5. Open galleries to show per-event behavior instead of only aggregate claims.

First Prophet Arena call

  • Check /predictions count and disk persistence.
  • Inspect outcome count, parser path, warnings, and total latency.
  • Compare event shape with the assumptions in the dashboard demo.
  • Do not change the production variant without measured evidence.
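
The checklist above could be partly automated along the following lines. This is a minimal sketch, not the system's actual tooling: the `PredictionRecord` shape, its field names, the `first_call_findings` helper, and the 60-second latency threshold are all hypothetical and would need to be mapped onto whatever the real /predictions records contain. It also assumes that the disk count should never fall below the in-memory count, which is consistent with the 50 memory / 385 disk figures above but is not confirmed by them.

```python
from dataclasses import dataclass, field


@dataclass
class PredictionRecord:
    # Hypothetical record shape; these field names are assumptions,
    # not the real prediction-record schema.
    event_id: str
    outcome_count: int
    parser_path: str
    warnings: list = field(default_factory=list)
    total_latency_s: float = 0.0


def first_call_findings(record, in_memory_count, on_disk_count,
                        expected_outcomes, expected_parser_path,
                        max_latency_s=60.0):
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []

    # Assumption: every in-memory record should also have been persisted.
    if on_disk_count < in_memory_count:
        findings.append(
            f"disk has {on_disk_count} records but memory has {in_memory_count}"
        )

    # Compare the event shape with what the dashboard demo assumed.
    if record.outcome_count != expected_outcomes:
        findings.append(
            f"outcome count {record.outcome_count} != expected {expected_outcomes}"
        )

    if record.parser_path != expected_parser_path:
        findings.append(
            f"parser path {record.parser_path!r} != expected {expected_parser_path!r}"
        )

    if record.warnings:
        findings.append(f"{len(record.warnings)} warning(s): {record.warnings}")

    # The threshold is an illustrative default, not a measured budget.
    if record.total_latency_s > max_latency_s:
        findings.append(
            f"latency {record.total_latency_s:.1f}s exceeds {max_latency_s:.1f}s"
        )

    return findings


if __name__ == "__main__":
    record = PredictionRecord("evt-1", 3, "structured", [], 12.5)
    print(first_call_findings(record, 50, 385, 3, "structured"))
```

The helper only reports findings and never changes configuration, in keeping with the last checklist item: any change to the production variant should follow measured evidence rather than a single flagged call.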

Likely questions

Question: Why not GPT-5.5?
Answer to give: It was tested. The binary subset looked strong, but full endpoint behavior and schema compliance matter; the current report keeps it as an ablation rather than production.
Evidence link: summary report

Question: Is the backtest enough?
Answer to give: No. It is small and partially vulnerable to resolved-event retrieval leakage. We treat it as regression evidence, not final proof of live performance.
Evidence link: comparison grid

Question: What makes the system autonomous?
Answer to give: The endpoint accepts event JSON, retrieves evidence, produces structured probabilities, logs traces, streams operator state, and persists prediction records without manual scoring work.
Evidence link: dashboard

Question: What is the biggest current risk?
Answer to give: First live PA payload shape and scoring behavior are still the decisive unknowns. The first-call checklist exists to inspect that before changing model or prompt.
Evidence link: observatory

Question: Do we reveal too much publicly?
Answer to give: The public root is sparse. Exact model names, ablations, galleries, traces, and research reports are dashboard-auth gated.
Evidence link: public root