Judge review brief

Operator one-pager. Live system state, the evidence the agent is using, current caveats, and the first-PA-call checklist on a single screen.

Current proof

Live commit: 69e9bff9

Variant: multi_outcome_retrieval

Prediction records: 50 memory / 385 disk

PA status: PA activity observed

Public surface: landing + report PDF

Question	Answer to give	Evidence link
Why not GPT-5.5?	It was tested. The binary subset looked strong, but full endpoint behavior and schema compliance also matter, so the current report keeps it as an ablation rather than the production model.	summary report
Is the backtest enough?	No. It is small and partially vulnerable to resolved-event retrieval leakage. We treat it as regression evidence, not final proof of live performance.	comparison grid
What makes the system autonomous?	The endpoint accepts event JSON, retrieves evidence, produces structured probabilities, logs traces, streams operator state, and persists prediction records without manual scoring work.	dashboard
What is the biggest current risk?	The payload shape and scoring behavior of the first live PA call are still the decisive unknowns. The first-call checklist exists to inspect both before changing the model or prompt.	observatory
Do we reveal too much publicly?	The public root is sparse. Exact model names, ablations, galleries, traces, and research reports are dashboard-auth gated.	public root
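
One way to make the payload-shape check in the table above concrete is to compare the key paths and JSON types of the first live event against an event the backtest already handles. The sketch below is illustrative only: `describe_shape`, `diff_shapes`, and every field name in the sample events (`event_id`, `question`, `outcomes`, `close_time`) are hypothetical and do not reflect the actual PA schema or the endpoint's code.

```python
import json
from typing import Any

# JSON type names for the Python types that json.loads can return.
JSON_TYPE_NAMES = {
    dict: "object", list: "array", str: "string", bool: "boolean",
    int: "number", float: "number", type(None): "null",
}


def describe_shape(value: Any, path: str = "$") -> dict[str, str]:
    """Map every key path in a parsed JSON value to its JSON type name."""
    shape = {path: JSON_TYPE_NAMES[type(value)]}
    if isinstance(value, dict):
        for key, child in value.items():
            shape.update(describe_shape(child, f"{path}.{key}"))
    elif isinstance(value, list) and value:
        # Only the first element is sampled, so mixed-type arrays are not detected.
        shape.update(describe_shape(value[0], f"{path}[]"))
    return shape


def diff_shapes(expected: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """List the paths that are missing, unexpected, or changed in type."""
    shared = expected.keys() & observed.keys()
    return {
        "missing": sorted(expected.keys() - observed.keys()),
        "unexpected": sorted(observed.keys() - expected.keys()),
        "type_changed": sorted(p for p in shared if expected[p] != observed[p]),
    }


# Hypothetical events: field names are assumptions for illustration only.
backtest_event = json.loads(
    '{"event_id": "e1", "question": "q", "outcomes": ["Yes", "No"]}'
)
live_event = json.loads(
    '{"event_id": 17, "question": "q", "outcomes": ["Yes", "No"], "close_time": null}'
)

report = diff_shapes(describe_shape(backtest_event), describe_shape(live_event))
print(json.dumps(report, indent=2))
```

With these sample events the report flags `$.close_time` as unexpected and `$.event_id` as changed in type. An empty report would suggest the live payload matches what the backtest exercised; a non-empty one points at parsing to review before touching the model or prompt.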