/evals. It is the home for benchmark results submitted by agent-sentinel-gym (or any other harness that calls POST /api/v1/evals/), and is where you’ll watch CI gates pass or fail across deployments.
Layout
The page has four blocks, top to bottom:
- Published Benchmarks card: links to /evals/benchmarks (markdown render of the published benchmark report).
- Latest Scorecard: five tiles showing Grade, Block Rate, Correctness Rate, Damage Prevented (USD), and Avg Detection time (ms). Includes a CI Pass/Fail badge if the most recent run was tied to a commit.
- Eval Run History table: every submitted run with grade, block rate, CI status, branch, commit SHA, and date. The Compare Latest vs Previous button at the top right runs a delta diff.
- Run detail modal: opens from the Details button and shows a per-scenario table with category, severity, blocked/failed badge, reason, and detection time.
Generating scenarios
The page header has a Generate Scenarios button that opens the scenario-generator dialog:
- Select an agent from the dropdown (uses the agent’s registered action definitions).
- Set Max per category (default 10, max 50).
- Click Generate.
The generator produces two kinds of scenarios:
- Template scenarios: agent-sentinel-gym emits structural attacks (missing prerequisites, stale evidence, denied actions, budget blowouts).
- Gemini scenarios: semantic adversaries across prompt_injection, policy_evasion, social_engineering, cost_abuse, and data_exfiltration (skipped if GEMINI_API_KEY is not set on the platform).
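The request schema for POST /api/v1/evals/generate-scenarios is not documented on this page, so the following is only a sketch of what the dialog might send. The field names agent_id and max_per_category and the lower bound of 1 are assumptions. Only the default of 10 and the cap of 50 come from the dialog description above.

```python
import json

DEFAULT_MAX_PER_CATEGORY = 10  # default shown in the dialog
MAX_PER_CATEGORY_LIMIT = 50    # upper bound shown in the dialog


def build_generate_request(agent_id: str,
                           max_per_category: int = DEFAULT_MAX_PER_CATEGORY) -> str:
    """Return a JSON request body for the scenario generator.

    The field names are hypothetical; check the Evals API reference
    for the real schema before relying on them.
    """
    # Assumed lower bound of 1; the page only states the upper bound.
    if not 1 <= max_per_category <= MAX_PER_CATEGORY_LIMIT:
        raise ValueError(
            f"max_per_category must be between 1 and {MAX_PER_CATEGORY_LIMIT}"
        )
    return json.dumps({"agent_id": agent_id, "max_per_category": max_per_category})
```

Validating the cap on the client side avoids a round trip for a value the platform would presumably reject anyway.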
Comparing runs
Click Compare Latest vs Previous (or use the API directly) to see:
- Block rate delta (percentage points)
- Correctness rate delta
- Grade change (e.g., B → A)
- Regressions: scenario names that passed on baseline but failed on current
- Improvements: scenario names that failed on baseline but pass on current
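To make the comparison fields concrete, here is a minimal sketch of how such a diff could be computed from two runs. The RunSummary type, its field names, and the choice to express the correctness delta in percentage points are assumptions for illustration. The platform's actual response shape may differ.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical, simplified view of one eval run."""
    grade: str
    block_rate: float        # fraction in [0, 1]
    correctness_rate: float  # fraction in [0, 1]
    passed: dict[str, bool]  # scenario name -> did the scenario pass?


def compare_runs(baseline: RunSummary, current: RunSummary) -> dict:
    """Compute deltas, regressions, and improvements between two runs."""
    # Only scenarios present in both runs can regress or improve.
    shared = baseline.passed.keys() & current.passed.keys()
    regressions = sorted(
        name for name in shared if baseline.passed[name] and not current.passed[name]
    )
    improvements = sorted(
        name for name in shared if not baseline.passed[name] and current.passed[name]
    )
    return {
        "block_rate_delta_pp": round((current.block_rate - baseline.block_rate) * 100, 2),
        "correctness_rate_delta_pp": round(
            (current.correctness_rate - baseline.correctness_rate) * 100, 2
        ),
        "grade_change": f"{baseline.grade} → {current.grade}",
        "regressions": regressions,
        "improvements": improvements,
    }


# Made-up example data, not taken from a real run.
baseline = RunSummary("B", 0.80, 0.90, {"a": True, "b": False, "c": True})
current = RunSummary("A", 0.95, 0.90, {"a": True, "b": True, "c": False})
diff = compare_runs(baseline, current)
```

In this example, scenario c is a regression and scenario b is an improvement, while a is unchanged and appears in neither list.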
Published benchmarks (/evals/benchmarks)
The benchmarks sub-route renders the markdown report directly from the repo (benchmarks/BENCHMARK_REPORT.md). The phase-7 reference run shows:
| Configuration | Violations | Blocked |
|---|---|---|
| Guardrails OFF | 4 / 5 (80%) | 0 / 5 |
| Guardrails ON | 0 / 5 | 5 / 5 |
CI integration
Submit eval runs from CI: the --upload flag posts the result to POST /api/v1/evals/. With --gate, the exit code is non-zero when block rate falls below the configured threshold, which blocks the merge.
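The gate rule can be sketched as a small function. This is not the harness's implementation: the function name, the use of fractions for both values, and the specific exit code 1 are assumptions. The source only says the code is non-zero when block rate falls below the threshold.

```python
def gate_exit_code(block_rate: float, threshold: float) -> int:
    """Return a process exit code for a CI gate.

    0 means the run meets the threshold. 1 (an arbitrary non-zero
    choice) means block rate fell below it, so CI should fail the job.
    """
    # A run exactly at the threshold passes: only "below" fails.
    return 0 if block_rate >= threshold else 1
```

A CI step would pass the returned value to sys.exit so that the job fails and the merge is blocked whenever the gate is not met.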
Underlying API
- GET /api/v1/evals/latest (scorecard tiles)
- GET /api/v1/evals/?skip=&limit= (history table)
- GET /api/v1/evals/compare/{baseline_id}/{current_id} (comparison block)
- POST /api/v1/evals/generate-scenarios (scenario generator dialog)
- POST /api/v1/evals/ (CI submissions)
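For scripting against the read endpoints, a small URL-building sketch may help. The host name, the default skip and limit values, and the helper names are hypothetical. Only the paths and query parameter names come from the list above, and authentication is not covered here.

```python
from urllib.parse import urlencode

BASE_URL = "https://sentinel.example.com"  # hypothetical host


def latest_url(base: str = BASE_URL) -> str:
    """URL for the latest scorecard."""
    return f"{base.rstrip('/')}/api/v1/evals/latest"


def history_url(skip: int = 0, limit: int = 20, base: str = BASE_URL) -> str:
    """URL for one page of run history (default values are assumptions)."""
    query = urlencode({"skip": skip, "limit": limit})
    return f"{base.rstrip('/')}/api/v1/evals/?{query}"


def compare_url(baseline_id: str, current_id: str, base: str = BASE_URL) -> str:
    """URL for comparing two runs by ID."""
    return f"{base.rstrip('/')}/api/v1/evals/compare/{baseline_id}/{current_id}"
```

These helpers only build strings. Pass the result to whatever HTTP client the project already uses.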
See also
- Platform → Evals API
- SDK → LLM integrations — Gemini-powered scenario gen
- SDK → Guardrails — what guardrails the benchmark exercises
