Evals - Agent Sentinel

The Evals route lives at /evals. It is the home for benchmark results submitted by agent-sentinel-gym (or any other harness that calls POST /api/v1/evals/), and is where you’ll watch CI gates pass or fail across deployments.

Layout

The page has four blocks, top to bottom:

Published Benchmarks card — links to /evals/benchmarks (markdown render of the published benchmark report).
Latest Scorecard — five tiles: Grade, Block Rate, Correctness Rate, Damage Prevented (USD), Avg Detection time (ms). Includes a CI Pass/Fail badge if the most recent run was tied to a commit.
Eval Run History table — every submitted run with grade, block rate, CI status, branch, commit SHA, and date. Compare Latest vs Previous button at the top right runs a delta diff.
Run detail modal — opens from the Details button: per-scenario table with category, severity, blocked/failed badge, reason, and detection time.

Generating scenarios

The page header has a Generate Scenarios button that opens the scenario-generator dialog:

Select an agent from the dropdown (uses the agent’s registered action definitions).
Set Max per category (default 10, max 50).
Click Generate.

The platform runs two passes:

Template scenarios — agent-sentinel-gym emits structural attacks: missing prerequisites, stale evidence, denied actions, budget blowouts.
Gemini scenarios — semantic adversaries across prompt_injection, policy_evasion, social_engineering, cost_abuse, data_exfiltration (skipped if GEMINI_API_KEY is not set on the platform).

Results show as scenario cards with category badges and a per-category violation breakdown. Cards include the generated source so you can paste them straight into your test harness.
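
The dialog settings above can also be sketched as a request body for POST /api/v1/evals/generate-scenarios. This is a minimal sketch, not the platform's documented schema: the field names `agent_id` and `max_per_category` and the lower bound of 1 are assumptions; only the default of 10 and the maximum of 50 come from the text above.

```python
import json

DEFAULT_MAX_PER_CATEGORY = 10  # dialog default, from the steps above
MAX_PER_CATEGORY_LIMIT = 50    # dialog maximum, from the steps above


def build_generate_request(agent_id: str,
                           max_per_category: int = DEFAULT_MAX_PER_CATEGORY) -> dict:
    """Build a hypothetical JSON body for the scenario generator endpoint."""
    # The lower bound of 1 is an assumption; the page only states the maximum.
    if not 1 <= max_per_category <= MAX_PER_CATEGORY_LIMIT:
        raise ValueError(
            f"max_per_category must be between 1 and {MAX_PER_CATEGORY_LIMIT}"
        )
    # Field names are illustrative assumptions, not the documented schema.
    return {"agent_id": agent_id, "max_per_category": max_per_category}


# Example: serialize the body before sending it with an HTTP client.
body = json.dumps(build_generate_request("agent-123", 25))
```

Validating the cap on the client side avoids a round trip for a request the dialog itself would reject.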

Comparing runs

Click Compare Latest vs Previous (or use the API directly) to see:

Block rate delta (percentage points)
Correctness rate delta
Grade change (e.g., B → A)
Regressions — scenario names that passed on baseline but failed on current
Improvements — scenario names that failed on baseline but pass on current

CI integrations should fail the build whenever the Regressions list is non-empty.
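
The comparison fields listed above can be derived from two runs roughly as follows. This is a sketch under assumptions: the `RunSummary` shape and the result keys are hypothetical, rates are taken to be fractions between 0 and 1, and the correctness delta is reported in percentage points like the block rate delta. The platform's own comparison endpoint may compute these differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunSummary:
    """Hypothetical summary of one eval run."""
    block_rate: float                 # fraction in [0, 1]
    correctness_rate: float           # fraction in [0, 1]
    grade: str                        # e.g. "A", "B"
    scenario_passed: Dict[str, bool]  # scenario name -> passed?


def compare_runs(baseline: RunSummary, current: RunSummary) -> dict:
    """Compute deltas, regressions, and improvements between two runs."""
    # Only scenarios present in both runs can regress or improve.
    shared = baseline.scenario_passed.keys() & current.scenario_passed.keys()
    regressions = sorted(
        n for n in shared
        if baseline.scenario_passed[n] and not current.scenario_passed[n]
    )
    improvements = sorted(
        n for n in shared
        if not baseline.scenario_passed[n] and current.scenario_passed[n]
    )
    return {
        "block_rate_delta_pp": round(
            (current.block_rate - baseline.block_rate) * 100, 2),
        "correctness_rate_delta_pp": round(
            (current.correctness_rate - baseline.correctness_rate) * 100, 2),
        "grade_change": f"{baseline.grade} -> {current.grade}",
        "regressions": regressions,
        "improvements": improvements,
    }
```

A CI step could then fail the build when `compare_runs(...)["regressions"]` is non-empty.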

Published benchmarks (`/evals/benchmarks`)

The benchmarks sub-route renders the markdown report directly from the repo (benchmarks/BENCHMARK_REPORT.md). The phase-7 reference run shows:

Configuration	Violations	Blocked
Guardrails OFF	4 / 5 (80%)	0 / 5
Guardrails ON	0 / 5	5 / 5

Use this as a public benchmark to compare your own scorecards against.

CI integration

Submit eval runs from CI:

python -m badcomputeruse gate \
  --policy policies/production_defense.yaml \
  --upload \
  --platform-url "$AGENTSENTINEL_PLATFORM_URL" \
  --api-key "$AGENTSENTINEL_API_KEY"

The --upload flag posts the result to POST /api/v1/evals/. The gate command exits with a non-zero code when the block rate falls below the configured threshold, which blocks the merge.
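
The threshold check described above amounts to a simple comparison. The sketch below is illustrative only and is not the actual badcomputeruse implementation: the function name is hypothetical, rates are assumed to be fractions between 0 and 1, and the exit code 1 is an assumption (the text only says non-zero).

```python
def gate_exit_code(block_rate: float, threshold: float) -> int:
    """Return 0 when the gate passes, 1 when block rate is below threshold."""
    if not (0.0 <= block_rate <= 1.0 and 0.0 <= threshold <= 1.0):
        raise ValueError("block_rate and threshold must be between 0 and 1")
    # A run that exactly meets the threshold passes; only a lower rate fails.
    return 0 if block_rate >= threshold else 1


# In a CI wrapper script this would typically end with:
#     sys.exit(gate_exit_code(observed_block_rate, configured_threshold))
```

Most CI systems treat any non-zero exit code from a step as a failure, which is what stops the merge.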

Underlying API

GET /api/v1/evals/latest — scorecard tiles
GET /api/v1/evals/?skip=&limit= — history table
GET /api/v1/evals/compare/{baseline_id}/{current_id} — comparison block
POST /api/v1/evals/generate-scenarios — scenario generator dialog
POST /api/v1/evals/ — CI submissions

Full reference: Platform → Evals API.
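
For scripts that read the same data as the page, the endpoint paths listed above can be assembled like this. This is a sketch using only the Python standard library: the helper names and the default `limit` of 20 are assumptions, and authentication is not described on this page, so the caller passes whatever headers the platform requires.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def latest_url(base_url: str) -> str:
    """URL for the scorecard tiles (GET /api/v1/evals/latest)."""
    return f"{base_url.rstrip('/')}/api/v1/evals/latest"


def history_url(base_url: str, skip: int = 0, limit: int = 20) -> str:
    """URL for the history table; the default limit is an assumption."""
    query = urlencode({"skip": skip, "limit": limit})
    return f"{base_url.rstrip('/')}/api/v1/evals/?{query}"


def compare_url(base_url: str, baseline_id: str, current_id: str) -> str:
    """URL for the comparison block between two run IDs."""
    base = base_url.rstrip("/")
    # Percent-encode the IDs so unusual characters cannot alter the path.
    return (f"{base}/api/v1/evals/compare/"
            f"{quote(str(baseline_id), safe='')}/{quote(str(current_id), safe='')}")


def fetch_json(url: str, headers: Optional[dict] = None):
    """GET a URL and decode the JSON body; auth headers are caller-supplied."""
    request = Request(url, headers=headers or {})
    with urlopen(request, timeout=10) as response:
        return json.load(response)
```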
