/evals and /evals/benchmarks console routes. They are also the integration point for CI pipelines that run agent-safety benchmarks (badcomputeruse, custom suites) and want their grades surfaced in the org’s scorecard.
Endpoints
All endpoints live under /api/v1/evals and are scoped to the caller’s active organization.
Submit an eval run
EvalRunPublic row including id, created_at, and organization_id.
List eval runs
Latest run
badcomputeruse). The console scorecard reads this endpoint.
Get a run
Compare two runs
Generate adversarial scenarios
Either agent_id (pulls action definitions from the platform’s action registry) or inline tool_definitions is required. The generator runs two passes:
- Template: agent-sentinel-gym emits structural attacks (missing prerequisites, stale evidence, denied actions, budget blowouts).
- Gemini (when use_llm: true): produces semantic adversaries across prompt_injection, policy_evasion, social_engineering, cost_abuse, data_exfiltration.
Returns 501 Not Implemented if the platform was deployed without agent-sentinel-gym. Returns 404 if agent_id has no registered action definitions.
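As a rough illustration of the rules above, a client could validate the request body before sending it and translate the two documented error codes into readable messages. This is a minimal sketch, not the platform's SDK: the helper names build_scenario_request and explain_status are hypothetical, the body shape beyond the agent_id, tool_definitions, and use_llm fields is an assumption, and whether the API accepts both agent_id and tool_definitions together is not stated here.

```python
from typing import Any, Dict, List, Optional


def build_scenario_request(
    agent_id: Optional[str] = None,
    tool_definitions: Optional[List[Dict[str, Any]]] = None,
    use_llm: bool = False,
) -> Dict[str, Any]:
    """Build a JSON-serializable request body (assumed shape, hypothetical helper)."""
    # The docs require agent_id or inline tool_definitions; fail early if both are absent.
    if agent_id is None and not tool_definitions:
        raise ValueError("provide agent_id or inline tool_definitions")
    body: Dict[str, Any] = {"use_llm": use_llm}
    if agent_id is not None:
        body["agent_id"] = agent_id
    if tool_definitions:
        body["tool_definitions"] = tool_definitions
    return body


def explain_status(status: int) -> str:
    """Map the documented error codes to short, human-readable messages."""
    if status == 501:
        return "platform was deployed without agent-sentinel-gym"
    if status == 404:
        return "agent_id has no registered action definitions"
    if 200 <= status < 300:
        return "ok"
    return f"unexpected status {status}"
```

For example, build_scenario_request(agent_id="agent-123", use_llm=True) yields a body with both generator passes enabled, while calling it with no arguments raises ValueError before any network call is made.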
Published benchmark report
The phase-7 reference benchmark is rendered in the console at /evals/benchmarks. It runs five end-to-end scenarios (PII, budget, ungrounded refund, missing evidence, disallowed content) against gemini-2.5-flash with guardrails on/off. Headline result:
| Configuration | Violations | Blocked |
|---|---|---|
| Guardrails OFF | 4 / 5 (80%) | 0 / 5 |
| Guardrails ON | 0 / 5 | 5 / 5 |
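The percentages in the table follow directly from the per-configuration counts. The sketch below recomputes them from the counts shown above; the rate helper and the dictionary layout are illustrative assumptions, not part of the platform's API.

```python
def rate(count: int, total: int) -> float:
    """Return count / total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


# Counts copied from the headline table (five scenarios per configuration).
results = {
    "Guardrails OFF": {"violations": 4, "blocked": 0, "scenarios": 5},
    "Guardrails ON": {"violations": 0, "blocked": 5, "scenarios": 5},
}

for name, r in results.items():
    violations = rate(r["violations"], r["scenarios"])
    blocked = rate(r["blocked"], r["scenarios"])
    print(f"{name}: {violations:.0f}% violations, {blocked:.0f}% blocked")
```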
/evals/benchmarks.
See also
- Console → Evals — scorecard, scenario generator, benchmark report
- SDK → LLM integrations — Gemini scenario gen
