The eval endpoints back the /evals and /evals/benchmarks console routes. They are also the integration point for CI pipelines that run agent-safety benchmarks (badcomputeruse, custom suites) and want their grades surfaced in the org’s scorecard.

Endpoints

All endpoints live under /api/v1/evals and are scoped to the caller’s active organization.

Submit an eval run

POST /api/v1/evals/
Content-Type: application/json

{
  "agent_id": "support-bot-v3",
  "benchmark_name": "badcomputeruse",
  "policy_name": "production-strict",
  "results": [
    { "attack_name": "S1_pii_exfil",       "blocked": true },
    { "attack_name": "S2_budget_blowout",  "blocked": true },
    { "attack_name": "S3_ungrounded_refund", "blocked": true }
  ],
  "grade": "A",
  "block_rate": 1.0,
  "correctness_block_rate": 1.0,
  "ci_branch": "main",
  "ci_commit": "abc1234"
}
Returns the persisted EvalRunPublic row including id, created_at, and organization_id.
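A CI job might assemble the request body along these lines. This is a minimal sketch, not the platform's client: `BASE_URL`, the bearer-token header, and both helper names are assumptions for illustration, and `block_rate` is assumed to be the fraction of attacks that were blocked. The grade and `correctness_block_rate` are passed in by the caller because this page does not define how they are computed.

```python
import json
import urllib.request

BASE_URL = "https://platform.example.com"  # hypothetical host, replace with your deployment


def build_eval_payload(agent_id, benchmark_name, policy_name, results, grade,
                       correctness_block_rate, ci_branch=None, ci_commit=None):
    """Assemble the JSON body for POST /api/v1/evals/."""
    if not results:
        raise ValueError("results must contain at least one attack outcome")
    blocked = sum(1 for r in results if r["blocked"])
    payload = {
        "agent_id": agent_id,
        "benchmark_name": benchmark_name,
        "policy_name": policy_name,
        "results": results,
        "grade": grade,
        # Assumption: block_rate is blocked attacks divided by total attacks.
        "block_rate": blocked / len(results),
        "correctness_block_rate": correctness_block_rate,
    }
    # CI metadata is only attached when the caller supplies it.
    if ci_branch:
        payload["ci_branch"] = ci_branch
    if ci_commit:
        payload["ci_commit"] = ci_commit
    return payload


def build_submit_request(payload, token):
    """Build (but do not send) the POST request; the auth scheme is an assumption."""
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/evals/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # hypothetical auth header
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it; the response body should then be the persisted run described above.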

List eval runs

GET /api/v1/evals/?skip=0&limit=50&agent_id=&benchmark_name=&policy_name=
Returns runs newest-first, paginated with skip and limit, and optionally filtered by agent_id, benchmark_name, or policy_name.
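The query string can be built programmatically, as in the sketch below. The helper name `list_evals_url` and the choice to omit unset filters (rather than send them empty, as in the example URL) are assumptions; this page does not say how the server treats empty filter values.

```python
from urllib.parse import urlencode


def list_evals_url(base_url, skip=0, limit=50, agent_id=None,
                   benchmark_name=None, policy_name=None):
    """Build the GET /api/v1/evals/ URL, leaving out filters that are not set."""
    params = {"skip": skip, "limit": limit}
    filters = {
        "agent_id": agent_id,
        "benchmark_name": benchmark_name,
        "policy_name": policy_name,
    }
    # Only include filters the caller actually provided.
    params.update({key: value for key, value in filters.items() if value})
    return f"{base_url.rstrip('/')}/api/v1/evals/?{urlencode(params)}"
```

To page through results, call the helper again with `skip` increased by `limit` until a page comes back with fewer than `limit` rows.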

Latest run

GET /api/v1/evals/latest?benchmark_name=badcomputeruse&agent_id=
Returns the most recent eval run for the given benchmark (default badcomputeruse). The console scorecard reads this endpoint.

Get a run

GET /api/v1/evals/{eval_id}

Compare two runs

GET /api/v1/evals/compare/{baseline_id}/{current_id}
{
  "baseline":  { /* EvalRunPublic */ },
  "current":   { /* EvalRunPublic */ },
  "block_rate_delta":      0.10,
  "correctness_rate_delta": 0.05,
  "grade_changed":          true,
  "new_failures": ["S7_chained_evasion"],
  "new_passes":   ["S2_budget_blowout"]
}
Used by the console regression view to surface “new failures” against the previous baseline.
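To make the response fields concrete, the sketch below reproduces the comparison locally from two run payloads. It is an illustration under stated assumptions, not the server's implementation: deltas are taken as current minus baseline, a "new failure" is an attack that is unblocked in the current run and was not already unblocked in the baseline, and a "new pass" is an attack that was unblocked in the baseline and is blocked now.

```python
def compare_runs(baseline, current):
    """Approximate the compare response from two EvalRunPublic-shaped dicts."""
    base = {r["attack_name"]: r["blocked"] for r in baseline["results"]}
    curr = {r["attack_name"]: r["blocked"] for r in current["results"]}

    # Assumption: an attack absent from the baseline that fails now counts as new.
    new_failures = sorted(
        name for name, blocked in curr.items()
        if not blocked and base.get(name) is not False
    )
    # Assumption: a new pass requires an explicit failure in the baseline.
    new_passes = sorted(
        name for name, blocked in curr.items()
        if blocked and base.get(name) is False
    )
    return {
        "block_rate_delta": round(current["block_rate"] - baseline["block_rate"], 4),
        "correctness_rate_delta": round(
            current["correctness_block_rate"] - baseline["correctness_block_rate"], 4
        ),
        "grade_changed": current["grade"] != baseline["grade"],
        "new_failures": new_failures,
        "new_passes": new_passes,
    }
```

A CI step could fail the build whenever `new_failures` is non-empty, which mirrors what the regression view highlights.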

Generate adversarial scenarios

POST /api/v1/evals/generate-scenarios
Content-Type: application/json

{
  "agent_id": "support-bot-v3",
  "max_per_category": 10,
  "use_llm": true
}
Either agent_id (pulls action definitions from the platform’s action registry) or inline tool_definitions is required. The generator runs two passes:
  1. Template: agent-sentinel-gym emits structural attacks such as missing prerequisites, stale evidence, denied actions, and budget blowouts.
  2. Gemini (when use_llm: true): produces semantic adversaries across the categories prompt_injection, policy_evasion, social_engineering, cost_abuse, and data_exfiltration.
{
  "scenarios": [
    { "name": "S1_email_exfil", "category": "data_exfiltration", "source": "..." },
    ...
  ],
  "total_count": 47,
  "violation_breakdown": {
    "prompt_injection": 10,
    "policy_evasion": 10,
    "social_engineering": 8,
    "cost_abuse": 9,
    "data_exfiltration": 10
  }
}
Returns 501 Not Implemented if the platform was deployed without agent-sentinel-gym. Returns 404 if agent_id has no registered action definitions.
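A client can enforce the "agent_id or tool_definitions" rule before calling the endpoint and tally the returned scenarios by category. The sketch below is illustrative only: the helper names are hypothetical, `tool_definitions` is passed through as an opaque value because its schema is not described here, and the error messages simply restate the two status codes documented above.

```python
from collections import Counter


def build_generate_request(agent_id=None, tool_definitions=None,
                           max_per_category=10, use_llm=True):
    """Build the body for POST /api/v1/evals/generate-scenarios."""
    if not agent_id and not tool_definitions:
        raise ValueError("either agent_id or tool_definitions is required")
    body = {"max_per_category": max_per_category, "use_llm": use_llm}
    if agent_id:
        body["agent_id"] = agent_id
    if tool_definitions:
        body["tool_definitions"] = tool_definitions  # schema not specified here
    return body


def count_by_category(scenarios):
    """Tally scenarios per category, mirroring the violation_breakdown field."""
    return dict(Counter(s["category"] for s in scenarios))


def describe_error(status):
    """Map the documented error statuses to a short explanation."""
    messages = {
        501: "platform was deployed without agent-sentinel-gym",
        404: "agent_id has no registered action definitions",
    }
    return messages.get(status, f"unexpected status {status}")
```

Comparing `count_by_category(response["scenarios"])` with the returned `violation_breakdown` is a cheap sanity check on a generated suite.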

Published benchmark report

The phase-7 reference benchmark is rendered in the console at /evals/benchmarks. It runs five end-to-end scenarios (PII, budget, ungrounded refund, missing evidence, disallowed content) against gemini-2.5-flash, once with guardrails off and once with guardrails on. Headline result:
| Configuration  | Violations  | Blocked |
| -------------- | ----------- | ------- |
| Guardrails OFF | 4 / 5 (80%) | 0 / 5   |
| Guardrails ON  | 0 / 5       | 5 / 5   |
The console renders the same markdown at /evals/benchmarks.

See also