Evaluation

Replay recorded eval sets against any agent and score tool trajectories and final responses with pluggable metrics.

The eval feature replays recorded conversations — eval sets — against any BaseAgent and scores the results. Eval sets are plain, portable JSON files, wire-compatible with Python ADK's adk eval output: files written by the Python tooling load unmodified, including FunctionCall-shaped tool_uses with extra fields. Record a conversation once and replay it against any agent build.

Eval-set format

An EvalSet is a list of EvalCases; each case is a conversation of Invocations, each holding one user prompt, the expected final response, and the expected intermediate data (tool calls and intermediate model turns). All types use plain serde derives, so a file is just JSON. IDs are serialized under their Python ADK names, eval_set_id and eval_id, and the older adk-rs id key is still accepted on read as a legacy alias:

hello_world.evalset.json
{
  "eval_set_id": "hello-world-set",
  "name": "Hello world",
  "creation_timestamp": 0.0,
  "eval_cases": [
    {
      "eval_id": "case-1",
      "name": "greeting with a tool call",
      "session_input": null,
      "conversation": [
        {
          "invocation_id": "inv-1",
          "user_content": {
            "role": "user",
            "parts": [{ "text": "What's the weather in Paris?" }]
          },
          "final_response": {
            "role": "model",
            "parts": [{ "text": "It is sunny in Paris, 22 degrees." }]
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "get_weather", "args": { "city": "Paris" } }
            ],
            "intermediate_responses": []
          }
        }
      ]
    }
  ]
}
| Type | Fields |
| --- | --- |
| EvalSet | id (serializes as eval_set_id), name (default empty), description: Option<String>, eval_cases, creation_timestamp (seconds, default 0). |
| EvalCase | id (serializes as eval_id), conversation: Vec<Invocation>, session_input: Option<SessionInput>, name: Option<String>, creation_timestamp. |
| Invocation | user_content: Content, final_response: Option<Content>, intermediate_data (default empty), invocation_id (default empty), creation_timestamp. |
| IntermediateData | tool_uses: Vec<ToolUse>, intermediate_responses: Vec<(String, Vec<Part>)> holding (author, parts) pairs, byte-compatible with Python. |
| SessionInput | app_name, user_id, state (initial session state map). |
| ToolUse | name, args (JSON value). Python stores full FunctionCall objects here; extra fields (e.g. id) are ignored on read. |

Loading

fn load_eval_set_from_str(s: &str) -> Result<EvalSet>
Parse from a JSON string.
async fn load_eval_set_from_file(path: impl AsRef<Path>) -> Result<EvalSet>
Read and parse a JSON file with tokio::fs.

Both are thin wrappers over serde — serde_json::from_slice::<EvalSet>(&bytes) works just as well.

EvalRunner

fn new(agent: Arc<dyn BaseAgent>, app_name: impl Into<String>, user_id: impl Into<String>, evaluators: Vec<Arc<dyn Evaluator>>) -> EvalRunner
Construct with the agent under test, an app name, a user id, and the metric set.
async fn run_set(&self, set: &EvalSet) -> Result<EvalReport>
Run every case in order and return an EvalReport { results: Vec<EvalResult> }.
async fn run_case(&self, set_id: &str, case: &EvalCase) -> Result<EvalResult>
Run one case. Each case gets a fresh in-memory session; the conversation’s user messages replay in order against the same session.

For each invocation the runner drives the agent directly (no Runner), collects every FunctionCall part into actual tool_uses, records non-final content as (author, parts) pairs in intermediate_responses, and takes the event where is_final_response() holds as the actual final_response. Each evaluator scores every invocation; the reported per-evaluator score is the average across the case’s invocations, with the per-invocation scores riding in details.per_invocation. A metric’s status is the AND of its per-invocation statuses (Error dominating), and any non-PASSED metric flips the case’s overall_status to FAILED.
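
The aggregation rules above can be sketched in a few lines of std-only Rust. This is an illustration of the described behaviour, not the crate's code: Status, InvocationScore, aggregate, and overall are hypothetical names, and the handling of a case with zero invocations is an assumption.

```rust
/// Per-invocation outcome, mirroring the PASSED / FAILED / ERROR statuses.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    Passed,
    Failed,
    Error,
}

/// One evaluator's verdict on a single invocation (hypothetical shape).
#[derive(Clone, Copy, Debug)]
struct InvocationScore {
    score: f64,
    status: Status,
}

/// Fold per-invocation verdicts into one per-evaluator verdict:
/// the score is the mean, the status is the AND with Error dominating.
fn aggregate(per_invocation: &[InvocationScore]) -> (f64, Status) {
    // Assumption: an empty case averages to 0.0 instead of dividing by zero.
    let n = per_invocation.len().max(1) as f64;
    let mean = per_invocation.iter().map(|s| s.score).sum::<f64>() / n;
    let status = if per_invocation.iter().any(|s| s.status == Status::Error) {
        Status::Error
    } else if per_invocation.iter().all(|s| s.status == Status::Passed) {
        Status::Passed
    } else {
        Status::Failed
    };
    (mean, status)
}

/// The case passes only when every metric passed; anything else is FAILED.
fn overall(metric_statuses: &[Status]) -> Status {
    if metric_statuses.iter().all(|s| *s == Status::Passed) {
        Status::Passed
    } else {
        Status::Failed
    }
}
```

Under these assumptions, a metric that errors on a single invocation reports Error for the whole case, yet the case's overall status is still FAILED, because overall only distinguishes "everything passed" from "something did not".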

Metrics

Metrics implement the Evaluator trait — name(&self) -> &str (the key in the scores map) and async fn evaluate(&self, expected: &Invocation, actual: &Invocation) -> Result<EvalScore>. Two ship in the box:

| Metric | Key | Algorithm |
| --- | --- | --- |
| TrajectoryMatch::new(threshold) | tool_trajectory_avg_score | In-order exact matching of (tool name, args) pairs from intermediate_data.tool_uses. Score = matched / max(expected len, actual len, 1). Default threshold 1.0 = exact trajectory. |
| ResponseMatch::new(threshold) | final_response_match_v1 | Case-insensitive, whole-token unigram overlap: the fraction of expected tokens that appear as whole tokens in the actual response text (an expected token cat does not match inside concatenate). A deliberately rough, Rouge-like metric (not a true Rouge-L); an empty expected response scores 1.0. Default threshold 0.8. |
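
To make the two formulas concrete, here is a minimal std-only sketch of both scores. It is not the crate's implementation. The function names are hypothetical, args are compared as plain strings where the real metric compares JSON values, "in-order" is read as position-by-position comparison, and the tokenizer (lowercase, split on non-alphanumeric characters) is an assumption.

```rust
use std::collections::HashSet;

/// Positional trajectory score: matched / max(expected len, actual len, 1).
/// Each tool use is a (name, args) pair; args are strings in this sketch.
fn trajectory_score(expected: &[(&str, &str)], actual: &[(&str, &str)]) -> f64 {
    // Count positions where both name and args agree exactly.
    let matched = expected
        .iter()
        .zip(actual.iter())
        .filter(|(e, a)| e == a)
        .count();
    // The trailing 1 only guards against division by zero.
    let denom = expected.len().max(actual.len()).max(1);
    matched as f64 / denom as f64
}

/// Lowercase and split on non-alphanumeric characters (assumed tokenizer).
fn tokens(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

/// Fraction of expected tokens that occur as whole tokens in the actual text.
fn response_score(expected: &str, actual: &str) -> f64 {
    let expected_tokens = tokens(expected);
    if expected_tokens.is_empty() {
        return 1.0; // an empty expected response scores 1.0
    }
    let actual_tokens: HashSet<String> = tokens(actual).into_iter().collect();
    let hits = expected_tokens
        .iter()
        .filter(|t| actual_tokens.contains(*t))
        .count();
    hits as f64 / expected_tokens.len() as f64
}
```

With this sketch, an agent that makes one of two expected tool calls scores 0.5 on the trajectory metric, and response_score("the cat", "Concatenate the strings") is 0.5 because only "the" appears as a whole token.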

Scores are EvalScore { score: f64, status: EvalStatus, details: Value } with status derived from score >= threshold. EvalStatus serializes uppercase: PASSED, FAILED, or ERROR. An EvalResult carries eval_set_id, eval_case_id, the per-evaluator scores map (each score the invocation average described above), and overall_status (the logical AND of every metric across every invocation in the case).
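
As a small illustration of the threshold rule and the wire names, the sketch below uses a std-only stand-in for the real enum. The helpers status_for and wire_name are hypothetical; in the crate the uppercase names come from serde serialization, not from a method like this.

```rust
/// Stand-in for the crate's EvalStatus, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum EvalStatus {
    Passed,
    Failed,
    Error,
}

impl EvalStatus {
    /// The uppercase name the status takes in serialized JSON.
    fn wire_name(self) -> &'static str {
        match self {
            EvalStatus::Passed => "PASSED",
            EvalStatus::Failed => "FAILED",
            EvalStatus::Error => "ERROR",
        }
    }
}

/// Derive a status from a score; the comparison is inclusive (score >= threshold).
fn status_for(score: f64, threshold: f64) -> EvalStatus {
    if score >= threshold {
        EvalStatus::Passed
    } else {
        EvalStatus::Failed
    }
}
```

Because the comparison is inclusive, a score exactly equal to the threshold passes.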

Running evals

Via the embedded CLI:
my-app eval --agent greeter --set samples/hello_world.evalset.json
# prints the EvalReport as pretty JSON

The CLI eval subcommand uses TrajectoryMatch::new(1.0) and ResponseMatch::new(0.5). Programmatically you pick your own metrics and thresholds:

Programmatic eval run:
use adk_rs::eval::{EvalRunner, EvalSet, ResponseMatch, TrajectoryMatch};
use std::sync::Arc;

let bytes = tokio::fs::read("hello_world.evalset.json").await?;
let set: EvalSet = serde_json::from_slice(&bytes)?;

let runner = EvalRunner::new(
    agent,                 // Arc<dyn BaseAgent>
    "hello_world",
    "eval-user",
    vec![
        Arc::new(TrajectoryMatch::new(1.0)),
        Arc::new(ResponseMatch::new(0.5)),
    ],
);
let report = runner.run_set(&set).await?;
for r in &report.results {
    println!("{} -> {:?}", r.eval_case_id, r.overall_status);
}