Evaluation
Replay recorded eval sets against any agent and score tool trajectories and final responses with pluggable metrics.
The eval feature replays recorded conversations — eval sets — against any BaseAgent and scores the results. Eval sets are plain, portable JSON files, wire-compatible with Python ADK's adk eval output: files written by the Python tooling load unmodified, including FunctionCall-shaped tool_uses with extra fields. Record a conversation once and replay it against any agent build.
Eval-set format
An EvalSet is a list of EvalCases; each case is a conversation of Invocations — one user prompt, the expected final response, and the expected intermediate data (tool calls and intermediate model turns). All shapes derive plain serde, so a file is just JSON. Ids hit the wire under their Python ADK names — eval_set_id and eval_id — with the older adk-rs id key accepted on read as a legacy alias:
{
  "eval_set_id": "hello-world-set",
  "name": "Hello world",
  "creation_timestamp": 0.0,
  "eval_cases": [
    {
      "eval_id": "case-1",
      "name": "greeting with a tool call",
      "session_input": null,
      "conversation": [
        {
          "invocation_id": "inv-1",
          "user_content": {
            "role": "user",
            "parts": [{ "text": "What's the weather in Paris?" }]
          },
          "final_response": {
            "role": "model",
            "parts": [{ "text": "It is sunny in Paris, 22 degrees." }]
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "get_weather", "args": { "city": "Paris" } }
            ],
            "intermediate_responses": []
          }
        }
      ]
    }
  ]
}

| Type | Fields |
|---|---|
| EvalSet | id (serializes as eval_set_id), name (default empty), description: Option<String>, eval_cases, creation_timestamp (seconds, default 0). |
| EvalCase | id (serializes as eval_id), conversation: Vec<Invocation>, session_input: Option<SessionInput>, name: Option<String>, creation_timestamp. |
| Invocation | user_content: Content, final_response: Option<Content>, intermediate_data (default empty), invocation_id (default empty), creation_timestamp. |
| IntermediateData | tool_uses: Vec<ToolUse>, intermediate_responses: Vec<(String, Vec<Part>)>, stored as (author, parts) pairs, byte-compatible with Python. |
| SessionInput | app_name, user_id, state (initial session state map). |
| ToolUse | name, args (JSON value). Python stores full FunctionCall objects here; extra fields (e.g. id) are ignored on read. |
Loading
- fn load_eval_set_from_str(s: &str) -> Result<EvalSet>: Parse from a JSON string.
- async fn load_eval_set_from_file(path: impl AsRef<Path>) -> Result<EvalSet>: Read and parse a JSON file with tokio::fs.

Both are thin wrappers over serde — serde_json::from_slice::<EvalSet>(&bytes) works just as well.
EvalRunner
- fn new(agent: Arc<dyn BaseAgent>, app_name: impl Into<String>, user_id: impl Into<String>, evaluators: Vec<Arc<dyn Evaluator>>) -> EvalRunner: Construct with the agent under test, an app name, a user id, and the metric set.
- async fn run_set(&self, set: &EvalSet) -> Result<EvalReport>: Run every case in order, returning EvalReport { results: Vec<EvalResult> }.
- async fn run_case(&self, set_id: &str, case: &EvalCase) -> Result<EvalResult>: Run one case. Each case gets a fresh in-memory session; the conversation’s user messages replay in order against the same session.

For each invocation the runner drives the agent directly (no Runner), collects every FunctionCall part into actual tool_uses, records non-final content as (author, parts) pairs in intermediate_responses, and takes the event where is_final_response() holds as the actual final_response. Each evaluator scores every invocation; the reported per-evaluator score is the average across the case’s invocations, with the per-invocation scores riding in details.per_invocation. A metric’s status is the AND of its per-invocation statuses (Error dominating), and any non-PASSED metric flips the case’s overall_status to FAILED.
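The aggregation rules in the paragraph above can be sketched with the standard library alone. This is an illustration and not the crate's code: `Status`, `aggregate`, and `overall_status` are hypothetical names, and returning `None` for a case with no invocations is an assumption, since the text does not say how an empty case is scored.

```rust
/// Hypothetical stand-in for the PASSED / FAILED / ERROR statuses.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Passed,
    Failed,
    Error,
}

/// Fold one metric's per-invocation results into a single (score, status).
/// The score is the mean; the status is the AND of the per-invocation
/// statuses, with Error dominating.
/// Assumption: an empty slice yields None (not specified by the source).
fn aggregate(per_invocation: &[(f64, Status)]) -> Option<(f64, Status)> {
    if per_invocation.is_empty() {
        return None;
    }
    let sum: f64 = per_invocation.iter().map(|(score, _)| score).sum();
    let mean = sum / per_invocation.len() as f64;

    let status = if per_invocation.iter().any(|(_, st)| *st == Status::Error) {
        Status::Error
    } else if per_invocation.iter().all(|(_, st)| *st == Status::Passed) {
        Status::Passed
    } else {
        Status::Failed
    };
    Some((mean, status))
}

/// Any non-Passed metric flips the whole case to Failed.
fn overall_status(metric_statuses: &[Status]) -> Status {
    if metric_statuses.iter().all(|st| *st == Status::Passed) {
        Status::Passed
    } else {
        Status::Failed
    }
}
```

Under these rules, a two-invocation case scoring 1.0 (passed) and 0.5 (failed) on one metric reports a score of 0.75 with a failed status, so the case as a whole fails.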
Metrics
Metrics implement the Evaluator trait — name(&self) -> &str (the key in the scores map) and async fn evaluate(&self, expected: &Invocation, actual: &Invocation) -> Result<EvalScore>. Two ship in the box:
| Metric | Key | Algorithm |
|---|---|---|
| TrajectoryMatch::new(threshold) | tool_trajectory_avg_score | In-order exact matching of (tool name, args) pairs from intermediate_data.tool_uses. Score = matched / max(expected len, actual len, 1). Default threshold 1.0 = exact trajectory. |
| ResponseMatch::new(threshold) | final_response_match_v1 | Case-insensitive, whole-token unigram overlap: the fraction of expected tokens that appear as whole tokens in the actual response text, so an expected token cat does not match inside concatenate. A deliberately rough, Rouge-like metric (not a true Rouge-L); an empty expected response scores 1.0. Default threshold 0.8. |
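The trajectory formula from the table can be sketched as follows. This is a std-only approximation, not the crate's implementation: `ToolUse` here is a simplified hypothetical struct whose `args` is a plain string standing in for a JSON value, and "in-order" is assumed to mean position-by-position comparison (the real metric may align the sequences differently).

```rust
/// Simplified, hypothetical tool call: `args` stands in for a JSON value.
#[derive(Debug, Clone, PartialEq)]
struct ToolUse {
    name: String,
    args: String,
}

/// Score = matched / max(expected len, actual len, 1).
/// Assumption: calls are compared position by position, and a pair
/// matches only when both the name and the args are equal.
fn trajectory_score(expected: &[ToolUse], actual: &[ToolUse]) -> f64 {
    let matched = expected
        .iter()
        .zip(actual.iter())
        .filter(|(e, a)| e == a)
        .count();
    // The floor of 1 only guards against dividing by zero.
    let denom = expected.len().max(actual.len()).max(1);
    matched as f64 / denom as f64
}
```

With two expected calls and an agent that makes only the first one, one pair matches and the longer list has two entries, giving 0.5. That fails the default threshold of 1.0.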
Scores are EvalScore { score: f64, status: EvalStatus, details: Value } with status derived from score >= threshold. EvalStatus serializes uppercase: PASSED, FAILED, or ERROR. An EvalResult carries eval_set_id, eval_case_id, the per-evaluator scores map (each score the invocation average described above), and overall_status (the logical AND of every metric across every invocation in the case).
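For intuition about final_response_match_v1 and the score >= threshold rule, here is a rough std-only sketch. The function names are hypothetical, and two details are assumptions the text does not settle: tokens are taken to be lowercase runs of alphanumeric characters, and every expected token occurrence counts toward the denominator.

```rust
use std::collections::HashSet;

/// Assumption: a token is a lowercase run of alphanumeric characters.
fn tokens(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

/// Fraction of expected tokens that appear as whole tokens in `actual`.
/// An empty expected response scores 1.0.
fn response_score(expected: &str, actual: &str) -> f64 {
    let expected_tokens = tokens(expected);
    if expected_tokens.is_empty() {
        return 1.0;
    }
    let actual_tokens: HashSet<String> = tokens(actual).into_iter().collect();
    let hits = expected_tokens
        .iter()
        .filter(|t| actual_tokens.contains(*t))
        .count();
    hits as f64 / expected_tokens.len() as f64
}

/// A score passes when it reaches the threshold.
fn passes(score: f64, threshold: f64) -> bool {
    score >= threshold
}
```

Under these assumptions, the expected text "It is sunny in Paris, 22 degrees." has seven tokens. The reply "Sunny in Paris today" covers three of them, so it scores 3/7 (about 0.43) and falls below a 0.5 threshold.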
Running evals
my-app eval --agent greeter --set samples/hello_world.evalset.json
# prints the EvalReport as pretty JSON

The CLI eval subcommand uses TrajectoryMatch::new(1.0) and ResponseMatch::new(0.5). Programmatically you pick your own metrics and thresholds:
use adk_rs::eval::{EvalRunner, EvalSet, ResponseMatch, TrajectoryMatch};
use std::sync::Arc;

let bytes = tokio::fs::read("hello_world.evalset.json").await?;
let set: EvalSet = serde_json::from_slice(&bytes)?;

let runner = EvalRunner::new(
    agent, // Arc<dyn BaseAgent>
    "hello_world",
    "eval-user",
    vec![
        Arc::new(TrajectoryMatch::new(1.0)),
        Arc::new(ResponseMatch::new(0.5)),
    ],
);

let report = runner.run_set(&set).await?;
for r in &report.results {
    println!("{} -> {:?}", r.eval_case_id, r.overall_status);
}

- Embedded CLI: the eval subcommand.
- Testing agents: unit-level testing with mocks.
- Events: is_final_response and the event shapes the runner inspects.