Context caching

Cache the stable request prefix server-side with ContextCacheConfig and a static instruction to cut cost and latency.

Agents with large instructions or tool sets resend the same prefix on every LLM call. With a ContextCacheConfig attached, cache-capable providers reuse that stable prefix — system instruction plus tool declarations — server-side instead of reprocessing it, cutting both token cost and latency. Gemini creates an explicit cachedContents entry; Anthropic gets a cache_control breakpoint on the same prefix; OpenAI caches automatically with nothing to configure.

ContextCacheConfig

ContextCacheConfig { cache_intervals: u32, ttl_seconds: u64, min_tokens: u64 }
Configuration for explicit provider-side context caching. Implements Default.
cache_intervals: u32 — default 10
Maximum number of LLM calls served by one cache entry before it is refreshed; guards against unbounded staleness.
ttl_seconds: u64 — default 1800
Cache-entry time-to-live, in seconds.
min_tokens: u64 — default 0
Minimum estimated token size of the cacheable prefix. Smaller prefixes are sent inline — caching tiny prefixes costs more than it saves, and Gemini also enforces a server-side minimum.
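Because the config implements Default, a single field can be overridden with struct update syntax while the rest keep their documented defaults. The sketch below uses a local stand-in type that mirrors the fields and defaults listed above; it is not the crate's own definition, and `with_min_tokens` is a hypothetical helper.

```rust
/// Local stand-in mirroring the documented fields; not the crate's own type.
#[derive(Debug, Clone, PartialEq)]
pub struct ContextCacheConfig {
    pub cache_intervals: u32,
    pub ttl_seconds: u64,
    pub min_tokens: u64,
}

impl Default for ContextCacheConfig {
    fn default() -> Self {
        // Defaults as documented above: 10 calls, 1800 seconds, no minimum.
        Self { cache_intervals: 10, ttl_seconds: 1800, min_tokens: 0 }
    }
}

/// Hypothetical helper: override only the minimum prefix size.
pub fn with_min_tokens(min_tokens: u64) -> ContextCacheConfig {
    ContextCacheConfig { min_tokens, ..Default::default() }
}
```

With the real type, the same `..Default::default()` pattern should apply as long as its fields are public.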

Wiring it up

Set the config once on the runner, which copies it into every invocation's RunConfig, or set it per invocation via RunConfig::context_cache_config, which overrides the runner-level value. The agent stamps the resolved config onto each LlmRequest it builds (cache_config), and the provider takes it from there.

Runner-level config (rust)
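use adk_rs::core::ContextCacheConfig;

// `agent` and `svc` are assumed to be built earlier; `Runner` must also be in scope.
let runner = Runner::builder()
    .app_name("support")
    .agent(agent)
    .session_service(svc)
    .context_cache_config(ContextCacheConfig {
        cache_intervals: 10, // refresh the entry after 10 LLM calls
        ttl_seconds: 1800,   // entry lives for 30 minutes
        min_tokens: 2048,    // prefixes estimated below this are sent inline
    })
    .build()?;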
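The precedence rule (per-invocation value wins, otherwise the runner-level value applies) can be modelled in a few lines. This is only a sketch: the `RunConfig` and `ContextCacheConfig` types here are local stand-ins, the `Option` field type is an assumption, and `effective_cache_config` is a hypothetical name.

```rust
/// Local stand-in mirroring the documented fields.
#[derive(Debug, Clone, PartialEq)]
struct ContextCacheConfig {
    cache_intervals: u32,
    ttl_seconds: u64,
    min_tokens: u64,
}

/// Hypothetical stand-in for the per-invocation settings.
#[derive(Default)]
struct RunConfig {
    context_cache_config: Option<ContextCacheConfig>,
}

/// Per-invocation config if present, otherwise the runner-level one.
fn effective_cache_config(
    run: &RunConfig,
    runner_level: Option<&ContextCacheConfig>,
) -> Option<ContextCacheConfig> {
    run.context_cache_config
        .clone()
        .or_else(|| runner_level.cloned())
}
```

If neither level sets a config, the result is `None`, which in this model means no explicit caching is requested.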

Pair with static_instruction

Caching only pays off if the prefix is byte-identical across turns. A regular .instruction(...) is templated against session state ({key} substitution) and re-resolved every turn, so any change — a new state value, a dynamic provider — produces a different system instruction and a cache miss. LlmAgent::static_instruction exists for exactly this: it is sent verbatim, never templated, never re-evaluated, at the very start of the system instruction. When a static instruction is present, the dynamic instruction is moved out of the system prompt and appended to the request contents (after the user turn), so the cached prefix stays stable.

Cache-stable prefix + dynamic remainder (rust)
use std::sync::Arc;

let agent = LlmAgent::builder("support")
    .model(Arc::new(Gemini::from_env("gemini-2.5-flash")?))
    // Large, stable: policies, product docs, few-shot examples.
    .static_instruction(POLICY_HANDBOOK)
    // Small, per-turn: templated from session state, rides in contents.
    .instruction("The current user's plan is {plan}.")
    .build()?;

LlmAgentBuilder::static_instruction(self, s: impl Into<String>) -> Self
Cache-stable instruction prefix as text.
LlmAgentBuilder::static_instruction_content(self, c: Content) -> Self
Same, but accepts arbitrary Content (e.g. multimodal parts).
RunnerBuilder::context_cache_config(self, cfg: ContextCacheConfig) -> Self
Enable explicit caching for every invocation; per-invocation RunConfig overrides it.
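To see why a templated instruction defeats the cache while a static one does not, it helps to fingerprint the text that would be sent on two different turns. The sketch below is illustrative only: `render` is a naive `{key}` substitution standing in for the agent's templating, and `fingerprint` is a hypothetical helper built on the standard library hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Naive `{key}` substitution, standing in for the agent's templating.
fn render(template: &str, state: &HashMap<&str, &str>) -> String {
    let mut out = template.to_string();
    for (key, value) in state {
        // "{{{}}}" formats to the literal placeholder, e.g. "{plan}".
        out = out.replace(&format!("{{{}}}", key), value);
    }
    out
}

/// Hash of the exact text sent; equal text gives an equal fingerprint.
fn fingerprint(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}
```

Rendering "The current user's plan is {plan}." with `plan = free` and then `plan = pro` yields two different strings and therefore two different fingerprints, whereas a static prefix hashes identically on every turn.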

What the Gemini provider does

  • Computes a fingerprint of the cacheable prefix: model + system instruction + tool declarations. Any change invalidates the entry.
  • Estimates prefix size (characters / 4) and skips caching below min_tokens, or when there is no system instruction and no tools.
  • Creates an entry via POST /cachedContents and tracks it in-process with its expiry and use count; subsequent requests reference the entry with cachedContent and omit the cached fields from the request body.
  • Refreshes the entry after cache_intervals uses or when the TTL lapses.
  • On a creation failure, disables caching for that prefix for one TTL and logs a warning — caching is an optimization, never a source of run failures.
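The bookkeeping in the list above can be sketched with the standard library alone. This is an approximation under stated assumptions, not the provider's actual code: the hashing scheme, the `Entry` shape, and all function names are hypothetical, and the HTTP call to create the entry is left out.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::{Duration, Instant};

/// Fingerprint over model + system instruction + tool declarations.
fn fingerprint(model: &str, system: &str, tools: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    model.hash(&mut hasher);
    system.hash(&mut hasher);
    tools.hash(&mut hasher);
    hasher.finish()
}

/// Rough size estimate: characters / 4.
fn estimate_tokens(system: &str, tools: &[&str]) -> u64 {
    let chars = system.chars().count()
        + tools.iter().map(|t| t.chars().count()).sum::<usize>();
    (chars / 4) as u64
}

/// Skip caching for an empty prefix or one estimated below `min_tokens`.
fn should_cache(system: &str, tools: &[&str], min_tokens: u64) -> bool {
    let has_prefix = !system.is_empty() || !tools.is_empty();
    has_prefix && estimate_tokens(system, tools) >= min_tokens
}

/// In-process record of one provider-side entry (hypothetical shape).
struct Entry {
    fingerprint: u64,
    expires_at: Instant,
    uses: u32,
}

impl Entry {
    fn new(fingerprint: u64, ttl_seconds: u64, now: Instant) -> Self {
        Self {
            fingerprint,
            expires_at: now + Duration::from_secs(ttl_seconds),
            uses: 0,
        }
    }

    /// Refresh when the prefix changed, the TTL lapsed, or the use budget is spent.
    fn needs_refresh(&self, current: u64, cache_intervals: u32, now: Instant) -> bool {
        self.fingerprint != current
            || now >= self.expires_at
            || self.uses >= cache_intervals
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside the methods, keeps the expiry logic easy to test.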

What the Anthropic provider does

Anthropic's prompt cache is server-managed, so there is no entry to create: the same ContextCacheConfig instead becomes a cache_control breakpoint on the stable prefix. The breakpoint lands on the system block (tools render before system on Anthropic's side, so one marker caches both); with no system instruction it lands on the last tool declaration. A ttl_seconds of 3600 or more selects the 1-hour cache tier, anything less the default 5-minute tier. cache_intervals and min_tokens are Gemini-specific and ignored — Anthropic enforces its own server-side minimums. Cache activity is reported per response via cache_metadata (cache_hit) and usage_metadata.cached_content_token_count. Pair with static_instruction exactly as for Gemini: the breakpoint only pays off when the prefix is byte-identical across turns.
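The tier selection and breakpoint placement described above reduce to two small decisions. The following is a minimal sketch with hypothetical type and function names; the `Nowhere` case (no system instruction and no tools) is an assumption, since the text does not say what happens then.

```rust
/// Cache tier chosen from `ttl_seconds`.
#[derive(Debug, PartialEq)]
enum Tier {
    FiveMinutes,
    OneHour,
}

/// 3600 seconds or more selects the 1-hour tier, anything less the 5-minute tier.
fn tier_for(ttl_seconds: u64) -> Tier {
    if ttl_seconds >= 3600 {
        Tier::OneHour
    } else {
        Tier::FiveMinutes
    }
}

/// Where the single `cache_control` marker goes.
#[derive(Debug, PartialEq)]
enum Breakpoint {
    SystemBlock,
    LastTool,
    Nowhere, // assumption: nothing to cache, so no marker
}

fn breakpoint_for(has_system: bool, tool_count: usize) -> Breakpoint {
    if has_system {
        // Tools render before the system block, so this marker covers both.
        Breakpoint::SystemBlock
    } else if tool_count > 0 {
        Breakpoint::LastTool
    } else {
        Breakpoint::Nowhere
    }
}
```

Under this rule the documented default `ttl_seconds` of 1800 falls into the 5-minute tier, so the 1-hour tier has to be requested explicitly with a value of at least 3600.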

Observability: CacheMetadata

Cache-capable providers attach a CacheMetadata to each LlmResponse, which surfaces on events as event.response.cache_metadata:

struct CacheMetadata { cache_name: String, cache_hit: bool }
cache_name is the provider-side resource (e.g. cachedContents/abc123); cache_hit is true when the response was served against an existing entry, false when the entry was created for this call.

Watching cache behaviour (rust)
// Assumes a `StreamExt` trait (for example `futures::StreamExt`) is in scope for `.next()`.
let mut events = runner.run("user", Some(&session_id), "next question").await?;
while let Some(event) = events.next().await {
    let event = event?;
    if let Some(meta) = &event.response.cache_metadata {
        println!("cache {} hit={}", meta.cache_name, meta.cache_hit);
    }
}
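Beyond printing each event, the collected metadata can be reduced to a hit rate to judge whether caching is paying off. This sketch uses a local stand-in for CacheMetadata with the two documented fields; `hit_rate` is a hypothetical helper, not part of the library.

```rust
/// Local stand-in with the two documented fields.
struct CacheMetadata {
    cache_name: String,
    cache_hit: bool,
}

/// Fraction of responses served from an existing entry, or None if there were none.
fn hit_rate(metas: &[CacheMetadata]) -> Option<f64> {
    if metas.is_empty() {
        return None;
    }
    let hits = metas.iter().filter(|m| m.cache_hit).count();
    Some(hits as f64 / metas.len() as f64)
}
```

A rate that stays near zero across a session usually points to a prefix that changes between turns, which is the situation static_instruction is meant to avoid.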

  • LlmAgent — instructions, static instructions, and templating.
  • Providers — provider capabilities and configuration.
  • Event compaction — the complementary lever for long histories rather than long prefixes.