
feat(providers): prompt caching for Anthropic + Azure-Anthropic #5101

Open
waleedlatif1 wants to merge 6 commits into staging from feat/provider-prompt-caching

Conversation

@waleedlatif1

@waleedlatif1 waleedlatif1 commented Jun 16, 2026

Collaborator

Summary

  • Marks the static request prefix (system prompt + tools) with an ephemeral cache_control breakpoint for Anthropic (and Azure-Anthropic, which shares the core), so repeated calls — agent tool-loops and multi-turn chats — reuse the cached prefix: ~90% cheaper cached input + lower latency.
  • The tagging lives in one directly-tested helper, applyAnthropicPromptCache(payload, tools, systemPrompt) (anthropic/utils.ts), which gates on whether caching is worthwhile and mutates the system block + last tool.
  • Always on — there is no feature flag. Caching is transparent to outputs, so it runs for every eligible request.

When it caches (the gate)

providers/prompt-cache.ts only applies breakpoints when the static prefix is large enough to be cacheable and likely reused (tools present, or a large system prompt). A one-shot, tool-less call is skipped so it never pays the cache-write surcharge for a prefix that's never read back. The gate is sized on the larger of the final payload.system (which may include appended structured-output schema) and the original request.systemPrompt (non-empty even when the no-messages path relocates it into a user message).
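
The gate described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Sketch` suffix, the `StaticPrefix` shape, and the 4-characters-per-token estimate are assumptions, and only the 1,024-token threshold comes from the review summaries in this thread.

```typescript
// Hypothetical sketch of the gate. The signature, the chars-per-token
// heuristic, and the constant names are assumptions for illustration.
const MIN_CACHEABLE_TOKENS = 1024 // threshold quoted in the review summaries
const APPROX_CHARS_PER_TOKEN = 4 // assumption: crude estimate, not a tokenizer

interface StaticPrefix {
  systemPrompt: string
  hasTools: boolean
  toolsApproxChars: number
}

function estimateTokens(chars: number): number {
  return Math.ceil(chars / APPROX_CHARS_PER_TOKEN)
}

function shouldCacheStaticPrefixSketch(prefix: StaticPrefix): boolean {
  const systemChars = prefix.systemPrompt.trim().length
  // No system prompt at all: skip, so a one-shot call never pays a cache write.
  if (systemChars === 0) return false

  const toolChars = prefix.hasTools ? prefix.toolsApproxChars : 0
  const prefixTokens = estimateTokens(systemChars + toolChars)
  // Below the minimum, tagging the prefix is not worthwhile.
  if (prefixTokens < MIN_CACHEABLE_TOKENS) return false

  // Reuse is likely when tools are present (tool loop) or the system prompt is
  // large by itself. With no tools the prefix is the system prompt alone, so
  // the second operand only mirrors the PR's wording here.
  return prefix.hasTools || estimateTokens(systemChars) >= MIN_CACHEABLE_TOKENS
}
```

Under these assumptions, a 5,000-character system prompt with no tools passes the gate, while a short prompt with a small tool list does not.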

Why this is safe

  • Outputs are identical — prompt caching only reuses the model's computed prefix; it never changes generated responses.
  • Faster + cheaper on Claude (cached input ~0.1×).
  • Cost accounting stays accurate: the Anthropic provider already reads cache_read_input_tokens / cache_creation_input_tokens (buildAnthropicSegmentTokens).

Standard practice

Matches the AI SDK / LangChain / Spring AI / Pydantic AI / LiteLLM convention: explicit cache breakpoints for Claude (Anthropic/Bedrock), automatic server-side caching for OpenAI/Gemini/etc. We auto-place breakpoints on the system+tools prefix (the convergent "SYSTEM_AND_TOOLS" strategy), so users don't hand-mark anything.

Type of Change

  • Performance/cost optimization (no behavioral change to outputs)

Testing

  • bun run type-check clean
  • 12 unit tests (gate logic + the applyAnthropicPromptCache payload mutation across all paths: system→cached block, last-tool tagged, relocated/blanked system, schema-appended system, below-threshold/tool-less no-op), verified on vitest 4.1.8
  • bun run lint clean · bun run check:api-validation passed

Follow-ups (not in this PR)

  • Bedrock (cachePoint) and OpenRouter (cache_control passthrough for Claude) — these need cached-token accounting added alongside (Bedrock doesn't read cacheReadInputTokens/cacheWriteInputTokens), so shipping caching there without it would mis-report cost.
  • Optional prompt_cache_key for OpenAI/Azure.

Checklist

  • Code follows project style guidelines
  • Self-reviewed
  • Tests added/updated and passing
  • No new warnings
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)
Mark the static request prefix (system prompt + tools) with an ephemeral
cache_control breakpoint so repeated calls — agent tool-loops and multi-turn —
reuse the cached prefix (~90% cheaper cached input + lower latency). Azure-
Anthropic inherits this via the shared core.

- New providers/prompt-cache.ts gate: only caches when the static prefix is
  large enough to be cacheable AND likely reused (tools present, or a large
  system prompt), so a one-shot tool-less call never pays the cache-write
  surcharge. Kill switch: PROMPT_CACHE_DISABLED=true.
- anthropic/core.ts: convert system string -> a cached text block (after the
  structured-output concat, which assumes a string) and tag the last tool. Uses
  2 of Anthropic's 4 breakpoints; the tool-loop reuses the tagged payload.
- Outputs are unchanged; cost accounting already reads cache_read/creation
  tokens (buildAnthropicSegmentTokens), so usage stays accurate.

Matches the AI SDK / LangChain / Spring AI convention (explicit breakpoints for
Claude; automatic for OpenAI/Gemini). Bedrock + OpenRouter to follow (they need
cache-token accounting alongside).
@vercel

vercel Bot commented Jun 16, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 16, 2026 10:59pm


@cursor

cursor Bot commented Jun 16, 2026


PR Summary

Low Risk
Request-shape-only optimization with explicit gating; existing cache token accounting in usage handling is unchanged and outputs are not altered.

Overview
Adds automatic Anthropic prompt caching for the shared Anthropic/Azure-Anthropic request path by tagging the static prefix (system + tool definitions) with ephemeral cache_control when reuse is likely, without changing model outputs.

A new shouldCacheStaticPrefix gate (~1,024 estimated tokens, tools or large system) skips small one-shot calls so they do not pay cache-write surcharges. applyAnthropicPromptCache runs in executeAnthropicProviderRequest after structured-output changes to payload.system, converts a non-empty system string into a cached text block, and sets cache_control on the last tool only. Sizing uses the larger of final payload.system (e.g. appended JSON schema) and the original request.systemPrompt (including when the no-messages path blanks payload.system but tools remain).
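
The payload mutation described in this paragraph might look roughly like the sketch below. The local types stand in for the SDK's request types, `prefixIsCacheable` is a simplified stand-in for the real gate, and mutating the passed-in `tools` array (rather than a copy on the payload) is an assumption; only the `cache_control: { type: 'ephemeral' }` shape and the "larger of the two system strings" sizing rule come from the PR text.

```typescript
// Hypothetical sketch; local types stand in for the SDK's request types.
type CacheControl = { type: 'ephemeral' }

interface TextBlock {
  type: 'text'
  text: string
  cache_control?: CacheControl
}

interface ToolDef {
  name: string
  input_schema: Record<string, unknown>
  cache_control?: CacheControl
}

interface Payload {
  system?: string | TextBlock[]
}

// Assumed stand-in for the gate: ~4 chars per token against a 1,024-token floor.
function prefixIsCacheable(gateSystem: string, toolsApproxChars: number): boolean {
  if (gateSystem.trim().length === 0) return false
  return Math.ceil((gateSystem.length + toolsApproxChars) / 4) >= 1024
}

function applyPromptCacheSketch(
  payload: Payload,
  tools: ToolDef[] | undefined,
  requestSystemPrompt: string | undefined
): void {
  const payloadSystem = typeof payload.system === 'string' ? payload.system : ''
  const requestSystem = requestSystemPrompt ?? ''
  // Size the gate on whichever string is longer: the final payload.system may
  // carry an appended schema, and the request prompt survives relocation.
  const gateSystem = payloadSystem.length >= requestSystem.length ? payloadSystem : requestSystem

  const toolList = tools ?? []
  const toolsApproxChars = toolList.length > 0 ? JSON.stringify(toolList).length : 0
  if (!prefixIsCacheable(gateSystem, toolsApproxChars)) return

  // Convert a non-empty system string into a single cached text block.
  if (payloadSystem.trim().length > 0) {
    payload.system = [{ type: 'text', text: payloadSystem, cache_control: { type: 'ephemeral' } }]
  }

  // Tag only the last tool; the breakpoint covers every tool before it.
  const lastTool = toolList[toolList.length - 1]
  if (lastTool) lastTool.cache_control = { type: 'ephemeral' }
}
```

When `payload.system` has been blanked by the no-messages path, the sketch leaves it untouched and still tags the last tool, which matches the relocation case the paragraph calls out.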

Unit tests cover the gate, payload mutation edge cases, and end-to-end payload capture on the streaming/no-tools path.

Reviewed by Cursor Bugbot for commit b9a453d. Configure here.

Comment thread apps/sim/providers/anthropic/core.ts Outdated
@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Contributor

Greptile Summary

This PR enables Anthropic prompt caching for the Anthropic and Azure-Anthropic providers by stamping cache_control: { type: 'ephemeral' } on the static request prefix (system prompt + last tool definition). A gating function avoids the cache-write surcharge on small, tool-less one-shot calls.

  • providers/prompt-cache.ts introduces shouldCacheStaticPrefix, which gates caching on a ≥1,024-token combined prefix estimate and requires either tools or a system prompt that is large enough on its own.
  • providers/anthropic/utils.ts adds applyAnthropicPromptCache, called once after schema mutation in core.ts, that converts the system string to a cached block and tags the last tool; 12 unit tests cover all paths including the no-messages relocation edge case.

Confidence Score: 4/5

Safe to merge with awareness that cache token fees are still excluded from the top-level ProviderResponse cost totals.

The caching logic itself is correct and well-tested. However, now that caching is always on, every warm-cache call reports an inaccurate cost: cache_creation_input_tokens (billed at ~1.25× input rate) and cache_read_input_tokens (billed at ~0.1× input rate) are never added to the accumulated tokens or cost objects returned in ProviderResponse. The per-segment trace handles this correctly via buildAnthropicSegmentTokens, but the response-level totals that callers use for billing display will undercount on every cached request.

apps/sim/providers/anthropic/core.ts — all three token-accumulation sites (streaming-no-tools path ~line 422, non-streaming initial response ~line 860, and tool-loop iteration ~line 1141) need cache token accounting added to match what buildAnthropicSegmentTokens already does correctly.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/providers/anthropic/core.ts | Single-line insertion of applyAnthropicPromptCache is correct in placement (after schema mutation, before thinking config), but the accumulated tokens/cost returned in ProviderResponse still exclude cache_creation_input_tokens and cache_read_input_tokens, causing systematic cost underreporting on every warm-cache call. |
| apps/sim/providers/anthropic/utils.ts | Adds applyAnthropicPromptCache, which correctly handles system-string-to-block conversion, last-tool tagging, and the no-messages relocation edge case. Logic and tests are thorough. |
| apps/sim/providers/prompt-cache.ts | New gate function shouldCacheStaticPrefix: well-designed with correct token estimation, clear invariant (require non-empty system prompt to avoid one-shot write surcharges), and comprehensive unit tests covering all branches. |
| apps/sim/providers/anthropic/utils.test.ts | 12 unit tests covering all applyAnthropicPromptCache paths: large/small system, with/without tools, schema-appended system, relocated/blanked system, and below-threshold no-op. |
| apps/sim/providers/prompt-cache.test.ts | Gate tests are complete and use vi.stubEnv correctly (addressing the prior env-coercion concern). |
| apps/sim/providers/anthropic/core.test.ts | Integration-level request-capture tests verify the system block is tagged for large prompts and left as a plain string for small ones. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[executeAnthropicProviderRequest] --> B[Build payload\nsystem = systemPrompt]
    B --> C{responseFormat?}
    C -->|prompt-based| D[Append schema to\npayload.system]
    C -->|native / none| E[No mutation]
    D --> F[applyAnthropicPromptCache\npayload, tools, request.systemPrompt]
    E --> F
    F --> G{shouldCacheStaticPrefix\ngateSystem, hasTools, toolsApproxChars}
    G -->|prefixTokens < 1024\nor no system| H[No-op: return]
    G -->|prefixTokens >= 1024\nhasTools or large system| I{payloadSystem\nnon-empty?}
    I -->|yes| J[payload.system = TextBlockParam\nwith cache_control: ephemeral]
    I -->|no - relocated| K[Skip system block]
    J --> L{tools present?}
    K --> L
    L -->|yes| M[tools lastIndex.cache_control = ephemeral]
    L -->|no| N[Done]
    M --> N
    N --> O[Add thinking config if requested]
    O --> P[API call]

Comments Outside Diff (1)

  1. apps/sim/providers/anthropic/core.ts, line 860-876 (link)

    P1 Accumulated tokens and cost still omit cache-token fees

    Now that caching is always on, every cached request produces non-zero cache_creation_input_tokens (billed at ~1.25× the regular input rate) and cache_read_input_tokens (billed at ~0.1×). Neither field is added to the accumulated tokens object, and calculateCost is called without the useCached flag, so the top-level cost returned in ProviderResponse silently undercounts. The per-segment trace path (buildAnthropicSegmentTokens → calculateCost(..., useCached)) handles this correctly, but the accumulated response totals do not.

    The same gap exists in all three code paths: the initial non-streaming token block (lines 860–865), the streaming-only path (~line 422), and the tool-loop accumulation (~line 1141). Before this PR, cache tokens were always zero so it was harmless; it now produces systematic underreporting on every warm-cache call.
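
To make the size of the gap concrete, here is a small sketch of the arithmetic. The usage field names follow Anthropic's response `usage` object and the 1.25× / 0.1× multipliers are the approximate rates quoted above; the `Pricing` shape, the `totalCostSketch` helper, and the per-million-token prices are hypothetical and are not the project's `calculateCost`.

```typescript
// Field names follow Anthropic's usage object, where input_tokens counts only
// the uncached portion of the input. Everything else here is hypothetical.
interface AnthropicUsage {
  input_tokens: number
  output_tokens: number
  cache_creation_input_tokens?: number | null
  cache_read_input_tokens?: number | null
}

interface Pricing {
  inputPerMTok: number // assumed price per million input tokens
  outputPerMTok: number // assumed price per million output tokens
}

const CACHE_WRITE_MULTIPLIER = 1.25 // approximate rate quoted in the review
const CACHE_READ_MULTIPLIER = 0.1 // approximate rate quoted in the review

function totalCostSketch(usage: AnthropicUsage, pricing: Pricing, includeCache: boolean): number {
  const cacheWrite = includeCache ? (usage.cache_creation_input_tokens ?? 0) : 0
  const cacheRead = includeCache ? (usage.cache_read_input_tokens ?? 0) : 0

  // Weight each input bucket by its billing multiplier before pricing it.
  const weightedInput =
    usage.input_tokens +
    cacheWrite * CACHE_WRITE_MULTIPLIER +
    cacheRead * CACHE_READ_MULTIPLIER

  return (weightedInput * pricing.inputPerMTok + usage.output_tokens * pricing.outputPerMTok) / 1_000_000
}

// A warm-cache call: most of the prompt is served from the cache.
const warmUsage: AnthropicUsage = {
  input_tokens: 200,
  output_tokens: 100,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 5000,
}
const examplePricing: Pricing = { inputPerMTok: 3, outputPerMTok: 15 }

const reportedToday = totalCostSketch(warmUsage, examplePricing, false)
const withCacheFees = totalCostSketch(warmUsage, examplePricing, true)
```

With these made-up prices, `reportedToday` comes to about 0.0021 and `withCacheFees` to about 0.0036, so ignoring the cache buckets hides a noticeable share of the input cost on a warm call.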

Reviews (7): Last reviewed commit: "test(providers): add request-capture tes..." | Re-trigger Greptile

Comment thread apps/sim/providers/prompt-cache.test.ts Outdated
…tubEnv

- anthropic/core.ts: gate on request.systemPrompt instead of payload.system, so
  the no-messages path (where the system text is relocated into a user message
  and payload.system is blanked) still caches the tools prefix. (Cursor review)
- prompt-cache.test.ts: manage the kill-switch env via vi.stubEnv/unstubAllEnvs
  instead of assigning undefined (which coerces to "undefined" and leaks across
  workers). Addresses the Greptile finding while satisfying biome's noDelete rule.
@waleedlatif1

Collaborator Author

@greptile review

@waleedlatif1

Collaborator Author

@cursor review

@cursor cursor Bot left a comment


✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 3a44936. Configure here.

…elper

- Remove the PROMPT_CACHE_DISABLED kill switch — prompt caching is always on.
- Extract the Anthropic tagging into applyAnthropicPromptCache(payload, tools,
  systemPrompt) in anthropic/utils.ts: one place that gates and mutates the
  system block + last tool, replacing the two inline blocks in core.ts.
- Add direct unit tests for the helper (system→cached block, last-tool tagged,
  relocated/blanked-system still tags tools, below-threshold and tool-less cases
  untouched) so the actual payload mutation is verified, not just the gate.

No behavior change to outputs; verified on vitest 4.1.8 (CI's version).
@waleedlatif1

Collaborator Author

@greptile review

@waleedlatif1

Collaborator Author

@cursor review

Comment thread apps/sim/providers/anthropic/utils.ts
Comment thread apps/sim/providers/prompt-cache.ts
…m and request prompt

Gate on max(final payload.system, request.systemPrompt) so caching fires both
when the no-messages path blanks payload.system (size via the request prompt)
and when prompt-based structured output appends a large schema to payload.system
(size via the final system string). Add a test for the schema-appended case.

Caught by Cursor Bugbot.
@waleedlatif1

Collaborator Author

@greptile review

@waleedlatif1

Collaborator Author

@cursor review

@cursor cursor Bot left a comment


✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 38140c7. Configure here.

Drop the inline // comments in favor of TSDoc on the helper/gate. The gate-sizing
and call-ordering rationale now lives in applyAnthropicPromptCache's TSDoc; no
behavior change.
@waleedlatif1

Collaborator Author

@greptile review

@waleedlatif1

Collaborator Author

@cursor review

@cursor cursor Bot left a comment


✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 5e90631. Configure here.

Comment thread apps/sim/providers/anthropic/core.ts
Drives the real executeAnthropicProviderRequest down the streaming path with only
the client injected via the createClient seam (real models/utils/attachments),
and asserts the request payload handed to messages.create carries a
cache_control-tagged system block for a large prompt and a plain string for a
small one. Closes the end-to-end wiring gap (AI-SDK-style request-body capture).
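
The request-capture idea in this commit can be illustrated without a test framework. The sketch below is not the PR's test: `createCapturingClient`, `sendRequestSketch`, the `FakeClient` shape, and the 4,096-character cutoff are hypothetical stand-ins for the injected client, the real executeAnthropicProviderRequest, and the real gate.

```typescript
// Hypothetical sketch of request-body capture with an injected fake client.
type CacheControl = { type: 'ephemeral' }

interface TextBlock {
  type: 'text'
  text: string
  cache_control?: CacheControl
}

interface CreateBody {
  system?: string | TextBlock[]
  messages: Array<{ role: 'user' | 'assistant'; content: string }>
}

interface FakeClient {
  messages: { create(body: CreateBody): Promise<{ text: string }> }
}

// Records every body handed to messages.create so a test can inspect it.
function createCapturingClient(): { client: FakeClient; captured: CreateBody[] } {
  const captured: CreateBody[] = []
  const client: FakeClient = {
    messages: {
      create(body: CreateBody) {
        captured.push(body)
        return Promise.resolve({ text: 'ok' })
      },
    },
  }
  return { client, captured }
}

// Stand-in for the provider entry point; the cutoff is an assumed proxy
// for "roughly 1,024 tokens at 4 characters per token".
async function sendRequestSketch(client: FakeClient, systemPrompt: string, userText: string): Promise<string> {
  const cacheable = systemPrompt.length >= 4096
  const body: CreateBody = {
    system: cacheable
      ? [{ type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' } }]
      : systemPrompt,
    messages: [{ role: 'user', content: userText }],
  }
  const response = await client.messages.create(body)
  return response.text
}

// Assertion helper: true when the captured system is a tagged text block.
function capturedSystemIsTagged(body: CreateBody): boolean {
  const system = body.system
  return Array.isArray(system) && system[0]?.cache_control?.type === 'ephemeral'
}
```

A test would call `sendRequestSketch` once with a large prompt and once with a small one, then check `captured[0]` with `capturedSystemIsTagged` in the first case and `typeof captured[0].system === 'string'` in the second.
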
@waleedlatif1

Collaborator Author

@greptile review

@waleedlatif1

Collaborator Author

@cursor review

@cursor cursor Bot left a comment


✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit b9a453d. Configure here.

@waleedlatif1 waleedlatif1 deleted the branch staging July 1, 2026 05:43
@waleedlatif1 waleedlatif1 reopened this Jul 1, 2026

Labels

None yet

1 participant