Anthropic gives you four enormous cost levers: cache reads at ~10% of input price, a flat 50% Batch API discount, a 10× price spread across the model lineup, and effort control. Almost nobody pulls them correctly. This playbook is the implementation manual: the exact mechanics, the failure modes, the eval gates, and the code.
Prompt caching is a prefix match on exact bytes. One changed byte at position N invalidates every cache breakpoint at or after N. When teams tell me "we turned on caching and saved nothing," the cause is always one of six patterns sitting in the prompt-assembly path:
The verification loop: check usage.cache_read_input_tokens on every response in staging. If it's zero on the second identical-prefix request, diff the rendered prompt bytes between the two requests — the invalidator will be staring at you. The full chapter covers breakpoint placement patterns, 5-minute vs 1-hour TTL break-even math, and cache pre-warming with max_tokens: 0.
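A minimal sketch of that verification loop, in Python and with no network calls. It assumes you already log the rendered prompt bytes for each request and can pass the response's usage fields as a plain dict. The helper names (`cache_read_tokens`, `first_divergence`, `surviving_breakpoints`) are hypothetical, and breakpoints are modelled simply as the byte index where each cached prefix ends.

```python
from typing import Optional


def cache_read_tokens(usage: dict) -> int:
    """Cached-token count from a usage payload; 0 if the field is missing."""
    return int(usage.get("cache_read_input_tokens") or 0)


def first_divergence(a: bytes, b: bytes) -> Optional[int]:
    """Index of the first differing byte, or None if the inputs are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One prompt is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def surviving_breakpoints(breakpoints: list[int], divergence: Optional[int]) -> list[int]:
    """Breakpoints still valid after a change at `divergence`.

    Follows the rule in the text: a changed byte at position N invalidates
    every breakpoint at or after N.
    """
    if divergence is None:
        return list(breakpoints)
    return [bp for bp in breakpoints if bp < divergence]


def diagnose(first_prompt: bytes, second_prompt: bytes, second_usage: dict) -> str:
    """Explain why the second request did or did not read from cache."""
    if cache_read_tokens(second_usage) > 0:
        return "cache hit"
    offset = first_divergence(first_prompt, second_prompt)
    if offset is None:
        return "cache miss with identical bytes: check TTL and breakpoint placement"
    # Show a little context around the first differing byte.
    context = second_prompt[max(0, offset - 20): offset + 20]
    return f"cache miss: prompts diverge at byte {offset}, near {context!r}"
```

Run `diagnose` in staging on two requests that should share a prefix. The reported offset points at the part of the prompt-assembly path that produced the differing bytes.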
The Batch API is the least glamorous and most reliable lever: a flat 50% discount on every token, in exchange for accepting asynchronous delivery (typically under an hour, 24h worst case). The audit question is not "what is a batch job" but "which of my requests does a human actually wait for?" Everything else is batchable: nightly enrichment, embeddings-adjacent classification, eval runs, report generation, digest emails, backfills, re-scoring. Teams routinely discover 20–40% of traffic qualifies. Combined with caching, discounts compound: batched cache reads bill at 50% of the ~10% read price, or roughly 5% of the base input price.
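The audit and the compounding arithmetic can be sketched in a few lines of Python. The multipliers come from the figures quoted above (the cache-read figure is approximate); the `Workload` type, the traffic numbers in the usage note, and the function names are hypothetical. Cache-write pricing is deliberately left out.

```python
from dataclasses import dataclass

# Multipliers taken from the text; the cache-read figure is approximate ("~10%").
CACHE_READ_MULTIPLIER = 0.10
BATCH_MULTIPLIER = 0.50


def input_price_multiplier(cached: bool, batched: bool) -> float:
    """Fraction of the base input price paid for a token."""
    multiplier = 1.0
    if cached:
        multiplier *= CACHE_READ_MULTIPLIER
    if batched:
        multiplier *= BATCH_MULTIPLIER
    return multiplier


@dataclass
class Workload:
    name: str
    human_waits: bool   # the audit question: is someone blocked on this response?
    input_tokens: int   # tokens per day (illustrative units)


def batchable(workload: Workload) -> bool:
    """A workload qualifies for batching when no human waits on it."""
    return not workload.human_waits


def batchable_share(workloads: list[Workload]) -> float:
    """Share of total input tokens that could move to the Batch API."""
    total = sum(w.input_tokens for w in workloads)
    if total == 0:
        return 0.0
    eligible = sum(w.input_tokens for w in workloads if batchable(w))
    return eligible / total
```

For example, with a made-up traffic mix of `Workload("chat", True, 700)` and `Workload("nightly-enrichment", False, 300)`, `batchable_share` reports 0.3, and `input_price_multiplier(cached=True, batched=True)` gives the 5% figure for batched cache reads.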