Most teams think semantic caching and prompt caching are alternatives.
They're not. They solve different problems — and you probably need both.
## How LLM Inference Actually Works: Prefill and Decode
Before diving into caching strategies, you need to understand the two phases of every LLM request.
Prefill phase: The model processes your entire input (system prompt, context, user query) in parallel. Every input token gets converted into key-value (KV) tensors through the attention layers. This is the computationally expensive part — it scales with input length and is the main driver of time-to-first-token (TTFT) latency.
Decode phase: The model generates output tokens one at a time, each depending on the previous token plus the KV cache from prefill. This phase determines tokens-per-second throughput and is inherently sequential.
Why this matters: prompt caching targets the prefill phase. If your system prompt and context are identical across requests, those KV tensors don't need to be recomputed. The model skips straight to processing the new tokens and starts decoding faster.
Semantic caching skips both phases entirely — it never calls the LLM at all.
## Semantic Caching: Skip the LLM Call Entirely
Semantic caching works at the application layer, before any LLM request is made.
How it works:
- Convert the user query to an embedding vector (768-1536 dimensions)
- Search a vector store for similar past queries using cosine similarity (threshold ~0.92-0.95)
- If a match is found, return the cached response directly
- If no match, call the LLM and store the query-response pair
"What's your refund policy?" matches "Can I get my money back?" — same cached answer, no LLM call needed.
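The lookup flow above can be sketched as a minimal in-memory cache. The `embed` function here is a deliberately crude stand-in (a hashed bag of words) so the sketch runs without an API key; a real deployment would call an embedding model such as `text-embedding-3-small` and query a vector store instead:

```python
import math

def embed(text):
    """Stand-in embedding: bag of words hashed into 64 dimensions.
    Replace with a real embedding model in production."""
    vec = [0.0] * 64
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.93):
        self.threshold = threshold        # start high; lower only with testing
        self.entries = []                 # (embedding, response) pairs

    def get(self, query):
        q = embed(query)
        scored = [(cosine(q, emb), resp) for emb, resp in self.entries]
        if scored:
            score, response = max(scored, key=lambda s: s[0])
            if score >= self.threshold:
                return response           # cache hit: no LLM call
        return None                       # cache miss: caller invokes the LLM

    def put(self, query, response):
        self.entries.append((embed(query), response))
```

Note that with a real embedding model, paraphrases like the refund example land above the threshold; the toy `embed` here only matches queries with overlapping words.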
Performance characteristics:
| Metric | Cache Hit | Cache Miss |
|---|---|---|
| Latency | 10-50ms | 1-5 seconds |
| LLM API cost | $0 | Full price |
| Compute | Embedding + vector search only | Full inference |
Best for: Repetitive queries with natural paraphrasing — FAQs, support bots, RAG systems where many users ask the same questions differently.
Limitations: Only works when queries are semantically equivalent. A slightly different question needs a fresh LLM call. The similarity threshold is critical — too low and you return wrong answers, too high and you rarely get cache hits.
## Prompt Caching: Optimize the Prefill Phase
Prompt caching operates inside the LLM provider's infrastructure. It stores the computed KV tensors from the prefill phase so they don't need to be recomputed on subsequent requests.
How it works:
- On the first request, the provider computes KV tensors for your full input and caches them
- On subsequent requests with the same prefix, the provider loads the cached KV tensors directly
- Only new tokens (the part that changed) go through full prefill computation
- The model still runs — decode phase happens normally
The critical requirement: byte-for-byte prefix matching. The cache only works when your input starts with the exact same sequence of tokens. This means your system prompt must always come first, followed by any stable context (documents, conversation history), with the varying user query at the end.
System prompt placement matters. Your system prompt must always be the first content in every request. If you put the user message before the system prompt, or vary the system prompt between requests, you break the prefix match and get zero cache benefit. Structure every request as: system prompt → stable context → variable content.
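A request builder following that ordering might look like the sketch below, using Anthropic-style `cache_control` breakpoints. The model id, system prompt text, and function name are placeholder assumptions, not from the original; other providers need no markup at all since their caching is automatic:

```python
def build_request(knowledge_base: str, user_query: str) -> dict:
    """Order the request as: system prompt -> stable context -> variable content,
    so the byte-identical prefix stays cacheable across requests."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder id; substitute your model
        "max_tokens": 1024,
        "system": [
            # 1. System prompt: byte-identical on every request.
            {"type": "text",
             "text": "You are a support agent for Acme Co."},
            # 2. Stable context: everything up to this breakpoint gets cached.
            {"type": "text",
             "text": knowledge_base,
             "cache_control": {"type": "ephemeral"}},
        ],
        # 3. Variable content last, so it never invalidates the cached prefix.
        "messages": [{"role": "user", "content": user_query}],
    }
```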
Performance characteristics:
| Metric | With Prompt Cache | Without |
|---|---|---|
| TTFT Latency | 10-85% faster | Baseline |
| Input token cost | 50-90% discount | Full price |
| Output token cost | Same | Same |
Best for: Large stable context — system prompts, knowledge bases, few-shot examples, conversation history — that stays the same across requests while only the latest query changes.
## The Break-Even Math: Why It Almost Always Pays Off
Anthropic's prompt caching pricing is designed so that it pays for itself almost immediately.
The pricing structure (Claude Sonnet):
| Token Type | Price per 1M tokens | Multiplier |
|---|---|---|
| Standard input | $3.00 | 1x |
| Cache write (5-min TTL) | $3.75 | 1.25x |
| Cache read | $0.30 | 0.1x |
The first time you send a prompt, you pay a 25% premium ($3.75/M instead of $3.00/M). Every subsequent hit on that prefix is billed at the cache-read rate of $0.30/M, a savings of $2.70/M versus standard input.
Break-even calculation:
Extra write cost: $0.75/M tokens ($3.75 instead of $3.00).
Savings per cache read: $2.70/M tokens ($0.30 instead of $3.00).
Reads to recover the write premium: $0.75 / $2.70 ≈ 0.28 reads.
In other words, the cache pays for itself partway through the first read. By the second request hitting the same prefix, you're already saving money.
For a 2,000-token system prompt at $3.00/M input:
- Request 1 (write): $0.0075 (vs $0.006 without caching), so you pay $0.0015 extra
- Request 2 (read): $0.0006 (vs $0.006 without caching), so you save $0.0054
- Net after 2 requests: $0.0039 saved
That's why the break-even is essentially request #2. If your system prompt is used more than once within the TTL window — and it almost always is — prompt caching is free money.
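The arithmetic above can be checked with a few lines. Prices are from the Claude Sonnet table; this is a back-of-envelope sketch, not a billing tool:

```python
STANDARD = 3.00   # $ per 1M standard input tokens (Claude Sonnet)
WRITE = 3.75      # 1.25x cache-write premium, 5-minute TTL
READ = 0.30       # 0.1x cache-read rate

def input_cost(tokens, requests, cached=True):
    """Total input-token cost in dollars for `requests` sends of the same
    `tokens`-long prefix: one cache write, then cache reads."""
    per_m = tokens / 1e6
    if not cached:
        return per_m * STANDARD * requests
    return per_m * (WRITE + READ * (requests - 1))

# Break-even: the $0.75/M write premium is recovered after
# 0.75 / 2.70 ≈ 0.28 reads, i.e. before the first read completes.
reads_to_break_even = (WRITE - STANDARD) / (STANDARD - READ)
```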
## Layering Both: The Production Architecture
The real insight is that these aren't competing strategies. Production systems layer them:
Example: Customer support agent with a 100K token knowledge base.
Semantic cache catches "How do I reset my password?" vs "I forgot my password" — same cached answer, zero LLM cost. For new questions that miss the semantic cache, prompt cache prevents reprocessing the entire 100K knowledge base on every request.
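The layering can be sketched as a single lookup function. The exact-match cache and `call_llm` parameter here are hypothetical stand-ins; in production you would plug in an embedding-based semantic cache and a prompt-cache-friendly LLM client:

```python
def answer(query, semantic_cache, call_llm):
    """Layer 1: semantic cache (skips the LLM call entirely).
    Layer 2: the LLM call itself, where the provider's prompt cache
    makes the stable prefix cheap. Returns (response, source)."""
    hit = semantic_cache.get(query)
    if hit is not None:
        return hit, "semantic-hit"
    response = call_llm(query)        # prompt caching applies inside this call
    semantic_cache.put(query, response)
    return response, "llm-call"

class ExactMatchCache:
    """Stand-in for a real embedding-based semantic cache."""
    def __init__(self):
        self._store = {}
    def get(self, query):
        return self._store.get(query)
    def put(self, query, response):
        self._store[query] = response
```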
The math on a 100K request/month chatbot with a 2K token system prompt:
| Strategy | Monthly Cost |
|---|---|
| No caching | ~$600 |
| Prompt caching only (90% hit rate) | ~$114 |
| Semantic + prompt caching | ~$40-60 |
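The first two rows of that table can be reproduced with a back-of-envelope model. For simplicity it bills cache misses at the standard input rate and ignores the cache-write premium, which is the assumption behind the ~$114 figure:

```python
def monthly_input_cost(requests, prompt_tokens, hit_rate=0.0,
                       standard=3.00, cached_read=0.30):
    """Monthly input-token spend in dollars: cache misses at the standard
    rate, cache hits at the read rate. Write premium ignored for simplicity."""
    per_m = prompt_tokens / 1e6
    hits = requests * hit_rate
    misses = requests - hits
    return per_m * (misses * standard + hits * cached_read)

no_cache = monthly_input_cost(100_000, 2_000)                  # ~$600/month
prompt_only = monthly_input_cost(100_000, 2_000, hit_rate=0.9) # ~$114/month
```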
## Which LLMs Support Prompt Caching?
Every major provider now offers some form of prompt caching, but the implementations differ significantly.
| Provider | Feature Name | Read Discount | Cache Write Cost | Min Tokens | TTL | Automatic? |
|---|---|---|---|---|---|---|
| Anthropic | Prompt Caching | 90% | 1.25x (5-min) / 2x (1-hr) | 1,024-4,096 | 5 min or 1 hr | Both |
| OpenAI | Prompt Caching | 50-90% (by model) | None | 1,024 | 5-10 min (or 24 hr) | Automatic |
| Google (Gemini) | Context Caching | 75-90% (by model) | Storage: $/hr/M tokens | 1,024-32,768 | Configurable | Both |
| DeepSeek | Prefix Caching | 90% | None | Not specified | Disk-backed | Automatic |
| xAI (Grok) | Prompt Caching | ~75% | None | Not specified | ~5 min | Automatic |
| Mistral | Cache Read | ~50% | Not documented | Not documented | Not documented | Likely automatic |
Key differences to note:
Write costs vs. free caching. Anthropic charges a premium to write to cache (1.25x-2x). Google charges ongoing storage fees per hour. OpenAI, DeepSeek, and xAI charge nothing extra — you only pay less on cache hits, never more on misses.
Automatic vs. explicit. OpenAI, DeepSeek, and xAI are fully automatic — no code changes needed. Anthropic and Google offer both automatic and explicit modes. Explicit mode gives more control through cache breakpoints but requires API changes.
Discount depth. Anthropic and DeepSeek offer the deepest discounts at 90%. OpenAI ranges from 50% (GPT-4o) to 90% (GPT-5). Google offers 75-90% depending on model generation.
Absolute cost. DeepSeek's cache hit price of $0.028/M tokens is dramatically cheaper than any competitor — even cheaper than other providers' cached prices.
TTL tradeoffs. Most default to 5-10 minutes, refreshed on each hit. Anthropic offers a paid 1-hour TTL. Google lets you set arbitrary TTLs with storage costs. OpenAI's newer models support 24-hour retention via GPU-local KV storage.
## Implementation Checklist
If you're building LLM applications, here's how to layer both caching strategies:
Semantic caching layer (application-side):
- Choose an embedding model (e.g., OpenAI `text-embedding-3-small`, Cohere `embed-v4`)
- Set up a vector store (Pinecone, Qdrant, pgvector, or even in-memory for small scale)
- Set your similarity threshold carefully — start at 0.95 and lower only with testing
- Cache responses with metadata (timestamp, query, model version)
- Implement cache invalidation when your knowledge base updates
Prompt caching layer (API-side):
- Structure every request correctly: system prompt → tools → stable context → variable content
- Keep your system prompt identical across all requests — even whitespace changes break the cache
- For Anthropic: add `cache_control` breakpoints or use automatic caching
- For OpenAI/DeepSeek: it just works — no code changes needed
- Monitor cache hit rates in your provider dashboard
## The Bottom Line
Semantic caching and prompt caching aren't alternatives — they're complementary layers in a cost-optimization stack.
Semantic caching eliminates redundant LLM calls entirely. Prompt caching makes the remaining calls cheaper and faster. Together, they can cut your LLM costs by 40-90% while improving latency.
If you're spending more than a few hundred dollars a month on LLM APIs, you should be using both.
Need help optimizing your LLM infrastructure costs? Let's talk.
