Most teams think semantic caching and prompt caching are alternatives.
They're not. They solve different problems — and you probably need both.
## How LLM Inference Actually Works: Prefill and Decode
Before diving into caching strategies, you need to understand the two phases of every LLM request.
Prefill phase: The model processes your entire input (system prompt, context, user query) in parallel. Every input token gets converted into key-value (KV) tensors through the attention layers. This is the computationally expensive part — it scales with input length and is the main driver of time-to-first-token (TTFT) latency.
Decode phase: The model generates output tokens one at a time, each depending on the previous token plus the KV cache from prefill. This phase determines tokens-per-second throughput and is inherently sequential.
Why this matters: prompt caching targets the prefill phase. If your system prompt and context are identical across requests, those KV tensors don't need to be recomputed. The model skips straight to processing the new tokens and starts decoding faster.
Semantic caching skips both phases entirely — it never calls the LLM at all.
## Semantic Caching: Skip the LLM Call Entirely
Semantic caching works at the application layer, before any LLM request is made.
How it works:
- Convert the user query to an embedding vector (768-1536 dimensions)
- Search a vector store for similar past queries using cosine similarity (threshold ~0.92-0.95)
- If a match is found, return the cached response directly
- If no match, call the LLM and store the query-response pair
"What's your refund policy?" matches "Can I get my money back?" — same cached answer, no LLM call needed.
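The lookup flow above can be sketched as a minimal in-memory cache. The `embed` function here is a deliberately crude stand-in (a hashed bag of words) so the sketch runs without an API key; a real deployment would call an embedding model such as `text-embedding-3-small` and query a vector store instead:

```python
import math

def embed(text):
    """Stand-in embedding: bag of words hashed into 64 dimensions.
    Replace with a real embedding model in production."""
    vec = [0.0] * 64
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.93):
        self.threshold = threshold        # start high; lower only with testing
        self.entries = []                 # (embedding, response) pairs

    def get(self, query):
        q = embed(query)
        scored = [(cosine(q, emb), resp) for emb, resp in self.entries]
        if scored:
            score, response = max(scored, key=lambda s: s[0])
            if score >= self.threshold:
                return response           # cache hit: no LLM call
        return None                       # cache miss: caller invokes the LLM

    def put(self, query, response):
        self.entries.append((embed(query), response))
```

Note that with a real embedding model, paraphrases like the refund example land above the threshold; the toy `embed` here only matches queries with overlapping words.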
Performance characteristics:
| Metric | Cache Hit | Cache Miss |
|---|---|---|
| Latency | 10-50ms | 1-5 seconds |
| LLM API cost | $0 | Full price |
| Compute | Embedding + vector search only | Full inference |
Best for: Repetitive queries with natural paraphrasing — FAQs, support bots, RAG systems where many users ask the same questions differently.
Limitations: Only works when queries are semantically equivalent. A slightly different question needs a fresh LLM call. The similarity threshold is critical — too low and you return wrong answers, too high and you rarely get cache hits.
## Prompt Caching: Optimize the Prefill Phase
Prompt caching operates inside the LLM provider's infrastructure. It stores the computed KV tensors from the prefill phase so they don't need to be recomputed on subsequent requests.
How it works:
- On the first request, the provider computes KV tensors for your full input and caches them
- On subsequent requests with the same prefix, the provider loads the cached KV tensors directly
- Only new tokens (the part that changed) go through full prefill computation
- The model still runs — decode phase happens normally
The critical requirement: byte-for-byte prefix matching. The cache only works when your input starts with the exact same sequence of tokens. This means your system prompt must always come first, followed by any stable context (documents, conversation history), with the varying user query at the end.
System prompt placement matters. Your system prompt must always be the first content in every request. If you put the user message before the system prompt, or vary the system prompt between requests, you break the prefix match and get zero cache benefit. Structure every request as: system prompt → stable context → variable content.
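A request builder following that ordering might look like the sketch below, using Anthropic-style `cache_control` breakpoints. The model id, system prompt text, and function name are placeholder assumptions, not from the original; other providers need no markup at all since their caching is automatic:

```python
def build_request(knowledge_base: str, user_query: str) -> dict:
    """Order the request as: system prompt -> stable context -> variable content,
    so the byte-identical prefix stays cacheable across requests."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder id; substitute your model
        "max_tokens": 1024,
        "system": [
            # 1. System prompt: byte-identical on every request.
            {"type": "text",
             "text": "You are a support agent for Acme Co."},
            # 2. Stable context: everything up to this breakpoint gets cached.
            {"type": "text",
             "text": knowledge_base,
             "cache_control": {"type": "ephemeral"}},
        ],
        # 3. Variable content last, so it never invalidates the cached prefix.
        "messages": [{"role": "user", "content": user_query}],
    }
```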
Performance characteristics:
| Metric | With Prompt Cache | Without |
|---|---|---|
| TTFT Latency | 10-85% faster | Baseline |
| Input token cost | 50-90% discount | Full price |
| Output token cost | Same | Same |
Best for: Large stable context — system prompts, knowledge bases, few-shot examples, conversation history — that stays the same across requests while only the latest query changes.
## The Break-Even Math: Why It Almost Always Pays Off
Anthropic's prompt caching pricing is designed so that it pays for itself almost immediately.
The pricing structure (Claude Sonnet):
| Token Type | Price per 1M tokens | Multiplier |
|---|---|---|
| Standard input | $3.00 | 1x |
| Cache write (5-min TTL) | $3.75 | 1.25x |
| Cache read | $0.30 | 0.1x |
The first time you send a prompt, you pay a 25% premium ($3.75/M instead of $3.00/M). Every subsequent hit on that prefix is billed at the cache-read rate of $0.30/M, a savings of $2.70/M versus standard input.
Break-even calculation:
Extra write cost: $0.75/M tokens ($3.75 instead of $3.00).
Savings per cache read: $2.70/M tokens ($0.30 instead of $3.00).
Reads to recover the write premium: $0.75 / $2.70 ≈ 0.28 reads.
In other words, the cache pays for itself partway through the first read. By the second request hitting the same prefix, you're already saving money.
For a 2,000-token system prompt at $3.00/M input:
- Request 1 (write): $0.0075 (vs $0.006 without caching), so you pay $0.0015 extra
- Request 2 (read): $0.0006 (vs $0.006 without caching), so you save $0.0054
- Net after 2 requests: $0.0039 saved
That's why the break-even is essentially request #2. If your system prompt is used more than once within the TTL window — and it almost always is — prompt caching is free money.
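The arithmetic above can be checked with a few lines. Prices are from the Claude Sonnet table; this is a back-of-envelope sketch, not a billing tool:

```python
STANDARD = 3.00   # $ per 1M standard input tokens (Claude Sonnet)
WRITE = 3.75      # 1.25x cache-write premium, 5-minute TTL
READ = 0.30       # 0.1x cache-read rate

def input_cost(tokens, requests, cached=True):
    """Total input-token cost in dollars for `requests` sends of the same
    `tokens`-long prefix: one cache write, then cache reads."""
    per_m = tokens / 1e6
    if not cached:
        return per_m * STANDARD * requests
    return per_m * (WRITE + READ * (requests - 1))

# Break-even: the $0.75/M write premium is recovered after
# 0.75 / 2.70 ≈ 0.28 reads, i.e. before the first read completes.
reads_to_break_even = (WRITE - STANDARD) / (STANDARD - READ)
```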
## Layering Both: The Production Architecture
The real insight is that these aren't competing strategies. Production systems layer them:
Example: Customer support agent with a 100K token knowledge base.
Semantic cache catches "How do I reset my password?" vs "I forgot my password" — same cached answer, zero LLM cost. For new questions that miss the semantic cache, prompt cache prevents reprocessing the entire 100K knowledge base on every request.
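The layering can be sketched as a single lookup function. The exact-match cache and `call_llm` parameter here are hypothetical stand-ins; in production you would plug in an embedding-based semantic cache and a prompt-cache-friendly LLM client:

```python
def answer(query, semantic_cache, call_llm):
    """Layer 1: semantic cache (skips the LLM call entirely).
    Layer 2: the LLM call itself, where the provider's prompt cache
    makes the stable prefix cheap. Returns (response, source)."""
    hit = semantic_cache.get(query)
    if hit is not None:
        return hit, "semantic-hit"
    response = call_llm(query)        # prompt caching applies inside this call
    semantic_cache.put(query, response)
    return response, "llm-call"

class ExactMatchCache:
    """Stand-in for a real embedding-based semantic cache."""
    def __init__(self):
        self._store = {}
    def get(self, query):
        return self._store.get(query)
    def put(self, query, response):
        self._store[query] = response
```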
The math on a 100K request/month chatbot with a 2K token system prompt:
| Strategy | Monthly Cost |
|---|---|
| No caching | ~$600 |
| Prompt caching only (90% hit rate) | ~$114 |
| Semantic + prompt caching | ~$40-60 |
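The first two rows of that table can be reproduced with a back-of-envelope model. For simplicity it bills cache misses at the standard input rate and ignores the cache-write premium, which is the assumption behind the ~$114 figure:

```python
def monthly_input_cost(requests, prompt_tokens, hit_rate=0.0,
                       standard=3.00, cached_read=0.30):
    """Monthly input-token spend in dollars: cache misses at the standard
    rate, cache hits at the read rate. Write premium ignored for simplicity."""
    per_m = prompt_tokens / 1e6
    hits = requests * hit_rate
    misses = requests - hits
    return per_m * (misses * standard + hits * cached_read)

no_cache = monthly_input_cost(100_000, 2_000)                  # ~$600/month
prompt_only = monthly_input_cost(100_000, 2_000, hit_rate=0.9) # ~$114/month
```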
## Which LLMs Support Prompt Caching?
Every major provider now offers some form of prompt caching, but the implementations differ significantly.
| Provider | Feature Name | Read Discount | Cache Write Cost | Min Tokens | TTL | Automatic? |
|---|---|---|---|---|---|---|
| Anthropic | Prompt Caching | 90% | 1.25x (5-min) / 2x (1-hr) | 1,024-4,096 | 5 min or 1 hr | Both |
| OpenAI | Prompt Caching | 50-90% (by model) | None | 1,024 | 5-10 min (or 24 hr) | Automatic |
| Google (Gemini) | Context Caching | 75-90% (by model) | Storage: $/hr/M tokens | 1,024-32,768 | Configurable | Both |
| DeepSeek | Prefix Caching | 90% | None | Not specified | Disk-backed | Automatic |
| xAI (Grok) | Prompt Caching | ~75% | None | Not specified | ~5 min | Automatic |
| Mistral | Cache Read | ~50% | Not documented | Not documented | Not documented | Likely automatic |
Key differences to note:
Write costs vs. free caching. Anthropic charges a premium to write to cache (1.25x-2x). Google charges ongoing storage fees per hour. OpenAI, DeepSeek, and xAI charge nothing extra — you only pay less on cache hits, never more on misses.
Automatic vs. explicit. OpenAI, DeepSeek, and xAI are fully automatic — no code changes needed. Anthropic and Google offer both automatic and explicit modes. Explicit mode gives more control through cache breakpoints but requires API changes.
Discount depth. Anthropic and DeepSeek offer the deepest discounts at 90%. OpenAI ranges from 50% (GPT-4o) to 90% (GPT-5). Google offers 75-90% depending on model generation.
Absolute cost. DeepSeek's cache hit price of $0.028/M tokens is dramatically cheaper than any competitor — even cheaper than other providers' cached prices.
TTL tradeoffs. Most default to 5-10 minutes, refreshed on each hit. Anthropic offers a paid 1-hour TTL. Google lets you set arbitrary TTLs with storage costs. OpenAI's newer models support 24-hour retention via GPU-local KV storage.
## Implementation Checklist
If you're building LLM applications, here's how to layer both caching strategies:
Semantic caching layer (application-side):
- Choose an embedding model (e.g., OpenAI `text-embedding-3-small`, Cohere `embed-v4`)
- Set up a vector store (Pinecone, Qdrant, pgvector, or even in-memory for small scale)
- Set your similarity threshold carefully — start at 0.95 and lower only with testing
- Cache responses with metadata (timestamp, query, model version)
- Implement cache invalidation when your knowledge base updates
Prompt caching layer (API-side):
- Structure every request correctly: system prompt → tools → stable context → variable content
- Keep your system prompt identical across all requests — even whitespace changes break the cache
- For Anthropic: add `cache_control` breakpoints or use automatic caching
- For OpenAI/DeepSeek: it just works — no code changes needed
- Monitor cache hit rates in your provider dashboard
## The Bottom Line
Semantic caching and prompt caching aren't alternatives — they're complementary layers in a cost-optimization stack.
Semantic caching eliminates redundant LLM calls entirely. Prompt caching makes the remaining calls cheaper and faster. Together, they can cut your LLM costs by 40-90% while improving latency.
If you're spending more than a few hundred dollars a month on LLM APIs, you should be using both.
Need help optimizing your LLM infrastructure costs? Let's talk.
