
Semantic Caching vs Prompt Caching: You Probably Need Both

Alex Ozhima
March 13, 2026

Most teams think semantic caching and prompt caching are alternatives.

They're not. They solve different problems — and you probably need both.

How LLM Inference Actually Works: Prefill and Decode

Before diving into caching strategies, you need to understand the two phases of every LLM request.

Prefill phase: The model processes your entire input (system prompt, context, user query) in parallel. Every input token gets converted into key-value (KV) tensors through the attention layers. This is the computationally expensive part — it scales with input length and is the main driver of time-to-first-token (TTFT) latency.

Decode phase: The model generates output tokens one at a time, each depending on the previous token plus the KV cache from prefill. This phase determines tokens-per-second throughput and is inherently sequential.

Why this matters: prompt caching targets the prefill phase. If your system prompt and context are identical across requests, those KV tensors don't need to be recomputed. The model skips straight to processing the new tokens and starts decoding faster.

Semantic caching skips both phases entirely — it never calls the LLM at all.

Semantic Caching: Skip the LLM Call Entirely

Semantic caching works at the application layer, before any LLM request is made.

How it works:

  1. Convert the user query to an embedding vector (768-1536 dimensions)
  2. Search a vector store for similar past queries using cosine similarity (threshold ~0.92-0.95)
  3. If a match is found, return the cached response directly
  4. If no match, call the LLM and store the query-response pair

"What's your refund policy?" matches "Can I get my money back?" — same cached answer, no LLM call needed.
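The four steps above can be sketched in a few lines. This is an illustration of the lookup logic only — in production, `embed` would call a real embedding model (e.g. text-embedding-3-small) and the linear scan would be a vector store query:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # any function: str -> list[float]
        self.threshold = threshold  # cosine similarity cutoff
        self.entries = []           # list of (vector, response) pairs

    def get(self, query):
        """Return the cached response for the most similar past query,
        or None if nothing clears the threshold (cache miss)."""
        qv = self.embed(query)
        best, best_sim = None, 0.0
        for vec, response in self.entries:
            sim = cosine(qv, vec)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def put(self, query, response):
        """Store a query-response pair after an LLM call."""
        self.entries.append((self.embed(query), response))
```

Note that the threshold lives in one place, which makes the tuning described below straightforward to experiment with.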

Performance characteristics:

Metric         Cache Hit                        Cache Miss
Latency        10-50ms                          1-5 seconds
LLM API cost   $0                               Full price
Compute        Embedding + vector search only   Full inference

Best for: Repetitive queries with natural paraphrasing — FAQs, support bots, RAG systems where many users ask the same questions differently.

Limitations: Only works when queries are semantically equivalent. A slightly different question needs a fresh LLM call. The similarity threshold is critical — too low and you return wrong answers, too high and you rarely get cache hits.

Prompt Caching: Optimize the Prefill Phase

Prompt caching operates inside the LLM provider's infrastructure. It stores the computed KV tensors from the prefill phase so they don't need to be recomputed on subsequent requests.

How it works:

  1. On the first request, the provider computes KV tensors for your full input and caches them
  2. On subsequent requests with the same prefix, the provider loads the cached KV tensors directly
  3. Only new tokens (the part that changed) go through full prefill computation
  4. The model still runs — decode phase happens normally

The critical requirement: byte-for-byte prefix matching. The cache only works when your input starts with the exact same sequence of tokens. This means your system prompt must always come first, followed by any stable context (documents, conversation history), with the varying user query at the end.

System prompt placement matters. Your system prompt must always be the first content in every request. If you put the user message before the system prompt, or vary the system prompt between requests, you break the prefix match and get zero cache benefit. Structure every request as: system prompt → stable context → variable content.
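A sketch of a cache-friendly request body, loosely following Anthropic's `cache_control` convention — the model name is a placeholder, and field names vary by provider, so check your provider's docs before relying on this shape:

```python
def build_request(system_prompt, stable_context, user_query):
    """Build a prompt-cache-friendly request body (Anthropic-style).

    The stable prefix (system prompt + context) comes first and is
    marked with a cache_control breakpoint; only the user query varies,
    so the cached KV tensors for the prefix are reused across requests.
    """
    return {
        "model": "claude-sonnet-4-5",  # placeholder: substitute your model
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": stable_context,
                # breakpoint: everything up to and including this block
                # is eligible for caching
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": user_query}],
    }
```

The key design point is ordering: anything above the breakpoint must be byte-identical across requests, and anything that varies must come after it.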

Performance characteristics:

Metric              With Prompt Cache   Without
TTFT Latency        10-85% faster       Baseline
Input token cost    50-90% discount     Full price
Output token cost   Same                Same

Best for: Large stable context — system prompts, knowledge bases, few-shot examples, conversation history — that stays the same across requests while only the latest query changes.

The Break-Even Math: Why It Almost Always Pays Off

Anthropic's prompt caching pricing is designed so that it pays for itself almost immediately.

The pricing structure (Claude Sonnet):

Token Type                Price per 1M tokens   Multiplier
Standard input            $3.00                 1x
Cache write (5-min TTL)   $3.75                 1.25x
Cache read                $0.30                 0.1x

The first time you send a prompt, you pay a 25% premium ($0.75/M extra) to write it to cache. Every subsequent read saves you **$2.70/M** ($3.00 - $0.30).

Break-even calculation:

Extra write cost: $0.75/M tokens. Savings per read: $2.70/M tokens.

Reads to recover the write cost: $0.75 / $2.70 ≈ 0.28 reads.

Add the initial write itself and you need ~1.3 total requests to break even. By the second request hitting the same prefix, you're already saving money.

For a 2,000-token system prompt at $3.00/M input:

  • Request 1 (write): $0.0075 (vs $0.006 without caching) — you paid $0.0015 extra
  • Request 2 (read): $0.0006 (vs $0.006 without caching) — you saved $0.0054
  • Net after 2 requests: $0.0039 saved

That's why the break-even is essentially request #2. If your system prompt is used more than once within the TTL window — and it almost always is — prompt caching is free money.
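The break-even arithmetic above fits in a few lines (prices are the Claude Sonnet figures from the pricing table; treat them as illustrative):

```python
STANDARD = 3.00      # $/M input tokens, standard (Claude Sonnet)
CACHE_WRITE = 3.75   # $/M input tokens, cache write (5-minute TTL)
CACHE_READ = 0.30    # $/M input tokens, cache read

def cost_with_cache(prompt_tokens, requests):
    """First request writes the cache; the rest read it."""
    m = prompt_tokens / 1e6
    return m * CACHE_WRITE + (requests - 1) * m * CACHE_READ

def cost_without_cache(prompt_tokens, requests):
    """Every request pays full price for the prompt."""
    return requests * prompt_tokens / 1e6 * STANDARD

# 2,000-token system prompt, two requests:
# with cache:    0.0075 + 0.0006 = 0.0081
# without cache: 0.006  + 0.006  = 0.0120  -> $0.0039 saved
```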

Layering Both: The Production Architecture

The real insight is that these aren't competing strategies. Production systems layer them:

Example: Customer support agent with a 100K token knowledge base.

Semantic cache catches "How do I reset my password?" vs "I forgot my password" — same cached answer, zero LLM cost. For new questions that miss the semantic cache, prompt cache prevents reprocessing the entire 100K knowledge base on every request.
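The layered lookup reduces to a short routing function. Here `semantic_cache` and `call_llm` are placeholders for whatever vector store and LLM client you use; the sketch only shows the control flow:

```python
def answer(query, semantic_cache, call_llm):
    """Layered lookup: semantic cache first, then a prompt-cached LLM call.

    semantic_cache needs get/put methods; call_llm is any function that
    sends a request whose stable prefix (system prompt + knowledge base)
    is eligible for provider-side prompt caching.
    """
    cached = semantic_cache.get(query)
    if cached is not None:
        return cached           # layer 1 hit: no LLM call at all
    response = call_llm(query)  # layer 2: prefill reuses cached KV tensors
    semantic_cache.put(query, response)
    return response
```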

The math on a 100K request/month chatbot with a 2K token system prompt:

Strategy                             Monthly Cost
No caching                           ~$600
Prompt caching only (90% hit rate)   ~$114
Semantic + prompt caching            ~$40-60

Which LLMs Support Prompt Caching?

Every major provider now offers some form of prompt caching, but the implementations differ significantly.

Provider     Feature Name      Read Discount       Cache Write Cost            Min Tokens       TTL                   Automatic?
Anthropic    Prompt Caching    90%                 1.25x (5-min) / 2x (1-hr)   1,024-4,096      5 min or 1 hr         Both
OpenAI       Prompt Caching    50-90% (by model)   None                        1,024            5-10 min (or 24 hr)   Automatic
Google       Context Caching   75-90% (by model)   Storage: $/hr/M tokens      1,024-32,768     Configurable          Both
DeepSeek     Prefix Caching    90%                 None                        Not specified    Disk-backed           Automatic
xAI (Grok)   Prompt Caching    ~75%                None                        Not specified    ~5 min                Automatic
Mistral      Cache Read        ~50%                Not documented              Not documented   Not documented        Likely automatic

Key differences to note:

Write costs vs. free caching. Anthropic charges a premium to write to cache (1.25x-2x). Google charges ongoing storage fees per hour. OpenAI, DeepSeek, and xAI charge nothing extra — you only pay less on cache hits, never more on misses.

Automatic vs. explicit. OpenAI, DeepSeek, and xAI are fully automatic — no code changes needed. Anthropic and Google offer both automatic and explicit modes. Explicit mode gives more control through cache breakpoints but requires API changes.

Discount depth. Anthropic and DeepSeek offer the deepest discounts at 90%. OpenAI ranges from 50% (GPT-4o) to 90% (GPT-5). Google offers 75-90% depending on model generation.

Absolute cost. DeepSeek's cache hit price of $0.028/M tokens is dramatically cheaper than any competitor — even cheaper than other providers' cached prices.

TTL tradeoffs. Most default to 5-10 minutes, refreshed on each hit. Anthropic offers a paid 1-hour TTL. Google lets you set arbitrary TTLs with storage costs. OpenAI's newer models support 24-hour retention via GPU-local KV storage.

Implementation Checklist

If you're building LLM applications, here's how to layer both caching strategies:

Semantic caching layer (application-side):

  1. Choose an embedding model (e.g., OpenAI text-embedding-3-small, Cohere embed-v4)
  2. Set up a vector store (Pinecone, Qdrant, pgvector, or even in-memory for small scale)
  3. Set your similarity threshold carefully — start at 0.95 and lower only with testing
  4. Cache responses with metadata (timestamp, query, model version)
  5. Implement cache invalidation when your knowledge base updates

Prompt caching layer (API-side):

  1. Structure every request correctly: system prompt → tools → stable context → variable content
  2. Keep your system prompt identical across all requests — even whitespace changes break the cache
  3. For Anthropic: add cache_control breakpoints or use automatic caching
  4. For OpenAI/DeepSeek: it just works — no code changes needed
  5. Monitor cache hit rates in your provider dashboard
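For the monitoring step, the hit rate can be derived from the usage block a provider returns with each response. The field names below follow Anthropic's response shape and will differ on other providers:

```python
def cache_hit_rate(usage):
    """Fraction of input tokens served from the prompt cache.

    `usage` is assumed to mirror Anthropic's usage object: input_tokens
    counts uncached input, while cache_read_input_tokens and
    cache_creation_input_tokens count cached reads and writes.
    """
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0
```

A rate near zero on a stable system prompt usually means the prefix is not byte-identical across requests — the most common culprit being dynamic content (timestamps, user names) placed before the stable context.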

The Bottom Line

Semantic caching and prompt caching aren't alternatives — they're complementary layers in a cost-optimization stack.

Semantic caching eliminates redundant LLM calls entirely. Prompt caching makes the remaining calls cheaper and faster. Together, they can cut your LLM costs by 40-90% while improving latency.

If you're spending more than a few hundred dollars a month on LLM APIs, you should be using both.


Need help optimizing your LLM infrastructure costs? Let's talk.

Alex Ozhima

Founder & CEO at Katlextech
