The Fundamental Truth About LLMs
Here's something that's obvious once you see it, but often overlooked: LLM output quality is entirely dependent on input quality. The model doesn't "know" things—it pattern-matches against whatever context you provide.
This has profound implications:
- A generic prompt gets a generic answer
- A prompt with relevant context gets a relevant answer
- A prompt with exactly the right context gets an expert-level answer
The entire field of "prompt engineering" is really just context engineering.
The Context Window Problem
Every LLM has a context window—the maximum amount of text it can process at once:
| Model | Context Window | Approximate Words |
|---|---|---|
| GPT-3.5 | 16K tokens | ~12,000 words |
| GPT-4o | 128K tokens | ~96,000 words |
| Claude 3.5 Sonnet | 200K tokens | ~150,000 words |
| Gemini 1.5 Pro | 2M tokens | ~1.5M words |
Sounds like a lot? Consider:
- A medium codebase: 500K+ lines
- Company documentation: millions of words
- Customer support history: terabytes
- Legal contracts archive: endless
You can never fit everything into context. So the question becomes: what do you load?
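To make the mismatch concrete, here's a rough back-of-envelope sketch (assuming ~0.75 words per token, a common approximation for English text, and ~8 words per line of code — both figures are illustrative):

```python
# Rough back-of-envelope: how much of a corpus fits in a context window?
# Assumes ~0.75 words per token, a common approximation for English text.
WORDS_PER_TOKEN = 0.75

def fraction_that_fits(corpus_words: int, window_tokens: int) -> float:
    """Fraction of a corpus (measured in words) that fits into a context window."""
    window_words = window_tokens * WORDS_PER_TOKEN
    return min(1.0, window_words / corpus_words)

# A 500K-line codebase at ~8 words per line is roughly 4M words.
codebase_words = 500_000 * 8
print(f"128K window: {fraction_that_fits(codebase_words, 128_000):.1%}")
print(f"2M window:   {fraction_that_fits(codebase_words, 2_000_000):.1%}")
```

Even the 2M-token window holds well under half of that codebase — selection is unavoidable.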
Context Mining: The Core Challenge
Context mining is the process of extracting the most relevant information from a large corpus to include in an LLM prompt. It's the difference between:
Without context mining:

```
User: "How do I fix the authentication bug?"
LLM: "Generally, authentication bugs can be caused by..."
```

*generic, unhelpful response*
With context mining:

```
User: "How do I fix the authentication bug?"
System: *retrieves relevant code, recent commits, error logs, docs*
LLM: "Looking at auth.ts:47, the JWT validation is failing because
     the token expiry check uses milliseconds but your config
     specifies seconds. Change line 52 to..."
```

*specific, actionable response*
RAG: Retrieval-Augmented Generation
RAG is the formal name for automated context mining. The process:
- Embed your corpus - Convert documents to vectors
- Query arrives - User asks a question
- Semantic search - Find vectors similar to the query
- Retrieve documents - Pull the actual content
- Assemble context - Build the prompt with retrieved info
- Generate response - LLM answers with the context
RAG isn't magic—it's just automated, intelligent context loading.
The Retrieval Quality Problem
Here's where most RAG implementations fail: retrieval quality directly determines response quality.
Failure Mode 1: Wrong Chunks
```python
# Naive chunking by character count
chunks = [text[i:i + 500] for i in range(0, len(text), 500)]

# Result: splits mid-sentence, loses context
# "...the authentication flow requires valid JWT tokens which..."
# becomes:
# Chunk 1: "...the authentication flow requires valid"
# Chunk 2: "JWT tokens which..."
```
Better approach: semantic chunking that respects document structure.
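A minimal sketch of structure-aware chunking — splitting on paragraph boundaries and packing paragraphs into size-limited chunks. Real semantic chunkers also consider sentences, headings, and embedding similarity:

```python
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Split on paragraph boundaries, packing whole paragraphs into chunks
    up to max_chars instead of cutting mid-sentence.
    A single paragraph longer than max_chars stays as one oversized chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```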
Failure Mode 2: Retrieval Misses
Semantic search can miss relevant content when:
- Query uses different terminology than the document
- Critical info is in a low-similarity chunk
- Document structure isn't captured in embeddings
Solution: hybrid retrieval (semantic + keyword), query expansion, and re-ranking.
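One common way to combine the two retrievers is reciprocal rank fusion (RRF), sketched here over two ranked lists of document IDs:

```python
def rrf_fuse(semantic_ranked: list[str], keyword_ranked: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge two ranked doc-id lists into one.
    A doc ranked highly by either retriever scores well overall;
    k damps the influence of top ranks (60 is a conventional default)."""
    scores: dict[str, float] = {}
    for ranking in (semantic_ranked, keyword_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A doc that appears in both lists ("b" below) outranks docs that appear in only one:

```python
fused = rrf_fuse(["a", "b", "c"], ["b", "d"])
# fused[0] is "b" — boosted by appearing in both rankings
```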
Failure Mode 3: Context Overload
Retrieving 20 documents when 3 would suffice:
- Wastes context window space
- Confuses the model with irrelevant info
- Increases latency and cost
Solution: relevance thresholds, diversity sampling, and context compression.
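A minimal sketch of these guards together — a score threshold, a crude word-overlap duplicate check standing in for real diversity sampling, and a document cap:

```python
def select_context(candidates: list[tuple[str, float]],
                   min_score: float = 0.75,
                   max_docs: int = 3) -> list[str]:
    """Keep only candidates above a relevance threshold, drop near-duplicate
    texts, and cap the total number of documents."""
    selected: list[str] = []
    for text, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if score < min_score or len(selected) >= max_docs:
            continue  # below threshold, or budget already spent
        # Crude diversity check: skip texts that heavily overlap a kept one.
        words = set(text.lower().split())
        if any(len(words & set(s.lower().split())) / max(len(words), 1) > 0.8
               for s in selected):
            continue
        selected.append(text)
    return selected
```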
Advanced Context Mining Strategies
1. Multi-Stage Retrieval
- First stage: retrieve 50 candidates quickly
- Second stage: re-rank with a cross-encoder
- Third stage: select the top 5 with diversity
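The staging can be sketched with stand-in scorers; in practice `cheap_score` would be a vector search and `rerank` a cross-encoder:

```python
def multi_stage_retrieve(query: str, corpus: list[str],
                         cheap_score, rerank,
                         n_candidates: int = 50, n_final: int = 5) -> list[str]:
    """Cheap recall over the full corpus, precise re-ranking of the
    survivors, then final top-k selection."""
    # Stage 1: fast, approximate scoring over everything (e.g. vector search).
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_candidates]
    # Stage 2: slow, precise re-ranking (e.g. a cross-encoder).
    reranked = sorted(candidates, key=lambda d: rerank(query, d), reverse=True)
    # Stage 3: final selection.
    return reranked[:n_final]

# Illustrative stand-in scorer: shared-word count between query and doc.
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
```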
2. Agentic Retrieval
Let the LLM decide what context it needs:
```
LLM: "To answer this, I need:
      1. The current implementation of auth.ts
      2. Recent changes to the JWT config
      3. Any related error logs from the past week"
System: *retrieves each requested item*
LLM: *now has exactly the context it needs*
```
This is how coding assistants like Claude Code work—they request files and context dynamically.
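One round of that loop can be sketched with placeholder `ask_llm` and `retrieve` callables — the stub model below stands in for a real one:

```python
def agentic_answer(question: str, ask_llm, retrieve) -> str:
    """One round of agentic retrieval: the model names the context it needs,
    the system fetches each item, and the model answers with it in hand.
    `ask_llm` and `retrieve` are placeholders for a real model and store."""
    needed = ask_llm(f"List the context items you need to answer: {question}")
    context = "\n\n".join(retrieve(item) for item in needed)
    return ask_llm(f"Context:\n{context}\n\nQuestion: {question}")

# Stubbed model and store, purely for illustration.
store = {"auth.ts": "jwt validation code", "error logs": "TokenExpired at 12:03"}

def fake_llm(prompt: str):
    if prompt.startswith("List"):
        return ["auth.ts", "error logs"]  # the "model" requests context
    return f"Answer based on: {prompt.count('TokenExpired')} matching log line(s)"

answer = agentic_answer("Why do logins fail?", fake_llm, store.get)
```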
3. Contextual Compression
Before stuffing documents into context, compress them:
Original (500 tokens):

```
The authentication system was designed in Q3 2024 by the
security team. It uses JWT tokens for session management.
The tokens are signed with RS256 algorithm. Expiry is set
to 24 hours by default. The refresh token mechanism...
```

Compressed (50 tokens):

```
Auth uses JWT with RS256 signing, 24-hour expiry,
refresh tokens for session extension.
```
Compression ratios of 10:1 are often achievable with minimal information loss for Q&A tasks.
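A cheap extractive stand-in for LLM-based compression: keep only the sentences that share words with the query (a real system would summarize with a model rather than filter):

```python
def extractive_compress(text: str, query: str, max_sentences: int = 2) -> str:
    """Keep only the sentences most relevant to the query — a cheap,
    lossy stand-in for LLM-based summarization."""
    query_words = set(query.lower().split())
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    scored = sorted(sentences,
                    key=lambda s: len(query_words & set(s.lower().split())),
                    reverse=True)
    kept = scored[:max_sentences]
    # Re-emit the kept sentences in their original order.
    return ". ".join(s for s in sentences if s in kept) + "."
```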
4. Structured Context
Instead of dumping text, structure the context:
```xml
<context>
  <current_file path="src/auth/jwt.ts">
    ... relevant code ...
  </current_file>
  <related_docs>
    <doc source="auth-architecture.md" relevance="0.92">
      ... key excerpts ...
    </doc>
  </related_docs>
  <recent_changes>
    - commit abc123: "Fixed token expiry calculation"
    - commit def456: "Updated JWT library to v2.1"
  </recent_changes>
</context>
```
Structure helps the LLM understand what each piece of context represents.
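Assembling retrieved pieces into that layout can be as simple as string building; `build_context` is a hypothetical helper, with tag names following the example above:

```python
from xml.sax.saxutils import escape

def build_context(current_file: tuple[str, str],
                  docs: list[tuple[str, float, str]],
                  commits: list[str]) -> str:
    """Assemble retrieved pieces into a tagged layout so the model
    can tell code, docs, and history apart."""
    path, code = current_file
    parts = ["<context>"]
    parts.append(f'<current_file path="{escape(path)}">\n{escape(code)}\n</current_file>')
    parts.append("<related_docs>")
    for source, relevance, excerpt in docs:
        parts.append(f'<doc source="{escape(source)}" relevance="{relevance:.2f}">\n'
                     f'{escape(excerpt)}\n</doc>')
    parts.append("</related_docs>")
    parts.append("<recent_changes>")
    parts.extend(f"- {escape(c)}" for c in commits)
    parts.append("</recent_changes>\n</context>")
    return "\n".join(parts)
```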
Context Mining for Code
Coding assistants have specific context needs:
| Context Type | Why It Matters |
|---|---|
| Current file | Immediate context for edits |
| Imports/dependencies | Understanding what's available |
| Type definitions | Correct function signatures |
| Related files | How components interact |
| Git history | Recent changes, who to ask |
| Error messages | Specific failure context |
| Test files | Expected behavior |
The best coding assistants dynamically assemble this context based on the task.
Building Your Own Context Mining Pipeline
Minimum Viable RAG
```python
from sentence_transformers import SentenceTransformer
import faiss

# 1. Embed your documents
model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["doc1 content...", "doc2 content..."]
embeddings = model.encode(docs)

# 2. Build index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# 3. Query (k must not exceed the number of indexed docs,
#    or FAISS pads the result with -1 indices)
user_question = "user question"
query_embedding = model.encode([user_question])
distances, indices = index.search(query_embedding, min(5, len(docs)))

# 4. Build prompt with retrieved docs
context = "\n".join(docs[i] for i in indices[0])
prompt = f"Context:\n{context}\n\nQuestion: {user_question}"
```
Production Considerations
For production systems, add:
- Chunking strategy - Respect document structure
- Metadata filtering - Filter by date, source, type
- Hybrid search - Combine semantic + keyword
- Re-ranking - Cross-encoder for precision
- Caching - Avoid redundant embeddings
- Monitoring - Track retrieval quality metrics
The Insight: Context is Everything
The profound realization is that LLMs are context processing machines. They don't have knowledge—they have pattern matching against context.
This means:
- Better context > Better prompts - Optimizing what goes in beats optimizing how you ask
- RAG is not optional - For domain-specific tasks, retrieval is mandatory
- Context quality is measurable - You can A/B test retrieval strategies
- The moat is in the data - Your proprietary context is your competitive advantage
Conclusion
Context mining automation—whether you call it RAG, retrieval, or smart context loading—is the difference between a generic AI toy and a domain expert assistant.
The best AI systems don't just have better models. They have better context pipelines.
Focus on:
- What context exists
- How to retrieve the right pieces
- How to assemble it effectively
- How to measure retrieval quality
Master context mining, and you've mastered the key to useful AI systems.
