Why This Matters
Claude Code is an excellent AI coding assistant, but there are scenarios where you need more control:
- Data Privacy: Your code and data never leave your trusted infrastructure
- Cost Control: Use subscription-based or self-hosted models instead of pay-per-token APIs
- Air-Gapped Environments: Work in secure environments without external API access
- Model Flexibility: Use any OpenAI-compatible model (Qwen, Llama, Mistral, etc.)
Architecture Overview
The Chain:
- OpenCode - an open-source Claude Code alternative with a TUI; sends requests in Anthropic Messages API format
- LiteLLM - Universal proxy that translates between API formats and routes to any provider
- Model Provider - Any OpenAI-compatible endpoint (Ollama, vLLM, OpenRouter, etc.)
Tools & Versions
| Tool | Version | URL |
|---|---|---|
| OpenCode | v1.1.39 | https://github.com/anomalyco/opencode |
| LiteLLM | latest | https://github.com/BerriAI/litellm |
| uv (Python) | latest | https://github.com/astral-sh/uv |
Setup Guide
1. Install OpenCode
```bash
curl -fsSL https://opencode.ai/install | bash
```
2. Install LiteLLM (via uv)
```bash
# No global install needed - run directly with uv
uv run --python 3.12 --with 'litellm[proxy]' litellm --version
```
3. Configuration Files
.env - Environment variables (keep private, add to .gitignore):
```bash
ANTHROPIC_API_KEY=dummy
PROVIDER_API_KEY=your-actual-api-key-here
```
litellm_config.yaml - LiteLLM proxy configuration:
```yaml
model_list:
  - model_name: claude-3-7-sonnet-latest
    litellm_params:
      model: Qwen/Qwen3-Coder-480B-A35B-Instruct
      api_base: https://your-provider-api.example.com/v1
      api_key: os.environ/PROVIDER_API_KEY
      custom_llm_provider: openai
  - model_name: claude-3-7-sonnet-20250219
    litellm_params:
      model: Qwen/Qwen3-Coder-480B-A35B-Instruct
      api_base: https://your-provider-api.example.com/v1
      api_key: os.environ/PROVIDER_API_KEY
      custom_llm_provider: openai

litellm_settings:
  set_verbose: false

general_settings:
  enable_jwt_auth: false

router_settings:
  enable_anthropic_messages: true  # Critical! Routes /v1/messages through model_list
```
opencode.json - OpenCode configuration:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "anthropic": {
      "options": {
        "baseURL": "http://localhost:4000/v1"
      }
    }
  },
  "model": "claude-3-7-sonnet-20250219",
  "tui": {
    "theme": "opencode"
  }
}
```
4. Running the Stack
Terminal 1 - Start LiteLLM Proxy:
```bash
set -a && source .env && set +a
uv run --python 3.12 --with 'litellm[proxy]' litellm --config litellm_config.yaml --port 4000
```
Terminal 2 - Start OpenCode:
```bash
set -a && source .env && set +a
opencode
```
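With both terminals up, you can sanity-check the proxy directly, before involving OpenCode at all. A minimal sketch in Python, assuming the proxy listens on localhost:4000 with no master key configured (in which case any `x-api-key` value is accepted):

```python
import json
import urllib.request

# Anthropic Messages API payload; the model name must match a
# model_name entry in litellm_config.yaml.
payload = {
    "model": "claude-3-7-sonnet-20250219",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Say hello in one word."}],
}

req = urllib.request.Request(
    "http://localhost:4000/v1/messages",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "x-api-key": "dummy",
        "anthropic-version": "2023-06-01",
    },
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.loads(resp.read())
        # A successful round trip returns an Anthropic-style response:
        # {"type": "message", "content": [{"type": "text", ...}], ...}
        print(body.get("type"), body["content"][0]["text"])
except OSError as exc:
    print("proxy not reachable:", exc)
```

If this returns a `message` response, the `/v1/messages` → provider translation is working and OpenCode will work too.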
Why LiteLLM is Necessary
You might wonder: "Why not connect OpenCode directly to the OpenAI-compatible API?"
The problem is API format incompatibility:
| Client | API Format | Endpoint |
|---|---|---|
| OpenCode (Anthropic provider) | Anthropic Messages | /v1/messages |
| OpenCode (OpenAI provider) | OpenAI Responses | /v1/responses |
| Most custom providers | OpenAI Chat | /v1/chat/completions |
OpenCode v1.1.39 uses:
- Anthropic provider → `/v1/messages` format
- OpenAI provider → `/v1/responses` format (a new API, not widely supported)

Most OpenAI-compatible providers (Ollama, vLLM, etc.) only support `/v1/chat/completions`.
LiteLLM bridges this gap by:
- Accepting Anthropic Messages API requests on `/v1/messages`
- Translating them to OpenAI Chat Completions format
- Forwarding them to your chosen provider

The key setting is `router_settings.enable_anthropic_messages: true`.
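Conceptually, what the proxy does on this path is a payload translation. A simplified sketch (not LiteLLM's actual code; it ignores tools, images, and streaming) of mapping an Anthropic Messages request to OpenAI Chat Completions:

```python
def anthropic_to_openai(req: dict) -> dict:
    """Translate an Anthropic Messages request into OpenAI Chat
    Completions format (simplified illustration)."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first chat message.
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    for m in req["messages"]:
        content = m["content"]
        # Anthropic message content may be a list of typed blocks.
        if isinstance(content, list):
            content = "".join(
                b["text"] for b in content if b.get("type") == "text"
            )
        messages.append({"role": m["role"], "content": content})
    return {
        "model": req["model"],  # rewritten by the proxy's model_list mapping
        "messages": messages,
        "max_tokens": req.get("max_tokens", 1024),
    }

anthropic_req = {
    "model": "claude-3-7-sonnet-20250219",
    "system": "You are a coding assistant.",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hi"}]}],
}
openai_req = anthropic_to_openai(anthropic_req)
```

The response travels the reverse path: the proxy wraps the provider's Chat Completions answer back into an Anthropic-style `message` object before returning it to OpenCode.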
Capabilities
OpenCode provides a full coding assistant experience:
- File reading and editing
- Code generation and refactoring
- Bug fixing and debugging
- Test writing
- Web search (native support)
Our Model Choice: Qwen
We use Qwen/Qwen3-Coder-480B-A35B-Instruct:
- 480B parameters (35B active with MoE)
- Optimized for code generation
- Excellent instruction following
- Available via various cloud providers or self-hosted
Success Story: Data Analytics
We successfully used this setup for logistics data analysis:
Task: Analyze 220K+ rows of fuel consumption data across a vehicle fleet
Results:
- Parsed complex Excel files with Cyrillic column names
- Generated Python analysis scripts
- Created visualizations and HTML reports
- Identified vehicles with abnormal fuel consumption patterns
- Compared consumption for vehicles under repair versus those in active service
The AI assistant handled the entire workflow: understanding data structure, writing pandas code, generating charts, and producing actionable insights.
Subscription-Based Model APIs
A significant advantage of this approach: unlimited usage pricing.
While major providers (OpenAI, Anthropic) charge per-token, some alternatives offer subscription models:
| Provider | Model | Pricing Model |
|---|---|---|
| Cloud providers | Qwen, Llama, etc. | Subscription tiers |
| Local Ollama | Any GGUF | Hardware cost only |
| Self-hosted vLLM | Any HF model | Infrastructure cost |
For heavy coding assistant usage—where each request consumes 10–20K input tokens plus 100–500 output tokens, and you might make thousands of requests per day—subscription models can be significantly more economical.
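To make that concrete, a back-of-the-envelope daily bill under the request profile above (illustrative prices, not current quotes):

```python
def daily_api_cost(requests_per_day, in_tokens, out_tokens,
                   usd_per_m_in, usd_per_m_out):
    """Daily pay-per-token cost for a given request profile."""
    daily_in = requests_per_day * in_tokens
    daily_out = requests_per_day * out_tokens
    return (daily_in * usd_per_m_in + daily_out * usd_per_m_out) / 1_000_000

# 1,000 requests/day at 15K input / 300 output tokens,
# Claude-class pricing assumed at $3/M input + $15/M output
cost = daily_api_cost(1_000, 15_000, 300, 3.0, 15.0)
print(f"${cost:.2f}/day")  # 15M input + 0.3M output tokens per day
```

At that rate a single heavy user generates roughly $50/day in token charges, which a flat-rate subscription undercuts quickly.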
Self-Hosting Cost Estimates
For organizations wanting full data control, here's what self-hosting requires.
Throughput Requirements
Typical coding assistant workload:
- Input: 10,000 - 20,000 tokens per request (code context + conversation)
- Output: 100 - 500 tokens per response
- Target: Interactive response times (< 5 seconds for short responses)
Hardware Requirements by Model Size
| Model | Parameters | Active Params | Min VRAM | Recommended GPUs |
|---|---|---|---|---|
| Qwen3-Coder-8B | 8B | 8B | 16 GB | 1x RTX 4090 |
| Qwen3-Coder-32B | 32B | 32B | 64 GB | 2x RTX 4090 / 1x A100 |
| Qwen3-235B-A22B | 235B | 22B | 48 GB | 1x A100-80GB |
| Qwen3-Coder-480B-A35B | 480B | 35B | 80 GB | 1x H100 / 2x A100 |
MoE note: for the A22B/A35B models, only the active experts participate in each forward pass. The Min VRAM figures assume the inactive experts are quantized and/or offloaded to system RAM; without offloading, the full weight set must still fit in GPU memory.
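The Min VRAM column follows a rough rule of thumb: weight memory is roughly (active) parameters × bytes per parameter, with KV cache and activations needing headroom on top. A sketch of that estimate (coarse approximation, not a guarantee):

```python
def weight_vram_gb(params_b: float, bytes_per_param: float = 2.0) -> float:
    """VRAM for model weights alone: parameters (billions) * dtype size.
    KV cache and activations add more on top of this figure."""
    return params_b * bytes_per_param

# FP16 weights (2 bytes/param), using active parameter counts for MoE:
for name, active_b in [("8B", 8), ("32B", 32), ("A22B", 22), ("A35B", 35)]:
    print(f"{name}: ~{weight_vram_gb(active_b):.0f} GB weights")

# Quantization shrinks this: INT4 is ~0.5 bytes/param.
print(f"32B @ INT4: ~{weight_vram_gb(32, 0.5):.0f} GB weights")
```

This is why a 32B dense model needs roughly 64 GB at FP16 but fits a single 24 GB card when quantized to 4 bits.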
Cloud GPU Rental Costs (2025-2026 estimates)
| GPU | VRAM | On-Demand (USD/hr) | Reserved (USD/month) |
|---|---|---|---|
| RTX 4090 | 24 GB | 0.40-0.80 | 200-400 |
| A100 40GB | 40 GB | 1.50-2.50 | 800-1200 |
| A100 80GB | 80 GB | 2.50-4.00 | 1200-2000 |
| H100 | 80 GB | 3.50-5.00 | 2000-3500 |
| H200 | 141 GB | 5.00-8.00 | 3000-5000 |
Cost Comparison Example
Scenario: Development team of 5, ~10M tokens/day total
| Option | Monthly Cost (USD) | Notes |
|---|---|---|
| Claude API | ~1100 | $3/M input + $15/M output |
| OpenAI GPT-4o | ~900 | $2.50/M input + $10/M output |
| Subscription API | ~100-300 | Unlimited tier |
| Self-hosted A100 | ~1500 | + setup/maintenance |
| Local RTX 4090 (8B model) | ~50 | Electricity only, lower quality |
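The API rows can be reproduced from the per-token prices, assuming a roughly 95/5 input/output token split, which is typical for coding assistants (an assumption; the table rounds up for overhead):

```python
def monthly_api_cost(tokens_per_day_m, in_share, usd_in, usd_out, days=30):
    """Monthly pay-per-token bill for a mixed input/output workload,
    given daily volume in millions of tokens and the input share."""
    monthly_m = tokens_per_day_m * days
    return monthly_m * (in_share * usd_in + (1 - in_share) * usd_out)

# Team scenario: 10M tokens/day, ~95% of them input
claude = monthly_api_cost(10, 0.95, 3.0, 15.0)
gpt4o = monthly_api_cost(10, 0.95, 2.50, 10.0)
a100_reserved = 1500  # flat, from the rental table above

print(f"Claude: ${claude:.0f}/mo, GPT-4o: ${gpt4o:.0f}/mo, A100: ${a100_reserved}/mo")
```

Note that the self-hosted A100 cost is flat: it only wins once usage grows past the break-even volume, while subscription APIs beat both at this scale.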
Recommendations
- Small teams / Light usage: Use subscription APIs
- Medium teams / Privacy focus: Rent cloud GPUs with vLLM
- Enterprise / Air-gapped: On-premise H100/H200 cluster
- Experimentation: Local Ollama with smaller models (8B-32B)
Software Stack for Self-Hosting
```bash
# vLLM server (recommended for production)
pip install vllm
vllm serve Qwen/Qwen3-Coder-32B-Instruct --port 8000

# Or Ollama (easier setup)
ollama run qwen3:32b
```
Then point LiteLLM to your local endpoint:
```yaml
model_list:
  - model_name: claude-3-7-sonnet-20250219
    litellm_params:
      model: Qwen/Qwen3-Coder-32B-Instruct
      api_base: http://localhost:8000/v1   # vLLM
      # api_base: http://localhost:11434/v1  # Ollama
      api_key: dummy
      custom_llm_provider: openai
```
Conclusion
With OpenCode + LiteLLM, you can build a privacy-focused, cost-effective alternative to Claude Code that:
- Keeps your data within trusted infrastructure
- Works with any OpenAI-compatible model provider
- Provides full coding assistant capabilities
- Scales from local Ollama to enterprise GPU clusters
The key insight is using LiteLLM's `enable_anthropic_messages` setting to bridge OpenCode's Anthropic API format to standard OpenAI-compatible providers.
