Why This Matters
Claude Code is an excellent AI coding assistant, but there are scenarios where you need more control:
- Data Privacy: Your code and data never leave your trusted infrastructure
- Cost Control: Use subscription-based or self-hosted models instead of pay-per-token APIs
- Air-Gapped Environments: Work in secure environments without external API access
- Model Flexibility: Use any OpenAI-compatible model (Qwen, Llama, Mistral, etc.)
Architecture Overview
The Chain:
- OpenCode - an open-source Claude Code alternative with a TUI; sends requests in Anthropic Messages API format
- LiteLLM - Universal proxy that translates between API formats and routes to any provider
- Model Provider - Any OpenAI-compatible endpoint (Ollama, vLLM, OpenRouter, etc.)
Tools & Versions
| Tool | Version | URL |
|---|---|---|
| OpenCode | v1.1.39 | https://github.com/anomalyco/opencode |
| LiteLLM | latest | https://github.com/BerriAI/litellm |
| uv (Python) | latest | https://github.com/astral-sh/uv |
Setup Guide
1. Install OpenCode
```bash
curl -fsSL https://opencode.ai/install | bash
```
2. Install LiteLLM (via uv)
```bash
# No global install needed - run directly with uv
uv run --python 3.12 --with 'litellm[proxy]' litellm --version
```
3. Configuration Files
.env - Environment variables (keep private, add to .gitignore):
```bash
ANTHROPIC_API_KEY=dummy
PROVIDER_API_KEY=your-actual-api-key-here
```
litellm_config.yaml - LiteLLM proxy configuration:
```yaml
model_list:
  - model_name: claude-3-7-sonnet-latest
    litellm_params:
      model: Qwen/Qwen3-Coder-480B-A35B-Instruct
      api_base: https://your-provider-api.example.com/v1
      api_key: os.environ/PROVIDER_API_KEY
      custom_llm_provider: openai
  - model_name: claude-3-7-sonnet-20250219
    litellm_params:
      model: Qwen/Qwen3-Coder-480B-A35B-Instruct
      api_base: https://your-provider-api.example.com/v1
      api_key: os.environ/PROVIDER_API_KEY
      custom_llm_provider: openai

litellm_settings:
  set_verbose: false

general_settings:
  enable_jwt_auth: false

router_settings:
  enable_anthropic_messages: true  # Critical! Routes /v1/messages through model_list
```
opencode.json - OpenCode configuration:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "anthropic": {
      "options": {
        "baseURL": "http://localhost:4000/v1"
      }
    }
  },
  "model": "claude-3-7-sonnet-20250219",
  "tui": {
    "theme": "opencode"
  }
}
```
4. Running the Stack
Terminal 1 - Start LiteLLM Proxy:
```bash
set -a && source .env && set +a
uv run --python 3.12 --with 'litellm[proxy]' litellm --config litellm_config.yaml --port 4000
```
Terminal 2 - Start OpenCode:
```bash
set -a && source .env && set +a
opencode
```
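With both terminals up, you can sanity-check the proxy directly, before involving OpenCode at all. A minimal sketch in Python, assuming the proxy listens on localhost:4000 with no master key configured (in which case any `x-api-key` value is accepted):

```python
import json
import urllib.request

# Anthropic Messages API payload; the model name must match a
# model_name entry in litellm_config.yaml.
payload = {
    "model": "claude-3-7-sonnet-20250219",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Say hello in one word."}],
}

req = urllib.request.Request(
    "http://localhost:4000/v1/messages",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "x-api-key": "dummy",
        "anthropic-version": "2023-06-01",
    },
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.loads(resp.read())
        # A successful round trip returns an Anthropic-style response:
        # {"type": "message", "content": [{"type": "text", ...}], ...}
        print(body.get("type"), body["content"][0]["text"])
except OSError as exc:
    print("proxy not reachable:", exc)
```

If this returns a `message` response, the `/v1/messages` → provider translation is working and OpenCode will work too.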
Why LiteLLM is Necessary
You might wonder: "Why not connect OpenCode directly to the OpenAI-compatible API?"
The problem is API format incompatibility:
| Client | API Format | Endpoint |
|---|---|---|
| OpenCode (Anthropic provider) | Anthropic Messages | /v1/messages |
| OpenCode (OpenAI provider) | OpenAI Responses | /v1/responses |
| Most custom providers | OpenAI Chat | /v1/chat/completions |
OpenCode v1.1.39 uses:
- Anthropic provider → `/v1/messages` format
- OpenAI provider → `/v1/responses` format (a new API, not widely supported)

Most OpenAI-compatible providers (Ollama, vLLM, etc.) only support `/v1/chat/completions`.
LiteLLM bridges this gap by:
- Accepting Anthropic Messages API requests on `/v1/messages`
- Translating them to OpenAI Chat Completions format
- Forwarding them to your chosen provider

The key setting is `router_settings.enable_anthropic_messages: true`.
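Conceptually, what the proxy does on this path is a payload translation. A simplified sketch (not LiteLLM's actual code; it ignores tools, images, and streaming) of mapping an Anthropic Messages request to OpenAI Chat Completions:

```python
def anthropic_to_openai(req: dict) -> dict:
    """Translate an Anthropic Messages request into OpenAI Chat
    Completions format (simplified illustration)."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first chat message.
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    for m in req["messages"]:
        content = m["content"]
        # Anthropic message content may be a list of typed blocks.
        if isinstance(content, list):
            content = "".join(
                b["text"] for b in content if b.get("type") == "text"
            )
        messages.append({"role": m["role"], "content": content})
    return {
        "model": req["model"],  # rewritten by the proxy's model_list mapping
        "messages": messages,
        "max_tokens": req.get("max_tokens", 1024),
    }

anthropic_req = {
    "model": "claude-3-7-sonnet-20250219",
    "system": "You are a coding assistant.",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hi"}]}],
}
openai_req = anthropic_to_openai(anthropic_req)
```

The response travels the reverse path: the proxy wraps the provider's Chat Completions answer back into an Anthropic-style `message` object before returning it to OpenCode.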
Capabilities
OpenCode provides a full coding assistant experience:
- File reading and editing
- Code generation and refactoring
- Bug fixing and debugging
- Test writing
- Web search (native support)
Our Model Choice: Qwen
We use Qwen/Qwen3-Coder-480B-A35B-Instruct:
- 480B parameters (35B active with MoE)
- Optimized for code generation
- Excellent instruction following
- Available via various cloud providers or self-hosted
Success Story: Data Analytics
We successfully used this setup for logistics data analysis:
Task: Analyze 220K+ rows of fuel consumption data across a vehicle fleet
Results:
- Parsed complex Excel files with Cyrillic column names
- Generated Python analysis scripts
- Created visualizations and HTML reports
- Identified vehicles with abnormal fuel consumption patterns
- Compared consumption for vehicles under repair versus those in active service
The AI assistant handled the entire workflow: understanding data structure, writing pandas code, generating charts, and producing actionable insights.
Subscription-Based Model APIs
A significant advantage of this approach: unlimited usage pricing.
While major providers (OpenAI, Anthropic) charge per-token, some alternatives offer subscription models:
| Provider | Model | Pricing Model |
|---|---|---|
| Cloud providers | Qwen, Llama, etc. | Subscription tiers |
| Local Ollama | Any GGUF | Hardware cost only |
| Self-hosted vLLM | Any HF model | Infrastructure cost |
For heavy coding assistant usage—where each request consumes 10–20K input tokens plus 100–500 output tokens, and you might make thousands of requests per day—subscription models can be significantly more economical.
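To make that concrete, a back-of-the-envelope daily bill under the request profile above (illustrative prices, not current quotes):

```python
def daily_api_cost(requests_per_day, in_tokens, out_tokens,
                   usd_per_m_in, usd_per_m_out):
    """Daily pay-per-token cost for a given request profile."""
    daily_in = requests_per_day * in_tokens
    daily_out = requests_per_day * out_tokens
    return (daily_in * usd_per_m_in + daily_out * usd_per_m_out) / 1_000_000

# 1,000 requests/day at 15K input / 300 output tokens,
# Claude-class pricing assumed at $3/M input + $15/M output
cost = daily_api_cost(1_000, 15_000, 300, 3.0, 15.0)
print(f"${cost:.2f}/day")  # 15M input + 0.3M output tokens per day
```

At that rate a single heavy user generates roughly $50/day in token charges, which a flat-rate subscription undercuts quickly.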
Self-Hosting Cost Estimates
For organizations wanting full data control, here's what self-hosting requires.
Throughput Requirements
Typical coding assistant workload:
- Input: 10,000 - 20,000 tokens per request (code context + conversation)
- Output: 100 - 500 tokens per response
- Target: Interactive response times (< 5 seconds for short responses)
Hardware Requirements by Model Size
| Model | Parameters | Active Params | Min VRAM | Recommended GPUs |
|---|---|---|---|---|
| Qwen3-Coder-8B | 8B | 8B | 16 GB | 1x RTX 4090 |
| Qwen3-Coder-32B | 32B | 32B | 64 GB | 2x RTX 4090 / 1x A100 |
| Qwen3-235B-A22B | 235B | 22B | 48 GB | 1x A100-80GB |
| Qwen3-Coder-480B-A35B | 480B | 35B | 80 GB | 1x H100 / 2x A100 |
MoE note: for the A22B/A35B models, only the active experts participate in each forward pass. The Min VRAM figures assume the inactive experts are quantized and/or offloaded to system RAM; without offloading, the full weight set must still fit in GPU memory.
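The Min VRAM column follows a rough rule of thumb: weight memory is roughly (active) parameters × bytes per parameter, with KV cache and activations needing headroom on top. A sketch of that estimate (coarse approximation, not a guarantee):

```python
def weight_vram_gb(params_b: float, bytes_per_param: float = 2.0) -> float:
    """VRAM for model weights alone: parameters (billions) * dtype size.
    KV cache and activations add more on top of this figure."""
    return params_b * bytes_per_param

# FP16 weights (2 bytes/param), using active parameter counts for MoE:
for name, active_b in [("8B", 8), ("32B", 32), ("A22B", 22), ("A35B", 35)]:
    print(f"{name}: ~{weight_vram_gb(active_b):.0f} GB weights")

# Quantization shrinks this: INT4 is ~0.5 bytes/param.
print(f"32B @ INT4: ~{weight_vram_gb(32, 0.5):.0f} GB weights")
```

This is why a 32B dense model needs roughly 64 GB at FP16 but fits a single 24 GB card when quantized to 4 bits.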
Cloud GPU Rental Costs (2025-2026 estimates)
| GPU | VRAM | On-Demand (USD/hr) | Reserved (USD/month) |
|---|---|---|---|
| RTX 4090 | 24 GB | 0.40-0.80 | 200-400 |
| A100 40GB | 40 GB | 1.50-2.50 | 800-1200 |
| A100 80GB | 80 GB | 2.50-4.00 | 1200-2000 |
| H100 | 80 GB | 3.50-5.00 | 2000-3500 |
| H200 | 141 GB | 5.00-8.00 | 3000-5000 |
Cost Comparison Example
Scenario: Development team of 5, ~10M tokens/day total
| Option | Monthly Cost (USD) | Notes |
|---|---|---|
| Claude API | ~1100 | $3/M input + $15/M output |
| OpenAI GPT-4o | ~900 | $2.50/M input + $10/M output |
| Subscription API | ~100-300 | Unlimited tier |
| Self-hosted A100 | ~1500 | + setup/maintenance |
| Local RTX 4090 (8B model) | ~50 | Electricity only, lower quality |
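The API rows can be reproduced from the per-token prices, assuming a roughly 95/5 input/output token split, which is typical for coding assistants (an assumption; the table rounds up for overhead):

```python
def monthly_api_cost(tokens_per_day_m, in_share, usd_in, usd_out, days=30):
    """Monthly pay-per-token bill for a mixed input/output workload,
    given daily volume in millions of tokens and the input share."""
    monthly_m = tokens_per_day_m * days
    return monthly_m * (in_share * usd_in + (1 - in_share) * usd_out)

# Team scenario: 10M tokens/day, ~95% of them input
claude = monthly_api_cost(10, 0.95, 3.0, 15.0)
gpt4o = monthly_api_cost(10, 0.95, 2.50, 10.0)
a100_reserved = 1500  # flat, from the rental table above

print(f"Claude: ${claude:.0f}/mo, GPT-4o: ${gpt4o:.0f}/mo, A100: ${a100_reserved}/mo")
```

Note that the self-hosted A100 cost is flat: it only wins once usage grows past the break-even volume, while subscription APIs beat both at this scale.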
Recommendations
- Small teams / Light usage: Use subscription APIs
- Medium teams / Privacy focus: Rent cloud GPUs with vLLM
- Enterprise / Air-gapped: On-premise H100/H200 cluster
- Experimentation: Local Ollama with smaller models (8B-32B)
Software Stack for Self-Hosting
```bash
# vLLM server (recommended for production)
pip install vllm
vllm serve Qwen/Qwen3-Coder-32B-Instruct --port 8000

# Or Ollama (easier setup)
ollama run qwen3:32b
```
Then point LiteLLM to your local endpoint:
```yaml
model_list:
  - model_name: claude-3-7-sonnet-20250219
    litellm_params:
      model: Qwen/Qwen3-Coder-32B-Instruct
      api_base: http://localhost:8000/v1   # vLLM
      # api_base: http://localhost:11434/v1  # Ollama
      api_key: dummy
      custom_llm_provider: openai
```
Conclusion
With OpenCode + LiteLLM, you can build a privacy-focused, cost-effective alternative to Claude Code that:
- Keeps your data within trusted infrastructure
- Works with any OpenAI-compatible model provider
- Provides full coding assistant capabilities
- Scales from local Ollama to enterprise GPU clusters
The key insight is using LiteLLM's `enable_anthropic_messages` setting to bridge OpenCode's Anthropic API format to standard OpenAI-compatible providers.
