When Local Makes Sense
Local AI models make sense when one or more of these apply:
- NDA or IP restrictions — your contract or company policy prohibits sending code to external services.
- Regulated data — health records (HIPAA), financial data, government-classified material, or similar.
- Air-gapped environment — no outbound internet access by policy or infrastructure constraint.
- Cost control at scale — running millions of short classification calls locally can be cheaper than API costs, depending on your hardware.
- Latency requirements — local inference eliminates network round-trips, which matters for real-time tooling.
If none of these apply, a cloud model is almost always the better choice. Local models are meaningfully less capable than frontier models, and the gap has remained large even as local models have improved.
The Main Tools
Ollama — Developer-first CLI
Ollama is one of the easiest ways to run local models. It downloads pre-quantized models and serves them through a simple CLI. It also exposes an OpenAI-compatible HTTP API, so tools built for the OpenAI API can usually work with Ollama after a one-line base URL change.
# Install (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama pull codellama:13b
ollama run codellama:13b "Explain this function: ..."
# Start the API server manually if it is not already running as a background service (http://localhost:11434)
ollama serve
The OpenAI-compatible endpoint means you can point the official OpenAI SDK at Ollama and use local models from existing code:
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // required by the library, not used by Ollama
});
const response = await client.chat.completions.create({
  model: 'qwen2.5-coder:7b',
  messages: [{ role: 'user', content: 'Review this code...' }],
});
LM Studio — GUI with model browser
LM Studio provides a desktop GUI for browsing, downloading, and running models. It's a good choice if you want to compare models quickly or prefer not to work in a terminal. It also exposes a local HTTP server with an OpenAI-compatible API.
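Because both Ollama and LM Studio speak the OpenAI wire format, switching between them can be reduced to choosing a base URL. The sketch below is one way to do that. The helper name `clientOptionsFor` is hypothetical, and the ports are each tool's default at the time of writing, so treat them as assumptions and check your own settings.

```typescript
// Local backends covered by this sketch.
type LocalBackend = 'ollama' | 'lmstudio';

// Assumed default endpoints; both tools let you change the port.
const DEFAULT_BASE_URLS: Record<LocalBackend, string> = {
  ollama: 'http://localhost:11434/v1',
  lmstudio: 'http://localhost:1234/v1',
};

interface ClientOptions {
  baseURL: string;
  apiKey: string;
}

// Build the options object that the OpenAI SDK constructor accepts.
// `overrideBaseURL` covers non-default ports or a remote machine.
function clientOptionsFor(backend: LocalBackend, overrideBaseURL?: string): ClientOptions {
  return {
    baseURL: overrideBaseURL ?? DEFAULT_BASE_URLS[backend],
    // Placeholder: the SDK requires a value; local servers typically ignore it.
    apiKey: 'not-needed',
  };
}

console.log(clientOptionsFor('lmstudio').baseURL); // http://localhost:1234/v1
```

With the OpenAI SDK installed, `new OpenAI(clientOptionsFor('lmstudio'))` would then give a client that talks to LM Studio instead of Ollama.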
Jan — Privacy-first, offline-first
Jan is open-source and explicitly designed for fully offline use. It has a clean UI, a model hub, and works on Mac, Windows, and Linux. Good choice for teams that want a user-facing local AI tool with no cloud dependency at all.
Which Models to Run
Local model quality depends heavily on your hardware. A rule of thumb: you need roughly 6 GB of VRAM to run a 7B-parameter model comfortably, 12–16 GB for 13B, and 24+ GB for larger models. These figures assume quantized weights (commonly 4-bit); unquantized 16-bit weights need considerably more. CPU inference works but is significantly slower.
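The arithmetic behind this rule of thumb can be sketched as parameters times bytes per weight, plus some headroom for the context cache and runtime buffers. The function name and the 1.2 overhead factor below are assumptions for illustration, not measured values; real usage varies with context length and runtime.

```typescript
// Rough memory estimate for a quantized model.
// weights (GB) = parameters (billions) * bits per weight / 8
// The default overhead factor of 1.2 is an illustrative assumption.
function estimateMemoryGB(
  paramsBillions: number,
  bitsPerWeight: number,
  overheadFactor = 1.2,
): number {
  const weightsGB = paramsBillions * (bitsPerWeight / 8);
  return weightsGB * overheadFactor;
}

for (const size of [7, 13, 70]) {
  console.log(`${size}B at 4-bit: ~${estimateMemoryGB(size, 4).toFixed(1)} GB`);
}
// 7B at 4-bit: ~4.2 GB
// 13B at 4-bit: ~7.8 GB
// 70B at 4-bit: ~42.0 GB
```

The estimates come out below the rule-of-thumb figures, which is expected: the rule of thumb leaves room for longer contexts and for whatever else is using the GPU.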
For Code Tasks
- Qwen2.5-Coder 7B/14B — strong coding capability at reasonable size. Good balance for most development tasks on consumer hardware.
- DeepSeek Coder V2 Lite — specialised for code, handles fill-in-the-middle well for autocomplete-style use.
- CodeLlama 13B — solid general code model, widely available via Ollama.
For General Reasoning and Chat
- Llama 3.3 70B — one of Meta's strongest open models. Requires significant VRAM; a Q4 quantization needs roughly 40 GB of memory. Best local quality for general tasks.
- Mistral 7B / Nemo — fast, efficient, good instruction following. Better than expected for their size.
- Gemma 2 9B — Google's open model, strong instruction following, good for constrained devices.
Even the best local models lag behind Claude Sonnet on complex multi-step reasoning, long-context tasks, and subtle code review. For tasks where quality matters most, local models are a tradeoff, not a replacement.
Connecting VS Code to a Local Model
Most AI coding extensions that support custom API endpoints can point at an Ollama server. The pattern is the same across tools: set the base URL to http://localhost:11434/v1 and the model to whatever you're running.
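Before pointing an editor at the server, it can help to confirm that the model name you plan to configure has actually been pulled. The sketch below uses Ollama's native `GET /api/tags` endpoint, which lists local models. The helper names `hasModel` and `checkModel` are hypothetical, and only the `name` field of the response is modelled here.

```typescript
// The part of Ollama's GET /api/tags response that this sketch relies on.
interface TagsResponse {
  models: { name: string }[];
}

// Pure helper: is the requested model among the locally pulled ones?
function hasModel(tags: TagsResponse, model: string): boolean {
  // A bare name such as "codellama" refers to the ":latest" tag.
  const wanted = model.includes(':') ? model : `${model}:latest`;
  return tags.models.some((m) => m.name === wanted);
}

// Ask a running server. Note the root URL has no /v1 suffix:
// /api/tags belongs to Ollama's native API, not the OpenAI-compatible one.
async function checkModel(model: string, ollamaRoot = 'http://localhost:11434'): Promise<boolean> {
  const res = await fetch(`${ollamaRoot}/api/tags`);
  if (!res.ok) throw new Error(`Ollama responded with HTTP ${res.status}`);
  return hasModel((await res.json()) as TagsResponse, model);
}

// Hand-written sample response, so this part runs without a server.
const sample: TagsResponse = { models: [{ name: 'qwen2.5-coder:7b' }] };
console.log(hasModel(sample, 'qwen2.5-coder:7b')); // true
console.log(hasModel(sample, 'codellama:13b')); // false
```

If `checkModel` returns false, running `ollama pull` for that model name is the likely fix.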
With Continue
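Continue is an open-source VS Code extension with first-class support for local models. Edit ~/.continue/config.json (recent Continue releases have moved to a config.yaml file, so check the current documentation for the equivalent layout):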
{
  "models": [
    {
      "title": "Qwen2.5 Coder (local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder",
    "provider": "ollama",
    "model": "deepseek-coder:6.7b-base"
  }
}
With Cursor
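Cursor supports custom OpenAI-compatible endpoints under Settings → Models → Add model. Set the base URL to your Ollama server and the model name to whatever you've pulled. Be aware that Cursor may send these requests from its own backend rather than from your machine, in which case a localhost URL is not reachable and your prompts still pass through Cursor's servers; confirm the current behaviour before using it with restricted code. Cursor's full context-aware features work best with its own models, but local models work for basic chat and inline edits.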
Handling Slow Inference
Local inference usually generates tokens more slowly than a hosted API, especially on CPU, even though it avoids the network round-trip. A few approaches for managing this:
- Use smaller models for interactive tasks — a 7B model responds in seconds; a 70B model on CPU can take minutes. Match model size to latency tolerance.
- Use GPU acceleration if available — even a consumer GPU dramatically speeds up inference. Ollama detects a supported GPU automatically and offloads as many model layers to it as fit in VRAM.
- Batch non-interactive tasks — if you're analyzing a batch of files, queue them rather than waiting for each response before starting the next.
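- Keep the model loaded — Ollama and LM Studio keep models in memory between calls. The first call is slow because the model has to be loaded; subsequent calls with the same model are faster. Ollama unloads an idle model after about five minutes by default (the keep_alive setting controls this), so send a short warm-up request before an interactive session.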
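Two of these points lend themselves to a short sketch: asking Ollama to keep a model in memory for longer, and working through a batch with a bounded number of requests in flight. The code below assumes Ollama's native `POST /api/generate` endpoint, whose request body accepts a `keep_alive` duration. The helper names, the 30-minute value, and the concurrency limit of 2 are illustrative choices, not recommendations from Ollama.

```typescript
// Request body for Ollama's native POST /api/generate endpoint.
interface GenerateRequest {
  model: string;
  prompt: string;
  stream: boolean;
  keep_alive: string; // how long the model stays loaded after this call
}

function buildGenerateRequest(model: string, prompt: string, keepAlive = '30m'): GenerateRequest {
  return { model, prompt, stream: false, keep_alive: keepAlive };
}

// Run `worker` over `items` with at most `limit` calls in flight,
// returning results in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no two runners share an index
      results[index] = await worker(items[index]);
    }
  }
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Hypothetical worker: send one file's source to a local Ollama server.
async function reviewFile(source: string): Promise<string> {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildGenerateRequest('qwen2.5-coder:7b', `Review this code:\n${source}`)),
  });
  if (!res.ok) throw new Error(`Ollama responded with HTTP ${res.status}`);
  const data = (await res.json()) as { response: string };
  return data.response;
}
```

A call such as `mapWithConcurrency(fileContents, 2, reviewFile)` would then process the batch two files at a time. A small limit keeps the server busy without piling up requests it can only serve one after another.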
Private Cloud as a Middle Option
If your constraint is "no third-party SaaS" but you have cloud infrastructure, private deployment is another path:
- Amazon Bedrock or Azure AI — managed frontier models (Claude on Bedrock, for example) that you can reach over private endpoints from your VPC or virtual network, with data staying in your cloud account.
- Self-hosted open models on EC2 / GCP — run Ollama or vLLM on a GPU instance. You own the infrastructure.
- Anthropic's enterprise tier — includes data processing agreements and controls for teams with compliance requirements.
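The managed options give you frontier model quality with infrastructure-level control; self-hosted open models give you the control without the frontier quality. Either way, expect to pay more than for direct API calls, but often less in total than buying and running local hardware at scale.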
Local Model Decision Guide
- NDA or data restrictions → start with Ollama + Qwen2.5-Coder for code tasks.
- Air-gapped → pull models while connected, then disconnect. Jan works fully offline.
- Cost control at scale → benchmark local inference vs API cost for your specific volume before committing to hardware.
- Quality-critical tasks → consider private cloud (Bedrock/Azure AI) rather than local, to keep frontier model quality.
- No data restriction → cloud API is almost always the better developer experience. Don't optimise for local without a real constraint.
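For the cost-control case, the comparison comes down to amortised hardware plus running costs against per-token API pricing. The sketch below shows that arithmetic. Every field name and every number in it is a placeholder, so substitute your own hardware quote, power cost, API price, and measured volume.

```typescript
interface CostInputs {
  hardwareCostUsd: number; // one-off purchase price
  hardwareLifetimeMonths: number; // period over which to amortise it
  powerCostUsdPerMonth: number; // electricity and hosting
  apiPriceUsdPerMTokens: number; // blended API price per million tokens
  tokensPerMonth: number; // expected monthly volume
}

// Amortised hardware plus running costs.
function monthlyLocalCost(c: CostInputs): number {
  return c.hardwareCostUsd / c.hardwareLifetimeMonths + c.powerCostUsdPerMonth;
}

// Pay-per-token cost for the same volume.
function monthlyApiCost(c: CostInputs): number {
  return (c.tokensPerMonth / 1e6) * c.apiPriceUsdPerMTokens;
}

// Monthly token volume at which the two options cost the same.
function breakEvenTokensPerMonth(c: CostInputs): number {
  return (monthlyLocalCost(c) / c.apiPriceUsdPerMTokens) * 1e6;
}

// Placeholder numbers for illustration only.
const example: CostInputs = {
  hardwareCostUsd: 2400,
  hardwareLifetimeMonths: 24,
  powerCostUsdPerMonth: 20,
  apiPriceUsdPerMTokens: 1,
  tokensPerMonth: 50e6,
};

console.log(monthlyLocalCost(example)); // 120
console.log(monthlyApiCost(example)); // 50
console.log(breakEvenTokensPerMonth(example)); // 120000000
```

With these placeholder inputs the API is cheaper until volume passes 120 million tokens a month. The sketch leaves out setup and maintenance time, and it does not check whether the hardware can actually sustain the required throughput, so treat the break-even point as a lower bound.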