09. Local Models with Ollama
Why Run Models Locally?
Three reasons to run models on your own machine instead of calling a cloud API:
- Privacy: Data never leaves your machine. No terms of service, no data retention policies, no risk of your code or documents being used for training.
- Cost: Zero per-token cost. Once you download the model, inference is free forever.
- Offline access: Works without internet. On a plane, in a coffee shop with bad wifi, wherever.
The tradeoff: local models are smaller and less capable than frontier cloud models. A 7B model on your laptop won't match GPT-4o quality. But for many tasks, it doesn't need to.
Getting Started with Ollama
Ollama is the easiest way to run models locally. It handles downloading, quantization, and serving models with a single command.
Install
# macOS
brew install ollama
# Linux (or download the desktop app from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh
Start the server
ollama serve
Note: ollama serve runs in the foreground and occupies your terminal. To avoid that, run it in the background with ollama serve &. On macOS, if you installed Ollama via the desktop app or Homebrew, the server already runs as a background service; you can skip this step and use ollama run or hit http://localhost:11434 directly.
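To confirm the server is reachable, hit the root endpoint; it responds with a short plain-text status. A minimal check in Python, assuming the default port:
# pip install requests
import requests

# The root endpoint returns "Ollama is running" when the server is up
print(requests.get("http://localhost:11434").text)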
Pull and run a model
# Download and start chatting
ollama run llama3.1:8b
# Other good options
ollama run qwen2.5:7b
ollama run deepseek-r1:14b
ollama run codellama:7b
The first run downloads the model (a few GB). After that, it starts in seconds.
Use the API
Ollama exposes an OpenAI-compatible API on localhost:11434:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [
{"role": "user", "content": "Write a hello world in Go"}
]
}'
Because it's OpenAI-compatible, any tool that works with the OpenAI API can point to Ollama instead. Just change the base URL to localhost.
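For example, the official openai Python client works unchanged; only the base URL changes, plus a placeholder API key (required by the client library, ignored by Ollama). A minimal sketch:
# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at Ollama's local server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Write a hello world in Go"}],
)
print(response.choices[0].message.content)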
Which Model to Run
Your choice depends on your hardware:
| RAM | Best model | Quality |
|---|---|---|
| 8 GB | Llama 3.2 3B, Qwen 2.5 3B | Basic tasks, simple chat |
| 16 GB | Llama 3.1 8B, Qwen 2.5 7B | Good for code, general tasks |
| 32 GB | Qwen 2.5 14B, DeepSeek-R1 14B | Strong reasoning, complex code |
| 64 GB+ | Llama 3.1 70B (Q4), Qwen 2.5 32B | Near-frontier quality |
The rule of thumb: the model file size should be less than 75% of your available RAM. If a model is 8 GB, you want at least 12 GB of free RAM to run it comfortably.
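The guideline is easy to check with quick arithmetic (a back-of-the-envelope sketch; the 75% headroom is the rule of thumb above, not a hard limit):
def min_ram_gb(model_file_gb, headroom=0.75):
    # Smallest RAM where the model file stays under `headroom` of it
    return model_file_gb / headroom

print(round(min_ram_gb(8), 1))  # 10.7, so 12 GB free gives a comfortable margin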
Best Models by Task
| Task | Recommended | Why |
|---|---|---|
| Code generation | Qwen 2.5 Coder 7B | Specialized for code, fast |
| General chat | Llama 3.1 8B | Well-rounded, good instruction following |
| Reasoning | DeepSeek-R1 14B | Shows thinking, strong at logic |
| Quick extraction | Qwen 2.5 3B | Fast, small, handles simple tasks well |
| Creative writing | Llama 3.1 8B | Better at natural language than Qwen |
Practical Use Cases
Private code review
ollama run qwen2.5-coder:7b
>>> Review this function for bugs:
>>> func divide(a, b int) int { return a / b }
Your code never leaves your machine. Good for proprietary codebases where you can't send code to a third-party API.
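For reviewing whole files rather than pasted snippets, the same local API works from a script. A minimal sketch (main.go is a stand-in for whatever file you want reviewed):
# pip install requests
import requests

# Hypothetical input file; substitute your own source file
code = open("main.go").read()

response = requests.post("http://localhost:11434/v1/chat/completions", json={
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": f"Review this code for bugs:\n\n{code}"}],
})
print(response.json()["choices"][0]["message"]["content"])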
Offline development
Working on a plane or somewhere without internet? Ollama works completely offline once the model is downloaded. No API calls, no latency, no connectivity issues.
Bulk processing without cost
Need to classify 10,000 documents? With a cloud API that might cost $50 to $200. Locally, it costs nothing but time and electricity.
# Process a file line by line (jq builds the JSON so quotes in the
# ticket text can't break the request body)
while IFS= read -r line; do
  jq -n --arg line "$line" \
    '{model: "qwen2.5:7b", messages: [{role: "user", content: ("Classify: " + $line)}]}' |
  curl -s http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" -d @-
done < tickets.txt
Local RAG
Combine Ollama with an embedding model for local retrieval-augmented generation. Everything stays on your machine — no cloud dependencies, no data leaving your network.
# Pull the models you need
ollama pull nomic-embed-text # embedding model
ollama pull llama3.1:8b       # chat model
# pip install chromadb requests
import chromadb
import requests
# 1. Set up local vector DB
client = chromadb.PersistentClient(path="./my_docs_db")
collection = client.get_or_create_collection("docs")
# 2. Embed and store your documents
def embed(text):
res = requests.post("http://localhost:11434/api/embed", json={
"model": "nomic-embed-text",
"input": text
})
return res.json()["embeddings"][0]
# Add your documents (do this once)
docs = [
"Our refund policy allows returns within 30 days of purchase.",
"Shipping takes 3-5 business days for domestic orders.",
"Premium members get free next-day shipping on all orders.",
]
for i, doc in enumerate(docs):
collection.add(ids=[str(i)], embeddings=[embed(doc)], documents=[doc])
# 3. Query: find relevant docs, then ask the chat model
question = "How long does shipping take?"
results = collection.query(query_embeddings=[embed(question)], n_results=2)
context = "\n".join(results["documents"][0])
response = requests.post("http://localhost:11434/v1/chat/completions", json={
"model": "llama3.1:8b",
"messages": [
{"role": "system", "content": f"Answer using only this context:\n{context}"},
{"role": "user", "content": question}
]
})
print(response.json()["choices"][0]["message"]["content"])
How it works: nomic-embed-text converts text into 768-dimension vectors locally. ChromaDB stores and searches those vectors on disk (no server needed). When you ask a question, it finds the most relevant chunks and feeds them to llama3.1:8b as context.
Performance Tips
GPU acceleration
If you have a Mac with Apple Silicon (M1/M2/M3/M4), Ollama uses the GPU automatically. This makes inference significantly faster than CPU-only, often 3 to 5x faster for generation.
On Linux/Windows with an NVIDIA GPU, Ollama uses CUDA if available.
Context length
Local models default to shorter context windows to save memory. You can raise it for a session from the interactive prompt:
ollama run llama3.1:8b
>>> /set parameter num_ctx 8192
More context means more RAM usage. Don't set this higher than you actually need for your task.
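You can also set the context window per request through the native API's options field (the OpenAI-compatible endpoint doesn't expose Ollama-specific options). A sketch using /api/chat:
import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Summarize the following..."}],
    "options": {"num_ctx": 8192},  # context window for this request
    "stream": False,               # return one JSON object instead of a stream
})
print(response.json()["message"]["content"])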
Multiple models
You can have multiple models downloaded and switch between them:
ollama list # See what's downloaded
ollama run qwen2.5:7b # Switch to Qwen
ollama run llama3.1:8b # Switch to Llama
Ollama keeps recently used models in memory for fast switching.
Tools That Connect to Local Models
Ollama gives you the model, but you'll probably want a better interface than the terminal. These tools connect to Ollama's local API:
- Open WebUI: A full chat interface (like ChatGPT) that runs in your browser locally. Supports multiple models, conversation history, RAG, and tool use.
- LM Studio: Desktop app with a nice GUI for running and chatting with local models. Also exposes an OpenAI-compatible API.
- Continue: Open-source coding assistant extension for VS Code and JetBrains that provides autocomplete and chat powered by your local model. A strong choice for local coding assistance.
- Jan: Open-source desktop app similar to LM Studio. Clean interface, runs everything locally.
Since Ollama's API is OpenAI-compatible, any agent that supports custom API endpoints (Kiro, Cursor, etc.) can point to localhost:11434 and use your local model instead of a cloud API.
When Local Isn't Enough
Local models have limits. Switch to a cloud API when:
- The task requires frontier-level quality (complex reasoning, nuanced writing)
- You need a context window larger than what your RAM supports
- Speed matters and your hardware is slow
- You need multimodal capabilities (most local models are text-only)
The best setup is hybrid: local for private, bulk, and simple tasks. Cloud for complex and urgent tasks.
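Because both sides speak the same API, the switch can live in one place in your code. A minimal sketch of the hybrid pattern, assuming the openai client; the routing rule is a placeholder, so substitute whatever heuristic fits your work:
# pip install openai
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt, sensitive=False, hard=False):
    # Placeholder rule: private data stays local; hard, non-sensitive
    # problems go to the frontier model
    if hard and not sensitive:
        client, model = cloud, "gpt-4o"
    else:
        client, model = local, "llama3.1:8b"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content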
Key Takeaways
- Ollama makes running local models straightforward. One command to install, one to run.
- Local models are free, private, and work offline. The tradeoff is lower capability than frontier models.
- 16 GB RAM runs a 7 to 8B model comfortably. 32 GB gets you 14B models with strong quality.
- Qwen 2.5 7B and Llama 3.1 8B are the best general-purpose local models right now.
- Use local for private code, bulk processing, and offline work.
- Use cloud when you need frontier quality, massive context, or speed on slow hardware.
- The Ollama API is OpenAI-compatible, so any tool that works with OpenAI can use local models by changing the base URL.