09. Local Models with Ollama
Why Run Models Locally?
Three reasons to run models on your own machine instead of calling a cloud API:
- Privacy: Data never leaves your machine. No terms of service, no data retention policies, no risk of your code or documents being used for training.
- Cost: Zero per-token cost. Once you download the model, inference is free forever.
- Offline access: Works without internet. On a plane, in a coffee shop with bad wifi, wherever.
The tradeoff: local models are smaller and less capable than frontier cloud models. A 7B model on your laptop won't match GPT-4o quality. But for many tasks, it doesn't need to.
Getting Started with Ollama
Ollama is the easiest way to run models locally. It handles downloading, quantization, and serving models with a single command.
Install
# macOS
brew install ollama
# Linux (or download the desktop app from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh
Start the server
ollama serve
Note: ollama serve runs in the foreground and occupies your terminal. To avoid that, run it in the background with ollama serve &. On macOS, if you installed Ollama via the desktop app or Homebrew, the server already runs as a background service; you can skip this step and use ollama run or hit http://localhost:11434 directly.
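To confirm the server is reachable, hit the root endpoint; it responds with a short plain-text status. A minimal check in Python, assuming the default port:
# pip install requests
import requests

# The root endpoint returns "Ollama is running" when the server is up
print(requests.get("http://localhost:11434").text)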
Pull and run a model
# Download and start chatting
ollama run llama3.1:8b
# Other good options
ollama run qwen2.5:7b
ollama run deepseek-r1:14b
ollama run codellama:7b
The first run downloads the model (a few GB). After that, it starts in seconds.
Use the API
Ollama exposes an OpenAI-compatible API on localhost:11434:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [
{"role": "user", "content": "Write a hello world in Go"}
]
}'
Because it's OpenAI-compatible, any tool that works with the OpenAI API can point to Ollama instead. Just change the base URL to localhost.
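For example, the official openai Python client works unchanged; only the base URL changes, plus a placeholder API key (required by the client library, ignored by Ollama). A minimal sketch:
# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at Ollama's local server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Write a hello world in Go"}],
)
print(response.choices[0].message.content)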
Which Model to Run
Your choice depends on your hardware:
| RAM | Best model | Quality |
|---|---|---|
| 8 GB | Llama 3.2 3B, Qwen 2.5 3B | Basic tasks, simple chat |
| 16 GB | Llama 3.1 8B, Qwen 2.5 7B | Good for code, general tasks |
| 32 GB | Qwen 2.5 14B, DeepSeek-R1 14B | Strong reasoning, complex code |
| 64 GB+ | Llama 3.1 70B (Q4), Qwen 2.5 32B | Near-frontier quality |
The rule of thumb: the model file size should be less than 75% of your available RAM. If a model is 8 GB, you want at least 12 GB of free RAM to run it comfortably.
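The guideline is easy to check with quick arithmetic (a back-of-the-envelope sketch; the 75% headroom is the rule of thumb above, not a hard limit):
def min_ram_gb(model_file_gb, headroom=0.75):
    # Smallest RAM where the model file stays under `headroom` of it
    return model_file_gb / headroom

print(round(min_ram_gb(8), 1))  # 10.7, so 12 GB free gives a comfortable margin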
Best Models by Task
| Task | Recommended | Why |
|---|---|---|
| Code generation | Qwen 2.5 Coder 7B | Specialized for code, fast |
| General chat | Llama 3.1 8B | Well-rounded, good instruction following |
| Reasoning | DeepSeek-R1 14B | Shows thinking, strong at logic |
| Quick extraction | Qwen 2.5 3B | Fast, small, handles simple tasks well |
| Creative writing | Llama 3.1 8B | Better at natural language than Qwen |
Practical Use Cases
Private code review
ollama run qwen2.5-coder:7b
>>> Review this function for bugs:
>>> func divide(a, b int) int { return a / b }
Your code never leaves your machine. Good for proprietary codebases where you can't send code to a third-party API.
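For reviewing whole files rather than pasted snippets, the same local API works from a script. A minimal sketch (main.go is a stand-in for whatever file you want reviewed):
# pip install requests
import requests

# Hypothetical input file; substitute your own source file
code = open("main.go").read()

response = requests.post("http://localhost:11434/v1/chat/completions", json={
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": f"Review this code for bugs:\n\n{code}"}],
})
print(response.json()["choices"][0]["message"]["content"])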
Offline development
Working on a plane or somewhere without internet? Ollama works completely offline once the model is downloaded. No API calls, no latency, no connectivity issues.
Bulk processing without cost
Need to classify 10,000 documents? With a cloud API that might cost $50 to $200. Locally, it costs nothing but time and electricity.
# Process a file line by line (jq builds the JSON so quotes in the
# ticket text can't break the request body)
while IFS= read -r line; do
  jq -n --arg line "$line" \
    '{model: "qwen2.5:7b", messages: [{role: "user", content: ("Classify: " + $line)}]}' |
  curl -s http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" -d @-
done < tickets.txt
Local RAG
Combine Ollama with an embedding model for local retrieval-augmented generation. Everything stays on your machine — no cloud dependencies, no data leaving your network.
# Pull the models you need
ollama pull nomic-embed-text # embedding model
ollama pull llama3.1:8b       # chat model
# pip install chromadb requests
import chromadb
import requests
# 1. Set up local vector DB
client = chromadb.PersistentClient(path="./my_docs_db")
collection = client.get_or_create_collection("docs")
# 2. Embed and store your documents
def embed(text):
res = requests.post("http://localhost:11434/api/embed", json={
"model": "nomic-embed-text",
"input": text
})
return res.json()["embeddings"][0]
# Add your documents (do this once)
docs = [
"Our refund policy allows returns within 30 days of purchase.",
"Shipping takes 3-5 business days for domestic orders.",
"Premium members get free next-day shipping on all orders.",
]
for i, doc in enumerate(docs):
collection.add(ids=[str(i)], embeddings=[embed(doc)], documents=[doc])
# 3. Query: find relevant docs, then ask the chat model
question = "How long does shipping take?"
results = collection.query(query_embeddings=[embed(question)], n_results=2)
context = "\n".join(results["documents"][0])
response = requests.post("http://localhost:11434/v1/chat/completions", json={
"model": "llama3.1:8b",
"messages": [
{"role": "system", "content": f"Answer using only this context:\n{context}"},
{"role": "user", "content": question}
]
})
print(response.json()["choices"][0]["message"]["content"])
How it works: nomic-embed-text converts text into 768-dimension vectors locally. ChromaDB stores and searches those vectors on disk (no server needed). When you ask a question, it finds the most relevant chunks and feeds them to llama3.1:8b as context.
Performance Tips
GPU acceleration
If you have a Mac with Apple Silicon (M1/M2/M3/M4), Ollama uses the GPU automatically. This makes inference significantly faster than CPU-only, often 3 to 5x faster for generation.
On Linux/Windows with an NVIDIA GPU, Ollama uses CUDA if available.
Context length
Local models default to shorter context windows to save memory. You can raise it for a session from the interactive prompt:
ollama run llama3.1:8b
>>> /set parameter num_ctx 8192
More context means more RAM usage. Don't set this higher than you actually need for your task.
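You can also set the context window per request through the native API's options field (the OpenAI-compatible endpoint doesn't expose Ollama-specific options). A sketch using /api/chat:
import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Summarize the following..."}],
    "options": {"num_ctx": 8192},  # context window for this request
    "stream": False,               # return one JSON object instead of a stream
})
print(response.json()["message"]["content"])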
Multiple models
You can have multiple models downloaded and switch between them:
ollama list # See what's downloaded
ollama run qwen2.5:7b # Switch to Qwen
ollama run llama3.1:8b # Switch to Llama
Ollama keeps recently used models in memory for fast switching.
Tools That Connect to Local Models
Ollama gives you the model, but you'll probably want a better interface than the terminal. These tools connect to Ollama's local API:
- Open WebUI: A full chat interface (like ChatGPT) that runs in your browser locally. Supports multiple models, conversation history, RAG, and tool use.
- LM Studio: Desktop app with a nice GUI for running and chatting with local models. Also exposes an OpenAI-compatible API.
- Continue: Open-source coding assistant extension for VS Code and JetBrains that provides autocomplete and chat powered by your local model. A strong choice for local coding assistance.
- Jan: Open-source desktop app similar to LM Studio. Clean interface, runs everything locally.
Since Ollama's API is OpenAI-compatible, any agent that supports custom API endpoints (Kiro, Cursor, etc.) can point to localhost:11434 and use your local model instead of a cloud API.
When Local Isn't Enough
Local models have limits. Switch to a cloud API when:
- The task requires frontier-level quality (complex reasoning, nuanced writing)
- You need a context window larger than what your RAM supports
- Speed matters and your hardware is slow
- You need multimodal capabilities (most local models are text-only)
The best setup is hybrid: local for private, bulk, and simple tasks. Cloud for complex and urgent tasks.
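Because both sides speak the same API, the switch can live in one place in your code. A minimal sketch of the hybrid pattern, assuming the openai client; the routing rule is a placeholder, so substitute whatever heuristic fits your work:
# pip install openai
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt, sensitive=False, hard=False):
    # Placeholder rule: private data stays local; hard, non-sensitive
    # problems go to the frontier model
    if hard and not sensitive:
        client, model = cloud, "gpt-4o"
    else:
        client, model = local, "llama3.1:8b"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content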
Key Takeaways
- Ollama makes running local models straightforward. One command to install, one to run.
- Local models are free, private, and work offline. The tradeoff is lower capability than frontier models.
- 16 GB RAM runs a 7 to 8B model comfortably. 32 GB gets you 14B models with strong quality.
- Qwen 2.5 7B and Llama 3.1 8B are the best general-purpose local models right now.
- Use local for private code, bulk processing, and offline work.
- Use cloud when you need frontier quality, massive context, or speed on slow hardware.
- The Ollama API is OpenAI-compatible, so any tool that works with OpenAI can use local models by changing the base URL.