ollama
Model Management
```shell
# List all available models
ollama list

# See currently loaded models, context size, processor, memory usage
ollama ps

# Pull a model from the registry
ollama pull qwen3:30b-a3b

# Remove a model
ollama rm qwen3:30b-a3b

# Create a model from a Modelfile
ollama create my-model:tag -f /path/to/Modelfile

# Show model metadata (architecture, quantization, parameters)
ollama show qwen3:30b-a3b
ollama show qwen3:30b-a3b | grep quantization
```

Testing a Model Directly
Bypass any client (opencode, etc.) to isolate whether slowness is a model or configuration issue.
The response JSON includes eval_count, eval_duration, prompt_eval_count, prompt_eval_duration, and load_duration fields, from which token throughput (tokens/sec) can be computed.
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:30b-a3b",
  "prompt": "say hello",
  "stream": false,
  "options": { "num_ctx": 32768 }
}' | python3 -m json.tool

# Quick performance summary
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:30b-a3b",
  "prompt": "say hello",
  "stream": false,
  "options": { "num_ctx": 32768 }
}' | python3 -c "
import sys, json
d = json.load(sys.stdin)
gen = d['eval_count'] / (d['eval_duration'] / 1e9)
prompt = d['prompt_eval_count'] / (d['prompt_eval_duration'] / 1e9)
load = d['load_duration'] / 1e9
print(f'load: {load:.2f}s')
print(f'prompt: {prompt:.1f} tok/s ({d[\"prompt_eval_count\"]} tokens)')
print(f'gen: {gen:.1f} tok/s ({d[\"eval_count\"]} tokens)')
print(f'thinking: {bool(d.get(\"thinking\"))}')
"
```

Debugging Performance
```shell
# Stream verbose server logs (shows model load events, context alloc, errors)
tail -f ~/.ollama/logs/server.log

# Enable debug logging (restart ollama serve with this env var)
OLLAMA_DEBUG=1 ollama serve

# Monitor GPU utilization and thermal state on Apple Silicon
sudo powermetrics --samplers gpu_power,thermal -i 1000

# Check memory pressure and swap
vm_stat | grep -E "Pages (free|active|inactive|wired|swapped)"
```

Key Environment Variables
| Variable | Purpose | Example |
|---|---|---|
| OLLAMA_NUM_CTX | Default context window size for all models | 32768 |
| OLLAMA_KEEP_ALIVE | How long to keep a model loaded after the last request | 10m, 0 (unload immediately), -1 (never unload) |
| OLLAMA_DEBUG | Enable verbose server logging | 1 |
| OLLAMA_MODELS | Override the model storage directory | /path/to/models |
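The keep-alive behavior can also be overridden per request via the API's keep_alive field, which takes the same values as the environment variable. A minimal sketch, reusing the qwen3 model from the earlier examples (requires a running ollama server):

```shell
# Keep this model loaded indefinitely after the request completes,
# regardless of the server-wide OLLAMA_KEEP_ALIVE setting
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:30b-a3b",
  "prompt": "warm up",
  "stream": false,
  "keep_alive": -1
}'
```

This is useful for pre-warming a model so the first real request does not pay the load cost.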
Diagnosing Context Size Mismatches
If a model takes unexpectedly long on the first request, it is likely being reloaded because the requested num_ctx differs from what ollama last loaded it with. Ollama unloads and reloads the model whenever the context window size changes.
```shell
# Check what context size a model is currently loaded with
ollama ps   # CONTEXT column shows the current value

# Force a specific context for a one-off test
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:30b-a3b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_ctx": 32768 }
}'
```

The correct fix is to bake num_ctx into the Modelfile so the context size is consistent across all clients.
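As a sketch, deriving a model with a fixed context window looks like this (the qwen3:30b-a3b-32k name is an arbitrary choice for illustration):

```shell
# Modelfile: inherit the base model, pin the context window
cat > Modelfile <<'EOF'
FROM qwen3:30b-a3b
PARAMETER num_ctx 32768
EOF

# Build the derived model; it shares weights with the base, so this is cheap
ollama create qwen3:30b-a3b-32k -f Modelfile
```

Clients that request this model by name then all get the same 32768-token context, with no per-request num_ctx needed.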
OpenCode Integration Notes
When using ollama as a provider in opencode, note that limit.context in the opencode model configuration is informational only — it controls the context-remaining display in the UI but is not sent to ollama as num_ctx. The model will use whatever num_ctx is set in its Modelfile (or ollama’s default) regardless of what opencode reports.
To ensure consistent behavior:

- Set num_ctx in the Modelfile for any model used with opencode.
- Match limit.context in opencode's config to that value so the remaining-context indicator is accurate.
- Use ollama ps to confirm the loaded context size after the first request.
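Putting those steps together, a quick verification pass might look like this (a sketch, assuming num_ctx was baked into the model as described above and that an ollama server is running):

```shell
# 1. Confirm num_ctx appears in the model's parameters
ollama show qwen3:30b-a3b | grep -i num_ctx

# 2. Trigger a load with a throwaway request, then check the CONTEXT column
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen3:30b-a3b", "prompt": "hi", "stream": false}' > /dev/null
ollama ps
```

If the CONTEXT column matches the Modelfile value, opencode's limit.context can be set to the same number with confidence.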