ollama

Model Management

```shell
# List all available models
ollama list

# See currently loaded models, context size, processor, memory usage
ollama ps

# Pull a model from the registry
ollama pull qwen3:30b-a3b

# Remove a model
ollama rm qwen3:30b-a3b

# Create a model from a Modelfile
ollama create my-model:tag -f /path/to/Modelfile

# Show model metadata (architecture, quantization, parameters)
ollama show qwen3:30b-a3b
ollama show qwen3:30b-a3b | grep quantization
```

Testing a Model Directly

Bypass any client (opencode, etc.) to isolate whether slowness is a model or a configuration issue. The response JSON includes `eval_count`, `eval_duration`, `prompt_eval_count`, `prompt_eval_duration`, and `load_duration` (durations in nanoseconds), from which tokens/sec can be computed.

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:30b-a3b",
  "prompt": "say hello",
  "stream": false,
  "options": { "num_ctx": 32768 }
}' | python3 -m json.tool

# Quick performance summary
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:30b-a3b",
  "prompt": "say hello",
  "stream": false,
  "options": { "num_ctx": 32768 }
}' | python3 -c "
import sys, json
d = json.load(sys.stdin)
gen = d['eval_count'] / (d['eval_duration'] / 1e9)
prompt = d['prompt_eval_count'] / (d['prompt_eval_duration'] / 1e9)
load = d['load_duration'] / 1e9
print(f'load:   {load:.2f}s')
print(f'prompt: {prompt:.1f} tok/s ({d[\"prompt_eval_count\"]} tokens)')
print(f'gen:    {gen:.1f} tok/s ({d[\"eval_count\"]} tokens)')
print(f'thinking: {bool(d.get(\"thinking\"))}')
"
```

Debugging Performance

```shell
# Stream verbose server logs (shows model load events, context alloc, errors)
tail -f ~/.ollama/logs/server.log

# Enable debug logging (restart ollama serve with this env var)
OLLAMA_DEBUG=1 ollama serve

# Monitor GPU utilization and thermal state on Apple Silicon
sudo powermetrics --samplers gpu_power,thermal -i 1000

# Check memory pressure and swap
vm_stat | grep -E "Pages (free|active|inactive|wired|swapped)"
```
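vm_stat reports counts in 16 KiB pages on Apple Silicon (4 KiB on Intel Macs), which makes the raw numbers hard to read. A small helper, sketched here with an assumed 16384-byte page size, converts them to gigabytes:

```python
import re

PAGE_SIZE = 16384  # bytes per page on Apple Silicon; 4096 on Intel Macs

def parse_vm_stat(output: str) -> dict:
    """Parse `vm_stat`-style lines into {metric: gigabytes}."""
    stats = {}
    for line in output.splitlines():
        # Lines look like "Pages free:            102400."
        m = re.match(r"^(Pages [a-z ]+):\s+(\d+)\.$", line.strip())
        if m:
            name, pages = m.group(1), int(m.group(2))
            stats[name] = pages * PAGE_SIZE / 1e9
    return stats

sample = """\
Pages free:                               102400.
Pages active:                             512000.
Pages wired down:                         256000.
"""
print(parse_vm_stat(sample))
```

Pipe the real output through it with `vm_stat | python3 this_script.py` (reading from stdin instead of `sample`) if the page counts get large enough to be unreadable.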

Key Environment Variables

| Variable | Purpose | Example |
| --- | --- | --- |
| `OLLAMA_CONTEXT_LENGTH` | Default context window size for all models | `32768` |
| `OLLAMA_KEEP_ALIVE` | How long to keep a model loaded after the last request | `10m`, `0` (unload immediately), `-1` (never unload) |
| `OLLAMA_DEBUG` | Enable verbose server logging | `1` |
| `OLLAMA_MODELS` | Override the model storage directory | `/path/to/models` |
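These apply to the server process, not to individual requests, so they take effect when `ollama serve` is (re)started. For example, with illustrative values:

```shell
# Illustrative: keep models resident indefinitely and enable debug logging
OLLAMA_KEEP_ALIVE=-1 OLLAMA_DEBUG=1 ollama serve
```

If you run ollama as the macOS menu-bar app rather than `ollama serve` in a terminal, the variables must be set where the app can see them (e.g. via `launchctl setenv`) and the app restarted.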

Diagnosing Context Size Mismatches

If a model takes unexpectedly long on the first request, it is likely being reloaded because the requested num_ctx differs from what ollama last loaded it with. Ollama unloads and reloads the model whenever the context window size changes.

```shell
# Check what context size a model is currently loaded with
ollama ps   # CONTEXT column shows current value

# Force a specific context for a one-off test
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:30b-a3b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_ctx": 32768 }
}'
```

The correct fix is to bake num_ctx into the Modelfile so the context size is consistent across all clients.
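A minimal Modelfile for this looks like the following (the base model is taken from the examples above; the context value is whatever you standardize on):

```
# Modelfile: pin the context window so every client gets the same num_ctx
FROM qwen3:30b-a3b
PARAMETER num_ctx 32768
```

Then build it with `ollama create qwen3:30b-a3b-32k -f Modelfile` and point clients at the new tag.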

OpenCode Integration Notes

When using ollama as a provider in opencode, note that limit.context in the opencode model configuration is informational only — it controls the context-remaining display in the UI but is not sent to ollama as num_ctx. The model will use whatever num_ctx is set in its Modelfile (or ollama’s default) regardless of what opencode reports.

To ensure consistent behavior:

  1. Set num_ctx in the Modelfile for any model used with opencode.
  2. Match limit.context in opencode’s config to that value so the remaining-context indicator is accurate.
  3. Use ollama ps to confirm the loaded context size after the first request.
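As a sketch, the opencode side of steps 1–2 might look like this; the exact provider schema (field names, `baseURL` path) is an assumption to verify against your opencode version's documentation:

```json
{
  "provider": {
    "ollama": {
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": {
        "qwen3:30b-a3b-32k": {
          "limit": { "context": 32768, "output": 8192 }
        }
      }
    }
  }
}
```

The key point stands regardless of schema details: `limit.context` here only drives the UI indicator, so it must be kept in sync with the `num_ctx` baked into the Modelfile by hand.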