Inference Stack

Both llama.cpp and vLLM are production-quality engines. llama.cpp prioritizes broad compatibility — it’s typically the first to support new architectures (MoE, latest model releases) and can run models that don’t fully fit in VRAM via mixed CPU/GPU inference (offloading some layers to the GPU and keeping the rest on the CPU). vLLM focuses on maximum token throughput; it requires the model to fit entirely in VRAM, and its advantage is most pronounced under concurrent load (multiple users, batched requests), the workload its continuous-batching architecture is built to handle.
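As a minimal sketch of the mixed CPU/GPU mode, llama.cpp's bundled server accepts a layer-offload count; the model path here is a placeholder, and the right layer count depends on your VRAM:

```shell
# Offload the first 24 transformer layers to the GPU and keep the
# remainder on the CPU -- useful when the full model exceeds VRAM.
# "model.gguf" is a placeholder; tune --n-gpu-layers (-ngl) per GPU.
llama-server -m model.gguf --n-gpu-layers 24 --ctx-size 4096
```

Setting the offload count to a very large value (e.g. `-ngl 99`) pushes all layers to the GPU when the model fits.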

  • llama.cpp - An open-source C/C++ inference engine for running GGUF models locally. Supports CPU, CUDA (NVIDIA), ROCm (AMD), Vulkan, and Metal (Apple) backends.
  • Ollama - A user-friendly wrapper around llama.cpp that simplifies model downloading, management, and serving via a local REST API.
  • Vulkan - A cross-platform GPU API supported by llama.cpp for GPU-accelerated inference. Ships with standard Mesa drivers on Linux. Vendor-specific alternatives include CUDA (NVIDIA) and ROCm (AMD).
  • vLLM - A high-throughput inference server framework for production deployments and multi-user serving. Supports CUDA (NVIDIA), ROCm (AMD), and CPU backends.
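As one concrete way to drive the stack above, the following is a minimal stdlib-only sketch of calling a local Ollama server's REST API; it assumes Ollama is serving on its default port (11434) and that the model tag used in the example has been pulled:

```python
# Minimal sketch: querying a local Ollama server's REST API using only
# the Python standard library. Assumes Ollama is running on its default
# port (11434); the model tag in the usage example is a placeholder.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Assemble a non-streaming POST to the /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def generate(model: str, prompt: str) -> str:
    """Send the request and return the model's full response text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.load(resp)["response"]

# Usage (requires a running Ollama server with the model pulled):
#   print(generate("llama3", "Why is the sky blue?"))
```

Because Ollama wraps llama.cpp, this is the same inference path as running llama.cpp directly, just behind a managed HTTP interface; vLLM instead exposes an OpenAI-compatible endpoint, so clients built for that API work against it unchanged.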