KV cache

Key-Value cache. Stores the key and value tensors computed for previous tokens during autoregressive generation, so they don't have to be recomputed at every decoding step. Grows with context length and consumes memory on top of the model weights themselves.

The KV cache lives in VRAM during GPU inference — it competes directly with model weights for the same memory pool. On a fully GPU-resident setup, both must fit together. When using a mixed CPU/GPU inference engine (such as llama.cpp with layers offloaded to CPU), the KV cache may spill into system RAM instead, which is slower to access but removes the hard VRAM ceiling. The memory cost of the KV cache scales with context length, number of layers, and the model’s hidden dimension size, which is why large context windows are expensive even on models that otherwise fit comfortably in VRAM.
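The scaling described above can be sketched as a back-of-the-envelope size estimate. This is a minimal sketch, not any engine's actual accounting: it assumes a standard transformer storing two tensors (keys and values) per layer, and the example config numbers (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative, roughly Llama-2-7B-shaped.

```python
def kv_cache_bytes(context_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2, batch: int = 1) -> int:
    """Estimate KV cache size in bytes.

    Two tensors per layer (K and V), each of shape
    [batch, n_kv_heads, context_len, head_dim].
    """
    return 2 * n_layers * batch * n_kv_heads * context_len * head_dim * dtype_bytes

# Illustrative config: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes)
size = kv_cache_bytes(context_len=4096, n_layers=32,
                      n_kv_heads=32, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB at a 4096-token context
```

Doubling the context doubles the cache, which is why a model that fits comfortably at 4K tokens can run out of VRAM at 32K. Models using grouped-query attention shrink `n_kv_heads` relative to the attention head count, cutting this cost proportionally.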