VRAM

← Back to Glossary

Memory on a discrete GPU, used to store model weights and KV cache during inference. The hard ceiling for what models can run at full GPU speed. On discrete GPUs like the NVIDIA RTX 4080 Super (16GB) or AMD Radeon RX 7900 XTX (24GB), VRAM is separate from system RAM and fixed at the factory. Contrast with unified memory, where CPU and GPU share the same pool.

Unified Memory for CUDA - NVIDIA

Referenced in

Context window
KV cache
MoE (Mixture of Experts)
Q8_0
Unified memory
Inference Stack
Model Evaluation
Memory

Unified memory

Inference Stack

Codex

Matt Oswalt

Title here

VRAM

Referenced in