Glossary

This glossary defines key terms for LLMs and local inference. Each entry has its own page with a more detailed explanation.

Models & Architecture

  • Context window - The maximum number of tokens a model can process in a single inference pass
  • Dense model - A model architecture where all parameters are used for every inference pass
  • Distilled model - A smaller model trained to mimic the outputs of a larger, more capable model
  • Instruct model - A base model fine-tuned with instruction-following data and/or RLHF
  • KV cache - Key-Value cache storing intermediate attention computations to avoid recomputation
  • MoE (Mixture of Experts) - A model architecture where only a subset of expert sub-networks is activated for each token, reducing compute per inference pass
  • Open-weight model - A model whose trained weights are publicly released for download and local use
  • Parameter - A single numerical value in a neural network learned during training
  • Reasoning model - A model trained with extended chain-of-thought reasoning before giving a final answer
  • RLHF (Reinforcement Learning from Human Feedback) - A training technique that uses human preference data to align model outputs with desired behavior
  • t/s (tokens per second) - The standard metric for LLM inference speed
  • Token - The basic unit of text a language model processes
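Several of the terms above interact: the context window, the token, and the KV cache together determine how much memory attention state consumes. As a hedged back-of-the-envelope sketch, the snippet below estimates KV cache size for a hypothetical 7B-class model; all dimensions are illustrative assumptions, not the specs of any particular model.

```python
# KV cache size estimate for a hypothetical 7B-class transformer.
# Every dimension here is an illustrative assumption.
n_layers = 32
n_kv_heads = 8        # grouped-query attention reduces KV heads
head_dim = 128
context_len = 8192    # tokens held in the context window
bytes_per_elem = 2    # FP16 cache entries

# Keys AND values (factor of 2) are cached per layer for every token.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
print(f"KV cache at full context: {kv_bytes / 1024**3:.2f} GiB")  # → 1.00 GiB
```

This is why long context windows can consume gigabytes of memory on top of the model weights themselves.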

Quantization

  • BPW (bits per weight) - A measure of average quantization density used to compare quantization schemes
  • GGUF - The file format used by llama.cpp to store quantized model weights and metadata
  • IQ (importance-matrix quantization) - Quantizations that use a calibration dataset to apply higher precision selectively to important weights
  • Q4_K_M - 4-bit K-quant medium variant — a popular default that balances quality and size for general-purpose use
  • Q6_K - 6-bit K-quant — higher quality than Q4, generally considered near-lossless
  • Q8_0 - 8-bit quantization — near-identical quality to FP16 at roughly half the memory footprint
  • Quantization - Reducing numerical precision of model weights to decrease memory footprint and increase inference speed
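BPW translates directly into file size: size ≈ parameters × BPW ÷ 8 bytes. The sketch below applies this to a 7B-parameter model using ballpark average BPW figures for the quant types listed above; the exact averages vary by model, so treat these numbers as approximations.

```python
# Estimate on-disk/in-memory size of a quantized model from its
# parameter count and average bits per weight (BPW).
def quantized_size_gib(n_params: float, bpw: float) -> float:
    """Size in GiB: params * bits-per-weight / 8 bits-per-byte."""
    return n_params * bpw / 8 / 1024**3

params_7b = 7e9
# BPW values below are rough ballpark averages, not exact format specs.
for name, bpw in [("Q4_K_M", 4.85), ("Q6_K", 6.56), ("Q8_0", 8.5)]:
    print(f"{name}: ~{quantized_size_gib(params_7b, bpw):.1f} GiB")
```

The same arithmetic explains the glossary's Q8_0 note: at ~8.5 BPW a model takes roughly half the memory of FP16 (16 BPW).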

Hardware & Memory

  • Memory bandwidth - The rate at which data can be read from memory — the primary bottleneck for LLM token generation, since each generated token requires reading the model weights
  • Unified memory - A memory architecture where the CPU and GPU share the same physical RAM pool
  • VRAM - Video RAM — dedicated GPU memory used to store model weights and KV cache during inference
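The bandwidth bottleneck gives a simple rule of thumb: since generating one token requires reading roughly all model weights once, the theoretical ceiling on t/s is bandwidth divided by model size. The figures in the sketch below are illustrative assumptions, not measurements of any real system.

```python
# Rough upper bound on generation speed for a bandwidth-bound model:
# each output token requires streaming (roughly) all weights from memory.
def max_tokens_per_sec(bandwidth_gbs: float, model_size_gb: float) -> float:
    return bandwidth_gbs / model_size_gb

# Hypothetical example: ~400 GB/s unified memory, 4 GB quantized model.
print(f"~{max_tokens_per_sec(400, 4):.0f} t/s theoretical ceiling")  # → ~100 t/s
```

Real-world speeds land below this ceiling, but the ratio shows why quantization (a smaller model to stream) and faster memory both raise t/s.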