Glossary

This glossary defines key terms for LLMs and local inference. Each entry has its own page with a more detailed explanation.

Models & Architecture

  • Context window - The maximum number of tokens a model can process in a single inference pass
  • Dense model - A model architecture where all parameters are used for every inference pass
  • Distilled model - A smaller model trained to mimic the outputs of a larger, more capable model
  • Instruct model - A base model fine-tuned with instruction-following data and/or RLHF
  • KV cache - Key-Value cache storing intermediate attention computations to avoid recomputation
  • MoE (Mixture of Experts) - A model architecture where only a subset of expert sub-networks is activated for each token, reducing compute per inference pass
  • Open-weight model - A model whose trained weights are publicly released for download and local use
  • Parameter - A single numerical value in a neural network learned during training
  • Reasoning model - A model trained with extended chain-of-thought reasoning before giving a final answer
  • RLHF (Reinforcement Learning from Human Feedback) - A training technique that uses human preference data to align model outputs with desired behavior
  • t/s (tokens per second) - The standard metric for LLM inference speed
  • Token - The basic unit of text a language model processes
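Several of the terms above interact: the context window, the token, and the KV cache together determine how much memory attention state consumes. As a hedged back-of-the-envelope sketch, the snippet below estimates KV cache size for a hypothetical 7B-class model; all dimensions are illustrative assumptions, not the specs of any particular model.

```python
# KV cache size estimate for a hypothetical 7B-class transformer.
# Every dimension here is an illustrative assumption.
n_layers = 32
n_kv_heads = 8        # grouped-query attention reduces KV heads
head_dim = 128
context_len = 8192    # tokens held in the context window
bytes_per_elem = 2    # FP16 cache entries

# Keys AND values (factor of 2) are cached per layer for every token.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
print(f"KV cache at full context: {kv_bytes / 1024**3:.2f} GiB")  # → 1.00 GiB
```

This is why long context windows can consume gigabytes of memory on top of the model weights themselves.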

Quantization

  • BPW (bits per weight) - A measure of average quantization density used to compare quantization schemes
  • GGUF - The file format used by llama.cpp to store quantized model weights and metadata
  • IQ (importance-matrix quantization) - Quantizations that use a calibration dataset to apply higher precision selectively to important weights
  • Q4_K_M - 4-bit K-quant medium variant — a popular default that balances quality and size for general-purpose use
  • Q6_K - 6-bit K-quant — higher quality than Q4, generally considered near-lossless
  • Q8_0 - 8-bit quantization — near-identical quality to FP16 at roughly half the memory footprint
  • Quantization - Reducing numerical precision of model weights to decrease memory footprint and increase inference speed
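BPW translates directly into file size: size ≈ parameters × BPW ÷ 8 bytes. The sketch below applies this to a 7B-parameter model using ballpark average BPW figures for the quant types listed above; the exact averages vary by model, so treat these numbers as approximations.

```python
# Estimate on-disk/in-memory size of a quantized model from its
# parameter count and average bits per weight (BPW).
def quantized_size_gib(n_params: float, bpw: float) -> float:
    """Size in GiB: params * bits-per-weight / 8 bits-per-byte."""
    return n_params * bpw / 8 / 1024**3

params_7b = 7e9
# BPW values below are rough ballpark averages, not exact format specs.
for name, bpw in [("Q4_K_M", 4.85), ("Q6_K", 6.56), ("Q8_0", 8.5)]:
    print(f"{name}: ~{quantized_size_gib(params_7b, bpw):.1f} GiB")
```

The same arithmetic explains the glossary's Q8_0 note: at ~8.5 BPW a model takes roughly half the memory of FP16 (16 BPW).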

Hardware & Memory

  • Memory bandwidth - The rate at which data can be read from memory — the primary bottleneck for LLM token generation, since each generated token requires reading the model weights
  • Unified memory - A memory architecture where the CPU and GPU share the same physical RAM pool
  • VRAM - Video RAM — dedicated GPU memory used to store model weights and KV cache during inference
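The bandwidth bottleneck gives a simple rule of thumb: since generating one token requires reading roughly all model weights once, the theoretical ceiling on t/s is bandwidth divided by model size. The figures in the sketch below are illustrative assumptions, not measurements of any real system.

```python
# Rough upper bound on generation speed for a bandwidth-bound model:
# each output token requires streaming (roughly) all weights from memory.
def max_tokens_per_sec(bandwidth_gbs: float, model_size_gb: float) -> float:
    return bandwidth_gbs / model_size_gb

# Hypothetical example: ~400 GB/s unified memory, 4 GB quantized model.
print(f"~{max_tokens_per_sec(400, 4):.0f} t/s theoretical ceiling")  # → ~100 t/s
```

Real-world speeds land below this ceiling, but the ratio shows why quantization (a smaller model to stream) and faster memory both raise t/s.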