Memory

Memory pooling

A single GPU has a fixed amount of onboard memory (VRAM). For LLMs, this is the hard ceiling on what you can run — the model weights plus the KV cache must fit within it. When a model exceeds that limit, you need multiple GPUs.

The naive approach is to treat each GPU’s memory as isolated, assigning different layers to different devices. But a more powerful option is memory pooling: connecting GPUs with a high-bandwidth interconnect so that their combined memory is exposed to software as a single unified pool. Two GPUs with 24GB each become an effective 48GB addressable space, which directly determines the largest models you can run without resorting to workarounds like quantization or offloading weights to system RAM.
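As a back-of-the-envelope feasibility check, the sizing logic can be sketched as follows. The numbers are illustrative assumptions (fp16 weights at roughly 2 bytes per parameter, plus a flat allowance for KV cache and activations), not exact figures for any particular model:

```python
# Does a model fit in a (possibly pooled) memory space?
# Assumption: fp16 weights ~2 GB per billion parameters, plus a flat
# overhead allowance for KV cache and activations. Illustrative only.
def fits(params_billions: float, gpus: int, vram_gb_each: float,
         overhead_gb: float = 6.0) -> bool:
    weights_gb = params_billions * 2  # fp16: ~2 bytes per parameter
    return weights_gb + overhead_gb <= gpus * vram_gb_each

print(fits(13, 1, 24))  # a 13B fp16 model on one 24GB card -> False
print(fits(13, 2, 24))  # the same model across a 48GB pool -> True
```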

Pipeline parallelism vs. tensor parallelism

When a model spans multiple GPUs, there are two fundamental ways to split the work:

Pipeline parallelism divides the model by layer — GPU A runs the first half of the network, passes its output to GPU B which runs the second half. Communication between GPUs only happens at the handoff points between stages. This works over relatively modest interconnect bandwidth because the data exchanged is just the activations at each boundary.
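A minimal sketch of this layer-wise split, using NumPy arrays to stand in for weights on two separate GPUs (all shapes and values are illustrative):

```python
# Pipeline parallelism sketch: "GPU A" holds the first two layers,
# "GPU B" the last two; only the boundary activation crosses between them.
import numpy as np

rng = np.random.default_rng(0)
d = 64
layers = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)]
gpu_a, gpu_b = layers[:2], layers[2:]

def run_stage(x, stage):
    for w in stage:
        x = np.maximum(x @ w, 0.0)  # linear + ReLU per layer
    return x

x = rng.standard_normal((1, d))
boundary = run_stage(x, gpu_a)  # only this (1 x d) tensor is transferred
out = run_stage(boundary, gpu_b)

ref = run_stage(x, layers)      # same network on one device
print(np.allclose(out, ref))    # True
```

Note how small the inter-GPU payload is: a single activation tensor per handoff, which is why modest interconnect bandwidth suffices.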

Tensor parallelism divides individual layers across GPUs — each GPU holds a slice of the weight matrices and computes a portion of every operation, then the results are combined before moving on. This requires constant, high-bandwidth communication within each layer. The GPUs are in continuous coordination rather than passing a baton.
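The within-layer split can be sketched the same way: one weight matrix is divided column-wise across two "GPUs", each computes its slice of the output, and the slices are combined before the next layer (an all-gather in a real system). Shapes are illustrative:

```python
# Tensor parallelism sketch: split a weight matrix column-wise across
# two "GPUs", compute partial outputs, then combine them.
import numpy as np

rng = np.random.default_rng(0)
d = 64
w = rng.standard_normal((d, d))
x = rng.standard_normal((1, d))

w_gpu0, w_gpu1 = np.split(w, 2, axis=1)  # each GPU holds half the columns
y0 = x @ w_gpu0                          # computed "on GPU 0"
y1 = x @ w_gpu1                          # computed "on GPU 1"
y = np.concatenate([y0, y1], axis=1)     # combine before the next layer

print(np.allclose(y, x @ w))  # True
```

Unlike the pipeline case, this combine step happens inside every layer, which is what drives the constant inter-GPU traffic.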

Tensor parallelism generally produces lower latency per token because the GPUs are working in parallel on each computation rather than sequentially. Pipeline parallelism introduces pipeline bubbles (idle time waiting for the previous stage) and increases the time-to-first-token. For serving workloads where latency matters, tensor parallelism is preferred — but it’s only viable when the interconnect between GPUs is fast enough that the communication overhead doesn’t outweigh the parallelism benefit.
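The bubble cost can be estimated with a standard formula from the pipeline-parallel literature: with S stages and M microbatches in flight, the idle fraction is (S-1)/(M+S-1). The numbers below are illustrative, not measurements:

```python
# Classic pipeline-bubble estimate: fraction of time each stage sits idle.
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

print(bubble_fraction(2, 1))  # 0.5 -> a 2-stage pipeline processing one
                              # request at a time idles each GPU half the time
print(bubble_fraction(2, 8))  # more in-flight work shrinks the bubble
```

This is why pipeline parallelism fares better on throughput-oriented batch workloads than on latency-sensitive single-request serving.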

Interconnect bandwidth as the deciding factor

The bandwidth of the link between GPUs determines which parallelism strategy is practical:

  • High bandwidth (dedicated GPU interconnect): tensor parallelism is viable — GPUs can coordinate within a layer without communication becoming the bottleneck.
  • Lower bandwidth (standard PCIe): tensor parallelism’s constant inter-GPU traffic becomes the bottleneck. Pipeline parallelism is the practical choice, accepting the latency tradeoff.
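A rough comparison makes the gap concrete. Assume two GPUs, a 4096-wide fp16 hidden state (2 bytes per element), and two all-reduce-style exchanges per transformer layer across 32 layers; all of these are illustrative assumptions, and the bandwidth figures are rounded ballpark numbers:

```python
# Per-token tensor-parallel communication time over two interconnects.
hidden = 4096
bytes_per_elem = 2        # fp16
layers = 32
exchanges_per_layer = 2   # two combine steps per transformer layer

payload = hidden * bytes_per_elem * exchanges_per_layer * layers  # bytes/token

for name, gb_per_s in [("dedicated interconnect (~300 GB/s)", 300),
                       ("PCIe 4.0 x16 (~32 GB/s)", 32)]:
    t_us = payload / (gb_per_s * 1e9) * 1e6
    print(f"{name}: ~{t_us:.1f} us of communication per token")
```

Per-transfer latency overheads (ignored here) hit the PCIe path even harder, since the traffic is many small exchanges rather than one bulk copy.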

NVLink was an early example of a dedicated high-bandwidth GPU interconnect available on consumer hardware, enabling tensor parallelism for prosumer multi-GPU setups. It has since been removed from consumer-grade cards. In modern datacenter hardware, the same concept is implemented at much larger scale — high-speed switching fabrics connect many GPUs within a node, and high-performance networking (such as InfiniBand) extends this across nodes in a cluster.

PCIe lanes and why dedicated connections matter

PCIe is the standard bus connecting a GPU to the CPU and system memory. Each PCIe lane is a full-duplex serial link; a full-length GPU slot typically uses 16 lanes (x16), giving the card a direct pipe to the CPU. The total number of lanes a CPU provides is fixed, and how they’re allocated has real consequences for LLM workloads.

Physical vs. electrical width

A slot’s physical size and its actual wired lane count can differ. A motherboard may have a physically x16 slot that only has x8 or x4 lanes connected — common on secondary slots to cut costs. A card running at x8 instead of x16 has half the bandwidth to the CPU, which affects model load times and inter-GPU data movement.
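The effect on load time is easy to estimate. Assuming a 14 GB checkpoint (roughly a 7B-parameter model in fp16) and ~2 GB/s of usable bandwidth per PCIe 4.0 lane, both rounded illustrative figures:

```python
# Illustrative effect of electrical lane width on model load time.
model_gb = 14        # e.g. a 7B-parameter checkpoint in fp16
gbps_per_lane = 2    # approx. usable PCIe 4.0 bandwidth per lane

for lanes in (16, 8, 4):
    seconds = model_gb / (lanes * gbps_per_lane)
    print(f"x{lanes}: ~{seconds:.2f} s to load")
```

Halving the electrical width doubles the load time, and the same scaling applies to any inter-GPU traffic routed over the link.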

Shared lanes and bifurcation

When multiple GPUs are installed on a CPU with limited lanes, the platform must make trade-offs. Lanes are often bifurcated, so that two physically x16 slots each run at x8 (x8/x8), or devices share bandwidth through a PCIe switch on the motherboard. Either way, each GPU gets less than a full x16 allocation.

Why lane count matters for LLMs

For multi-GPU inference over PCIe, the PCIe link is the inter-GPU communication path. Reduced PCIe bandwidth directly limits which parallelism strategies are viable and how fast weights can be loaded into VRAM. A platform with enough dedicated lanes to give each GPU a full, uncontested x16 connection removes this bottleneck. Server-class CPUs designed for multi-GPU workloads often provide far more PCIe lanes than consumer chips precisely for this reason — enough that several GPUs can each have their own dedicated connection without competing.
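A quick lane-budget check makes the consumer/server gap concrete. The lane counts below are representative examples (consumer CPUs commonly expose on the order of 20-24 usable lanes, server parts 64-128), not exact product specifications:

```python
# How many GPUs can get an uncontested x16 link from a given CPU?
def full_x16_gpus(cpu_lanes: int, lanes_per_gpu: int = 16) -> int:
    return cpu_lanes // lanes_per_gpu

print(full_x16_gpus(24))   # typical consumer CPU -> 1
print(full_x16_gpus(128))  # server-class CPU     -> 8
```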