Mini-Batch Gradient Descent

Why process data in batches of 32-64 instead of one example at a time or all 60,000 at once?

You can think of mini-batch gradient descent as a middle ground between “stochastic gradient descent” (compute the gradient for one randomly chosen training example at a time) and “batch gradient descent” (compute the average gradient over the entire dataset before each update). In practice it works like the latter, but with a much smaller, configurable batch size.
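
Here’s a minimal sketch of that middle ground (the toy data, dimensions, and learning rate are made up for illustration). Setting batch_size = 1 would recover stochastic gradient descent, while batch_size = len(X) would recover full-batch gradient descent:

```python
import torch

# Toy linear-regression data: 60,000 examples with 10 features each.
X = torch.randn(60_000, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(60_000, 1)

w = torch.zeros(10, 1, requires_grad=True)
lr, batch_size = 0.1, 64   # batch_size is the knob between SGD and full-batch GD

perm = torch.randperm(len(X))                    # shuffle once per epoch
for start in range(0, len(X), batch_size):
    idx = perm[start:start + batch_size]         # indices for one mini-batch
    loss = ((X[idx] @ w - y[idx]) ** 2).mean()   # average loss over the mini-batch
    loss.backward()                              # gradient of that averaged loss
    with torch.no_grad():
        w -= lr * w.grad                         # one parameter update per mini-batch
        w.grad.zero_()
```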

How It Works

In PyTorch, the torch.optim.SGD optimizer performs mini-batch gradient descent in practice: the loss is typically averaged over all samples in the provided batch (most loss functions default to reduction="mean"), so loss.backward() produces a single batch-averaged gradient, and optimizer.step() uses it to update the model parameters in one optimization step.
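
A minimal sketch of that loop, assuming an MNIST-shaped placeholder dataset (60,000 examples of 784 features) and a simple linear model; the batch size and learning rate are just illustrative:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model, only to show the shape of the training loop.
dataset = TensorDataset(torch.randn(60_000, 784), torch.randint(0, 10, (60_000,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

model = nn.Linear(784, 10)
loss_fn = nn.CrossEntropyLoss()   # default reduction="mean" averages over the batch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)  # mean loss over this mini-batch
    loss.backward()                         # .grad now holds the batch-averaged gradient
    optimizer.step()                        # one update from that single averaged gradient
```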

Normalization helps here too - with input features on a similar scale, the noisier gradient estimates from small batches are less likely to overshoot, so we can get away with smaller batch sizes.
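
For input normalization, one common pattern (shown here with made-up raw pixel data) is to standardize features to roughly zero mean and unit variance before training:

```python
import torch

raw = torch.rand(60_000, 784) * 255    # pretend raw pixel intensities in [0, 255]
mean, std = raw.mean(), raw.std()
normalized = (raw - mean) / std        # roughly zero mean, unit variance
```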

Resources