RLHF (Reinforcement Learning from Human Feedback)

A training technique that uses human preference data to align model outputs with desired behavior. Human raters compare pairs of model responses and indicate which is better; this signal trains a reward model, which then guides further fine-tuning via reinforcement learning. Widely used to produce instruct and chat models from base pretrained models.
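The core of the pipeline is the reward-model step: a model is trained so that it assigns higher scalar scores to the responses human raters preferred. Below is a minimal sketch of that step using a Bradley-Terry style pairwise loss; the TinyRewardModel class, the toy token data, and all hyperparameters are illustrative assumptions, not the implementation of any particular system.

```python
# Illustrative sketch of reward-model training from preference pairs.
# All names and data here are hypothetical; real RLHF pipelines score
# responses with a fine-tuned language model rather than a toy scorer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps a (toy) tokenized response to a single scalar reward."""
    def __init__(self, vocab_size: int = 1000, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings, then project to a scalar score.
        return self.head(self.embed(tokens).mean(dim=1)).squeeze(-1)

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Each preference pair holds a human-preferred ("chosen") and a
# dispreferred ("rejected") response, here as random toy token ids.
chosen = torch.randint(0, 1000, (8, 16))
rejected = torch.randint(0, 1000, (8, 16))

for _ in range(100):
    r_chosen, r_rejected = model(chosen), model(rejected)
    # Bradley-Terry pairwise loss: push the chosen response's reward
    # above the rejected one's.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained reward model then supplies the scalar signal that a
# reinforcement-learning step (commonly PPO) uses to fine-tune the policy.
```

The same pairwise-preference idea carries over to production-scale RLHF; only the scorer (a full language model with a value head) and the optimization details change.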
