RLHF and DPO Compared

Introduction

Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are two approaches for aligning large language models with human guidance. This article compares the two, offering insight into their mechanisms, their applications, and their implications for the future of Gen AI technology.

What is Reinforcement Learning from Human Feedback (RLHF)?

Reinforcement Learning from Human Feedback (RLHF) is an approach in artificial intelligence that uses human preferences and guidance to enhance machine learning models. It combines reinforcement learning and supervised learning to enable Gen AI systems to learn and make decisions that are more closely aligned with human intent.

*RL training diagram from OpenAI's "Learning from human preferences"

Unlike traditional reinforcement learning, RLHF uses human feedback as a source of guidance, helping Gen AI systems navigate complex decision spaces and make more informed choices. By incorporating human feedback, RLHF can improve model performance, enhance user experiences, and contribute to the responsible development of Gen AI technology. It is a practical tool for training LLMs to produce higher-quality, contextually relevant text.

How does RLHF work?

1. Supervised Fine-tuning (SFT)

Supervised fine-tuning (SFT) is the first step in RLHF: the language model's weights are initialized from a model fine-tuned on curated demonstrations rather than chosen at random, which enables strong downstream task performance. The best SFT checkpoint is then selected based on reward model scores on a validation set.
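
As a rough illustration, the sketch below fine-tunes a small causal language model on a couple of hypothetical demonstration strings using Hugging Face Transformers; the model name, data, and hyperparameters are placeholders, not a prescribed recipe.

```python
# Minimal SFT sketch: fine-tune a causal LM on human-written demonstrations.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = AdamW(model.parameters(), lr=1e-5)

demonstrations = [  # hypothetical SFT data: prompt plus a human-written completion
    "Q: What is RLHF?\nA: RLHF fine-tunes a model using human feedback.",
    "Q: What does a reward model do?\nA: It scores a response with a scalar preference value.",
]

model.train()
for text in demonstrations:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    # Standard next-token cross-entropy: the labels are the input ids themselves.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```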

2. The Reward Model

The reward model (RM) in RLHF is a system that takes in a sequence of text and returns a scalar reward that numerically represents human preference. The RM is calibrated on human preference data and can be an end-to-end language model or a modular system that outputs a reward.
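
A minimal sketch of such a reward model follows: a classification head on a small encoder produces one scalar per sequence, trained on pairwise comparisons so that the preferred response receives the higher score. The backbone name and the example pair are placeholders.

```python
# Reward model sketch: transformer backbone with a scalar head, trained with a
# Bradley-Terry style pairwise loss on human preference comparisons.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "distilbert-base-uncased"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name, num_labels=1)

def reward(texts):
    """Return one scalar reward per input text."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return reward_model(**batch).logits.squeeze(-1)

# Pairwise training step: the chosen response should outscore the rejected one.
chosen = ["Prompt ... followed by a helpful, accurate answer."]
rejected = ["Prompt ... followed by an evasive or incorrect answer."]
loss = -F.logsigmoid(reward(chosen) - reward(rejected)).mean()
loss.backward()
```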

3. Fine-Tuning

In the final step, reinforcement learning (RL) is used to optimize the language model's parameters so that the expected reward from the reward model is maximized. Traditional LLM training minimizes a loss with respect to known correct answers; in RLHF, the reward model itself serves as the objective function, acting like a learnable loss tailored to the end goal and providing greater optimization freedom.
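
The snippet below is a simplified, REINFORCE-style stand-in for this optimization step (PPO is the usual choice in practice): the policy's sampled completions are scored by the reward model, a KL penalty against the frozen SFT reference keeps the policy from drifting, and the surrogate loss pushes up the probability of high-reward completions. All tensor names are illustrative.

```python
import torch

def policy_gradient_step(logprobs_policy, logprobs_ref, rewards, kl_coef=0.1):
    """One simplified policy-gradient update on sampled completions.
    logprobs_policy: (batch, seq) token log-probs under the current policy
    logprobs_ref:    (batch, seq) token log-probs under the frozen SFT reference
    rewards:         (batch,) scalar scores from the reward model
    """
    seq_logprob = logprobs_policy.sum(dim=-1)
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)   # per-sequence KL estimate
    shaped_reward = rewards - kl_coef * kl               # reward minus KL penalty
    # REINFORCE surrogate: gradients flow only through the policy log-probs.
    return -(shaped_reward.detach() * seq_logprob).mean()
```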

What is Direct Preference Optimization (DPO)?

Direct Preference Optimization (DPO) is an alternative method to RLHF. It simplifies the process by building a dataset of human preference pairs, each containing a prompt and two possible completions, one preferred and one dispreferred. DPO is a computationally lightweight approach that treats the constrained reward maximization problem as a classification problem on human preference data, eliminating the need for reward model fitting, extensive sampling, and extensive hyperparameter tuning.
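
For illustration, a preference dataset can be as simple as a list of records like the following; the field names are placeholders, not a specific library's schema.

```python
# Hypothetical DPO preference pairs: one prompt, one preferred and one
# dispreferred completion per record.
preference_data = [
    {
        "prompt": "Explain what a reward model is.",
        "chosen": "A reward model scores a response with a scalar that reflects human preference.",
        "rejected": "It's a thing that does stuff with rewards.",
    },
]
```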

How does DPO work?

1. Supervised Fine-tuning (SFT)

Supervised fine-tuning (SFT) is the initial step in DPO, where an LLM is trained on a labeled dataset to create a clear mapping between inputs and desired outputs. This method, when combined with preference learning, molds the model’s responses based on human-defined criteria, ensuring they align more closely with specific requirements. SFT refines the model’s outputs to ensure they are not only accurate but also appropriate and consistent.

2. Preference Data

Preference data is a set of options or alternatives related to a specific prompt, evaluated by annotators according to guidelines. The goal is to rank these options from most preferred to least preferred, providing insight into human preferences. This information is used to fine-tune models to produce outputs that align with human expectations. After supervised fine-tuning, the model undergoes preference learning using preference data, ideally drawn from the same distribution as the SFT examples. DPO's simplicity lies in defining the preference loss directly as a function of the policy, rather than of a separately trained reward model.
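
Concretely, the DPO loss treats each preference pair as a binary classification problem over log-probabilities from the policy being trained and the frozen SFT reference. The sketch below shows the standard form of that loss, with beta as the usual scaling coefficient on the implicit reward.

```python
# Minimal sketch of the DPO loss for a batch of preference pairs. Inputs are
# summed token log-probs of the chosen/rejected completions under the policy
# and under the frozen SFT reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are the log-ratios between the policy and the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification on preferences: widen the chosen-minus-rejected margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```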

Then, what is the difference between RLHF and DPO?

RLHF’s adaptability tailors models to specific tasks, enhancing versatility. Its suitability for numerical ratings or textual corrections underscores its effectiveness, making it valuable for exploring new possibilities in text generation or for leveraging pre-trained reward models. DPO, on the other hand, is simpler and more efficient, delivering comparable performance with fewer computational resources and without the convergence, drift, or distribution-mismatch problems of RL training. It solves the constrained reward maximization problem in a single policy-training phase, making it more efficient and robust.

RLHF supports diverse forms of human feedback, such as ratings or free-text corrections, while DPO requires binary preferences. Numerical or textual feedback is more informative but more subjective, whereas binary feedback is more objective but less informative. The trade-offs depend entirely on the context and purpose of the results.

Conclusion

In conclusion, the differences between RLHF and DPO show the varying approaches taken to develop Gen AI systems. By embracing the strengths of both methodologies, we may pave the way for more robust, user-friendly, and ethically sound AI systems that truly enhance our daily lives. As these technologies evolve, the prospect of a harmonious collaboration between humans and machines becomes increasingly promising, with Gen AI that not only understands but also empathizes with human preferences, leading to more responsible and impactful advances in the field.