Nagesh Singh Chauhan

Reinforcement Learning from Human Feedback (RLHF)

Demystifying Concepts and Algorithms Behind Generative AI such as ChatGPT and Bard.



As AI models continue to grow in size and utility, it becomes imperative to devise strategies for ensuring their safety and mitigating potential biases. GPT-3 is an illustrative case. This large language model has 175 billion parameters, more than 100 times as many as its predecessor GPT-2. Notably, GPT-3 surpassed earlier models on numerous common natural language processing (NLP) benchmarks, even without retraining or fine-tuning for specific tasks.


Crucially, the creators of GPT-3 not only demonstrated the model's superior performance but also explored its broader societal implications, including a dedicated section on fairness, bias, and representation. Several examples stand out. GPT-3 exhibited racial bias: across the model sizes analyzed, "Asian" was consistently associated with positive sentiment and "Black" with negative sentiment. Gender bias was also evident, as occupations were more often associated with male gender identifiers than female ones. Religious bias was observable too, with terms like "violent," "terrorism," and "terrorist" co-occurring more strongly with Islam than with other religions.

These biases observed in GPT-3 can be traced back to the training data. Large language models rely heavily on extensive training datasets, and unfortunately, biases like the ones highlighted above are deeply ingrained within them. This phenomenon aligns with the well-known principle of "garbage-in-garbage-out" (GIGO), where the quality of model outputs is only as good as the quality of the data used for training.

What is Reinforcement Learning?

Reinforcement learning (RL) is a subfield of machine learning that involves training an agent to take actions in an environment in order to maximize a reward signal. In the context of large language models (LLMs), reinforcement learning is used to fine-tune pre-trained models so that they perform better at goals such as generating coherent and fluent text, answering questions, or completing tasks.

The basic idea behind RL for LLMs is to use a reward function that assigns a score to each action taken by the model, based on how well the action contributes to achieving the desired outcome. The model learns to optimize its behavior by maximizing the cumulative reward over time.

In the context of LLMs, the state represents the current input sequence, and the action represents the next word or token to generate. The reward function is designed to capture the quality of the generated text, and can be based on various factors such as:

  1. Perplexity: The model receives a positive reward for generating text that is less perplexing, indicating that it is more predictable and coherent.

  2. Fluency: The model receives a positive reward for generating text that is fluent and well-structured, with proper grammar and syntax.

  3. Coherence: The model receives a positive reward for generating text that is coherent and relevant to the given prompt or context.

  4. Engagement: The model receives a positive reward for generating text that engages the reader, such as by asking questions, providing interesting information, or eliciting emotions.

  5. Task completion: The model receives a positive reward for completing a task, such as answering a question or generating a summary of a document.

The reward function can be defined as a combination of these factors, for example a weighted sum, and can be tailored to the specific application or use case. Because the model is optimized against whatever the reward measures, the choice of factors and weights directly shapes the behavior it learns.
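As a rough illustration of combining such factors, the sketch below builds a reward from two of them: a perplexity-based score and a task-completion flag. The function names, the `1 / perplexity` scaling, and the equal weights are assumptions chosen for this example, not a standard recipe.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def combined_reward(token_logprobs, task_completed, w_fluency=0.5, w_task=0.5):
    """Weighted sum of a perplexity-based term and a task-completion term.

    The weights and the 1/perplexity scaling are illustrative assumptions.
    """
    # Perplexity lies in [1, inf); its inverse lies in (0, 1], so lower
    # perplexity (more predictable text) yields a higher score.
    fluency_score = 1.0 / perplexity(token_logprobs)
    task_score = 1.0 if task_completed else 0.0
    return w_fluency * fluency_score + w_task * task_score


# Four tokens, each generated with probability 0.5, give a perplexity of 2.
logprobs = [math.log(0.5)] * 4
print(perplexity(logprobs))
print(combined_reward(logprobs, task_completed=True))
```

In practice the individual scores would come from separate models or heuristics; the point here is only that the final reward is a single scalar per generated text.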

Some widely known families of RL algorithms include:

  1. Q-learning: A model-free RL algorithm that learns the optimal policy by iteratively improving an estimate of the action-value function.

  2. Deep Q-Networks (DQN): A type of Q-learning algorithm that uses a deep neural network to approximate the action-value function.

  3. Policy Gradient Methods: A class of model-free RL algorithms that learn the optimal policy by directly optimizing the expected cumulative reward.

  4. Actor-Critic Methods: A class of model-free RL algorithms that combine the benefits of policy gradient methods and value-based methods by learning both the policy and the value function simultaneously.
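To make the phrase "iteratively improving an estimate of the action-value function" concrete, here is a minimal sketch of the tabular Q-learning update. The state and action names, the learning rate `alpha`, and the discount `gamma` are placeholder values chosen for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one Q-learning step and return the updated Q(state, action).

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Toy example: unseen state-action pairs start at 0.0.
q = defaultdict(float)
actions = ["left", "right"]
first = q_learning_update(q, "s0", "right", 1.0, "s1", actions)
second = q_learning_update(q, "s0", "right", 1.0, "s1", actions)
print(first, second)  # the estimate moves toward the target: 0.1, then about 0.19
```

A DQN replaces the table `q` with a neural network, while policy gradient and actor-critic methods adjust the action probabilities directly instead of deriving them from Q-values.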

Reinforcement learning has shown promising results in improving the performance of LLMs, and has the potential to enable them to perform a wide range of tasks that require generation of coherent and fluent text.

What is Human Feedback?

Human feedback is an essential component of Reinforcement Learning from Human Feedback (RLHF). It refers to the input provided by humans to guide the learning process of an RL agent. In RLHF, the goal is to train an agent to perform a specific task, such as generating coherent and fluent text, by leveraging human expertise and preferences.

Human feedback can come in various forms, including ratings, annotations, and demonstrations. Ratings provide a quantitative measure of how well the agent's output meets human expectations. Annotations offer a qualitative perspective, allowing humans to highlight specific aspects of the output that need improvement. Demonstrations involve showing the agent examples of ideal outputs, which helps it understand what constitutes high-quality performance.

One key advantage of incorporating human feedback into RL is that it enables the agent to learn faster and more accurately. By providing explicit guidance, humans can help the agent avoid pitfalls and focus on the most important aspects of the task. This can significantly reduce the amount of trial and error required for the agent to achieve proficiency.

What is RLHF and How Does it Work?

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that combines reinforcement learning with human feedback to train an agent to perform a task. In traditional reinforcement learning, an agent learns to make decisions by interacting with an environment and receiving rewards or penalties for its actions. However, in many real-world applications the desired behavior is hard to capture in a hand-written reward function, and learning it purely through trial and error is not feasible or efficient.

RLHF addresses this challenge by incorporating human feedback into the training process. Instead of relying solely on the environment to provide rewards or penalties, the agent receives feedback from a human operator who can guide the learning process by providing labels or preferences for the agent's actions. The human feedback can be in the form of explicit ratings or implicit feedback, such as clicks or dwell time.

It took around 900 pieces of feedback from a human to teach this algorithm to backflip.

A system of this kind consists of three processes running in parallel:

  1. A reinforcement learning agent explores and interacts with its environment, such as an Atari game.

  2. Periodically, a pair of 1-2 second clips of its behaviour is sent to a human operator, who is asked to select which one best shows steps towards fulfilling the desired goal.

  3. The human’s choice is used to train a reward predictor, which in turn trains the agent. Over time, the agent learns to maximise the reward from the predictor and improve its behaviour in line with the human’s preferences.

The system separates learning the goal from learning the behaviour to achieve it.
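The third step, training a reward predictor from pairwise human choices, is commonly modeled with a Bradley-Terry style preference model. The sketch below assumes the predictor has already produced one scalar reward per clip; the function names and example values are hypothetical.

```python
import math


def preference_probability(reward_a, reward_b):
    """P(A is preferred over B) = sigmoid(reward_a - reward_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_loss(reward_a, reward_b, human_prefers_a):
    """Cross-entropy loss between the predicted preference and the human's choice."""
    p_a = preference_probability(reward_a, reward_b)
    p_chosen = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(p_chosen)


# Equal predicted rewards: the predictor is undecided (probability 0.5).
print(preference_probability(0.0, 0.0))
# The loss is small when the predictor agrees with the human, large when it disagrees.
print(preference_loss(2.0, 0.0, human_prefers_a=True))
print(preference_loss(2.0, 0.0, human_prefers_a=False))
```

Minimizing this loss over many labeled pairs pushes the predictor to assign higher rewards to the behaviour humans picked, and the agent is then trained against those predicted rewards.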

There are several approaches to implementing RLHF, including:

  1. Inverse reinforcement learning: This approach involves learning a reward function from expert demonstrations or preference judgments. The goal is to recover a reward function that would have caused the expert to take the demonstrated actions.

  2. Deep reinforcement learning: This approach uses deep neural networks to represent the value or policy functions of the agent. The human feedback is used to update the network weights and improve the agent's performance.

  3. Apprenticeship learning: This approach involves learning from a human teacher who provides feedback in the form of rewards or punishments. The agent learns to mimic the teacher's behavior and adapt to new situations based on the feedback received.

  4. Meta-learning: This approach involves learning how to learn from a set of related tasks. The agent learns to recognize new tasks and adapt its behavior based on the feedback received from the human operator.

RLHF vs Traditional Learning

During RLHF training of an LLM, a separate reward model learns to predict the reward a human would assign to a given response. The predicted reward is used to update the policy, which is the LLM itself viewed as a probability distribution over the possible next tokens. The policy update is typically performed with a policy gradient method such as Proximal Policy Optimization (PPO), rather than a value-based method such as deep Q-networks (DQN).
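To make the idea of updating a policy from a predicted reward concrete, here is a minimal sketch of the clipped surrogate objective from Proximal Policy Optimization (PPO), the algorithm used by the open-source tools discussed later in this post. It handles a single action; the `advantage` value is assumed to have been computed elsewhere from the predicted reward, and `clip_eps=0.2` is a commonly used default rather than a required setting.

```python
import math


def ppo_clipped_objective(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective for one action (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    """
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to move the policy
    # further than the clip range in a single update.
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy makes a good action 1.5x more likely: the gain is capped at 1.2.
print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0))
# The new policy makes a bad action 1.5x more likely: the penalty is not capped.
print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=-1.0))
```

In a real training loop this quantity is averaged over many sampled tokens and maximized by gradient ascent on the model weights.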

Within the realm of machine learning, two distinct paradigms emerge: conventional learning and Reinforcement Learning from Human Feedback (RLHF). These methodologies diverge in terms of how the reward function is managed and the extent of human participation in the process.

In traditional reinforcement learning, the reward function is defined manually and guides the learning trajectory. RLHF instead learns the reward function itself from feedback provided by humans, which removes the reliance on predetermined rewards and lets the training signal reflect human preferences that are hard to write down as rules.

In traditional learning, feedback tends to be confined to the labeled examples used during training. Once the model is trained, it operates autonomously, making predictions or classifications without ongoing human intervention. RLHF, by contrast, supports iterative learning: the model can use human feedback to keep refining its behavior, explore new actions, and correct errors it encounters throughout the learning process. This feedback loop lets the model improve over successive rounds and helps bridge the gap between human expertise and machine intelligence.

How does ChatGPT Utilise RLHF?

ChatGPT is an AI tool that generates conversational content in response to prompts. A key aim of a successful generative AI application is to mimic authentic human conversation in both wording and tone. Natural Language Processing (NLP) plays an essential role here, enabling the AI agent to comprehend the nuances of human language, both written and spoken.

ChatGPT employs RLHF (Reinforcement Learning from Human Feedback) to craft its conversational and lifelike responses to user queries. It is built on a large language model (LLM) trained on vast datasets, which predicts subsequent words to construct coherent sentences.

Nevertheless, LLMs possess their limitations and may not always fully grasp the user's query. The query might be too open-ended or ambiguously worded. To instill within ChatGPT the art of composing dialogue resembling authentic human conversation, RLHF was harnessed in its training process. This approach facilitates the AI in grasping human expectations and fine-tuning its output.

The significance of this training methodology extends beyond merely predicting the next word; it aids in constructing whole, meaningful sentences. This distinct feature distinguishes ChatGPT from conventional chatbots, which often provide pre-scripted, fixed responses. The distinctiveness of ChatGPT stems from its tailored training involving human interaction, enabling it to fathom the question's intent and supply responses that sound both natural and genuinely helpful.

Advantages of RLHF

There are several advantages to using RLHF for training LLMs:

  • Accelerated training: RLHF utilizes human feedback to expedite the training of reinforcement learning models. By using human guidance instead of relying solely on flawed or limited goals, RLHF saves time. For example, it can improve AI summary generation by helping the model adjust to different domains or contexts.

  • Improved performance: RLHF allows for the improvement of reinforcement learning models through human feedback. This process addresses flaws and enhances the model's decision-making. For example, RLHF can enhance chatbot responses by incorporating human feedback on quality and values, resulting in satisfied customers.

  • Reduced cost and risk: RLHF mitigates the costs and risks associated with training RL models from scratch. By leveraging human expertise, expensive trials can be skipped, and errors can be identified earlier. In drug discovery, RLHF can identify promising candidate molecules for testing, reducing the time and cost of screening.

  • Enhanced safety and ethics: RLHF trains reinforcement learning models to make ethical and safe decisions by incorporating human feedback. For instance, RLHF can assist medical models in recommending treatments that prioritize patient safety and values.

  • Increased user satisfaction: RLHF enables the customization of reinforcement learning models based on user feedback and preferences. By incorporating human insights, personalized experiences that cater to user needs can be created. RLHF can enhance recommendations in recommendation systems by incorporating user feedback.

  • Continuous learning and adaptation: RLHF facilitates the continuous learning and updating of reinforcement learning models through human feedback. By regularly receiving feedback, the models can stay up-to-date with changing conditions. For example, RLHF can assist fraud detection models in adjusting and identifying new fraud patterns more effectively.

Challenges of RLHF

While RLHF has many advantages, there are also some challenges to consider:

  • Cost: Providing high-quality human feedback can be costly, especially for large-scale training.

  • Scalability: RLHF can be less scalable than traditional methods due to the need for human evaluators.

  • Noisy Feedback: Human feedback can be noisy and inconsistent, which can negatively impact the training process.

  • Evaluator Bias: Human evaluators may introduce biases into the training process, which can affect the performance of the LLM.

To overcome these challenges, researchers have developed various techniques, such as using multiple evaluators, aggregating feedback, and incorporating prior knowledge into the feedback mechanism.
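As one simple illustration of using multiple evaluators and aggregating their feedback, the sketch below takes a majority vote over the labels given to a single comparison and reports how strongly the evaluators agreed. The function name, the label strings, and the tie-handling rule are assumptions made for this example.

```python
from collections import Counter


def aggregate_preferences(votes):
    """Majority vote over evaluator labels for one comparison.

    Returns (winning_label, agreement), where agreement is the fraction of
    evaluators who chose the most common label. On a tie, the label is None.
    """
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(votes)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, agreement  # tie: no reliable label for this comparison
    return top_label, agreement


print(aggregate_preferences(["A", "A", "B"]))  # majority for "A", agreement 2/3
print(aggregate_preferences(["A", "B"]))       # tie, so no label
```

Comparisons with low agreement or ties can be dropped or sent to additional evaluators, which reduces the influence of any single noisy or biased judgment.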

Open-source tools for RLHF

The first code released to perform RLHF on LLMs was from OpenAI in TensorFlow in 2019.

Today, there are already a few active repositories for RLHF in PyTorch that grew out of this. The primary repositories are Transformers Reinforcement Learning (TRL), TRLX which originated as a fork of TRL, and Reinforcement Learning for Language models (RL4LMs).

TRL is designed to fine-tune pre-trained LLMs in the Hugging Face ecosystem with PPO. TRLX is an expanded fork of TRL built by CarperAI to handle larger models for online and offline training. At the moment, TRLX has an API capable of production-ready RLHF with PPO and Implicit Language Q-Learning (ILQL) at the scales required for LLM deployment (e.g. 33 billion parameters). Future versions of TRLX will allow for language models up to 200B parameters. As such, interfacing with TRLX is optimised for machine learning engineers with experience at this scale.

RL4LMs offers building blocks for fine-tuning and evaluating LLMs with a wide variety of RL algorithms (PPO, NLPO, A2C and TRPO), reward functions and metrics. Moreover, the library is easily customisable, which allows training of any encoder-decoder or encoder transformer-based LM on any arbitrary user-specified reward function. Notably, it is well-tested and benchmarked on a broad range of tasks in recent work comprising up to 2000 experiments, which highlight several practical insights on data budget comparison (expert demonstrations vs. reward modeling), handling reward hacking and training instabilities, etc. RL4LMs' current plans include distributed training of larger models and new RL algorithms.

Both TRLX and RL4LMs are under heavy further development, so expect more features beyond these soon.


Harnessing reinforcement learning algorithms for learning from human feedback represents a potent approach that combines the computational capabilities of AI with the valuable insights contributed by humans. By delving into the core principles of this field, acknowledging its hurdles, and envisioning its potential evolutionary paths, we can engineer more efficient systems while upholding their alignment with the original human-designed goals.

As technological strides continue to broaden our horizons, granting AI-driven machines a deeper comprehension of human interactions through reinforced feedback, we embark on a journey to explore this realm together, fostering an open-minded approach to novel ideas. Let's collaboratively seek solutions that are firmly grounded in shared values, bridging the divide between humanity and machines!

