Welcome to the future of AI: generative AI! Have you ever wondered how machines learn to understand human language and respond accordingly? Let's take a look at ChatGPT, the revolutionary language model developed by OpenAI. Built on the GPT-3.5 architecture, ChatGPT has taken the world by storm, transforming how we communicate with machines and opening up endless possibilities for human-machine interaction. The race has officially begun with the recent launch of ChatGPT's rival, Google Bard, powered by PaLM 2. In this article, we will walk through the steps involved in training ChatGPT, including pretraining and Reinforcement Learning from Human Feedback (RLHF).
Generative AI opens up new creative possibilities that we never thought were possible before – Douglas Eck, Research Scientist at Google Brain
We will explore how ChatGPT can comprehend and generate human-like text with remarkable accuracy. Get ready to be amazed by the cutting-edge technology behind ChatGPT and discover the limitless potential of this powerful language model.
The key objectives of this article are:
- Discuss the steps involved in the model training of ChatGPT.
- Find out the advantages of using Reinforcement Learning from Human Feedback (RLHF).
- Understand how humans are involved in making models like ChatGPT better.
Overview of ChatGPT Training
ChatGPT is a Large Language Model (LLM) optimized for dialogue. It is built on top of GPT-3.5 using Reinforcement Learning from Human Feedback (RLHF) and is trained on massive volumes of internet data.
There are three main steps involved in building ChatGPT:
- Pretraining LLM
- Supervised Finetuning of LLM (SFT)
- Reinforcement Learning from Human Feedback (RLHF)
The first step is to pretrain the LLM (GPT-3.5) on unsupervised data to predict the next word in a sentence. This teaches the LLM the representation and various nuances of text.
In the next step, we finetune the LLM on demonstration data: a dataset of prompts paired with human-written responses. This optimizes the LLM for dialogue.
In the final step, we use RLHF to steer the responses generated by the LLM, teaching the model to prefer the better of its candidate responses.
Now, we will discuss each step in detail.
I hope you are familiar with what Large Language Models (LLMs) are. If not, feel free to look into the article below.
In short, language models are statistical models that predict the next word in a sequence. Large language models are deep learning models trained on billions of words. The training data is scraped from multiple websites like Reddit, StackOverflow, Wikipedia, Books, ArXiv, Github, etc.
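The "statistical model that predicts the next word" idea can be sketched with a toy bigram model over a tiny hypothetical corpus. Real LLMs replace the counting with a neural network trained on billions of words, but the objective is the same:

```python
from collections import Counter, defaultdict

# A tiny hypothetical corpus standing in for web-scale training data.
corpus = "roses are red and violets are blue and roses are sweet".split()

# Count how often each word follows each other word (bigram statistics).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent next word seen after `word`, or None."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("roses"))  # "are" — the most common continuation in the corpus
```

A neural LLM does the same thing probabilistically over a whole vocabulary, conditioning on the entire preceding context rather than a single word.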
We can look at the above image to get an idea of the size of the dataset and the number of parameters. The pretraining of an LLM is computationally expensive, as it requires massive hardware and a vast dataset. At the end of pretraining, we obtain an LLM that can predict the next word in a sentence when prompted. For example, if we prompt it with "Roses are red and", it might respond with "Violets are blue." The below image depicts what GPT-3 can do at the end of pretraining:
We can see that the model is trying to complete the sentence rather than answering it. But we want an answer, not just a continuation. What could be the next step to achieve that? Let us see in the next section.
Supervised Finetuning of LLM
So, how do we make the LLM answer the question rather than predict the next word? Supervised Finetuning of the model would help us solve this problem. We can tell the model the desired response for a given prompt and fine-tune it. For this, we can create a dataset of multiple types of questions to ask a conversational model. Human labelers can provide the appropriate responses to make the model understand the expected output. This dataset consisting of pairs of prompts and responses is called Demonstration Data. Now, let us see a sample dataset of prompts and their responses in the demonstration data.
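A minimal sketch of what demonstration data can look like and how a prompt-response pair might be joined into a single training sequence. The field names and the end-of-text token are hypothetical; real SFT pipelines also typically mask the loss so it is computed only on the response tokens:

```python
# A hypothetical miniature demonstration dataset: prompts paired with
# responses written by human labelers.
demonstration_data = [
    {"prompt": "What is the capital of France?",
     "response": "The capital of France is Paris."},
    {"prompt": "Write a one-line greeting.",
     "response": "Hello there, nice to meet you!"},
]

def format_example(example, eos_token="<|endoftext|>"):
    """Join a prompt and its response into one training sequence.

    During supervised finetuning, the model is trained to produce the
    response portion, so it learns to answer rather than to continue
    the prompt.
    """
    return f"{example['prompt']}\n{example['response']}{eos_token}"

for ex in demonstration_data:
    print(format_example(ex))
```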
Reinforcement Learning from Human Feedback (RLHF)
Now, we are going to learn about RLHF. Before understanding RLHF, let us first see the benefits of using RLHF.
After supervised finetuning, our model should give us appropriate responses for the given prompts, right? Unfortunately, no! The model might still not answer every question we ask it properly. It might be unable to evaluate which response is good and which is not. It could also overfit the demonstration data. Let us see what can happen if it overfits. While writing this article, I asked Bard this:
I did not give it any link, article, or sentence to summarize. But it summarized something anyway and gave it to me, which was unexpected.
Another problem that might arise is toxicity. Even if an answer is factually right, it might not be right ethically or morally. For example, look at the image below, which you might have seen before. When asked for websites to download movies, the model first responds that this is not ethical, but in the next prompt, we can easily manipulate it, as shown.
Ok, now go ahead to ChatGPT and try the same example. Did it give you the same result?
Why are we not getting the same answer? Did they retrain the entire network? Probably not! More likely, there was a small fine-tuning pass with RLHF. You can refer to this beautiful gist for more reasons.
The first step in RLHF is to train a reward model. The model should take a prompt and its response as input and output a scalar value indicating how good the response is. For the machine to learn what a good response is, could we simply ask annotators to score each response with a reward? The problem is that different annotators assign scores inconsistently, introducing bias, so the model might not learn to reward responses reliably. Instead, the annotators rank the responses from the model, which reduces the bias in the annotations to a great extent. The image below shows a chosen response and a rejected response for a given prompt from Anthropic's hh-rlhf dataset.
From this data, the model tries to distinguish between a good and bad response.
Finetuning LLM with Reward Model Using RL
Now, we finetune the LLM with Proximal Policy Optimization (PPO). In this approach, we obtain a reward for the response generated by the current iteration of the fine-tuned model, and we compare that model against the frozen initial language model so that it does not deviate too far from its original behavior while still generating neat, clean, and readable output. The KL-divergence between the two models is used as a penalty when finetuning the LLM.
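The combined objective can be sketched as the reward model's score minus a KL penalty, estimated per token from the log-probabilities the two models assign to the generated response. The coefficient `beta` and the numeric values are hypothetical; real PPO implementations wrap this in a full training loop:

```python
def kl_penalized_reward(reward, logprobs_policy, logprobs_ref, beta=0.1):
    """Reward minus a per-sample KL estimate between policy and reference.

    `logprobs_policy` / `logprobs_ref` are the log-probabilities that the
    current fine-tuned model and the frozen initial model assign to each
    token of the generated response. If the policy assigns much higher
    probability to its tokens than the reference does, the KL term grows
    and the effective reward shrinks, keeping the policy close to the
    initial model.
    """
    kl = sum(p - r for p, r in zip(logprobs_policy, logprobs_ref))
    return reward - beta * kl

# A response the tuned model likes far more than the reference gets penalized:
# KL ≈ 0.9 + 1.3 = 2.2, so the effective reward is 1.0 - 0.1 * 2.2 ≈ 0.78.
print(kl_penalized_reward(1.0, [-0.1, -0.2], [-1.0, -1.5]))
```

PPO then updates the policy to maximize this penalized reward, which is what keeps the fine-tuned model fluent while steering it toward responses the reward model prefers.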
The models are evaluated at the end of each step, across different numbers of parameters. You can see the methods and their respective scores in the images below:
We can compare the performance of the LLMs at different stages with respect to different model sizes in the above figure. As you can see, there is a significant increase in the results after each training phase.
We can also replace the human feedback in RLHF with feedback from another AI model, an approach known as Reinforcement Learning from AI Feedback (RLAIF). This significantly reduces the cost of labeling and has the potential to perform better than RLHF. Let's discuss that in the next article.
In this article, we saw how conversational LLMs like ChatGPT are trained. We saw the three phases of training ChatGPT and how reinforcement learning from human feedback has helped the model improve its performance. We also understood the importance of each step, without which the LLM would be inaccurate.
Hope you enjoyed reading it. Feel free to leave comments below in case of any query/feedback. Happy Learning 🙂