RLHF for large models doesn't have to rely on humans. Google: AI feedback is just as effective
When it comes to the core methods for training today's large models, RLHF is an unavoidable topic.
RLHF, reinforcement learning from human feedback, is central to everything from ChatGPT to the open-source LLaMA.
But the "H" is a major bottleneck, because high-quality human feedback is hard to collect at scale.
Can the job be handed to AI instead? Some have tried, but whether AI feedback could truly replace human feedback remained an open question, until Google ran this study.
In a newly published arXiv paper, Google compared models trained with RLAIF against those trained with RLHF on text summarization.
RLAIF replaces the humans in RLHF with AI for generating feedback, so that large-model training is no longer limited by the availability of human labelers.
In human evaluations, raters showed little preference between answers generated by models trained with RLHF and those trained with RLAIF.
In some respects, RLAIF even outperformed RLHF.
Some AI engineers shared the paper with the comment that by the time GPT-5 arrives, human data labelers may no longer be needed.
Before introducing the detailed evaluation results, we might as well take a look at the workflow of RLAIF.
Generating feedback data with an LLM
As the name suggests, RLAIF works just like RLHF, except that the human labeler is replaced by AI.
So the focus naturally falls on how the feedback is generated.
The researchers first have an LLM choose between two candidate answers to produce the feedback.
To reduce the effect of randomness, each pair is judged multiple times, with the order of the two options swapped between queries to cancel position bias.
Chain-of-thought (CoT) prompting is also used to elicit better judgments.
In addition, to improve the LLM's self-consistency, the labeler does not simply pick one of the two answers outright; it assigns each answer a preference score, with the two scores summing to 1.
The prompt spells out the task and the two candidates, and the model's output is read off as a preference distribution over them.
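The labeling step described above can be sketched as follows. This is a minimal illustration, not the paper's exact prompt: `llm_logprobs` is a hypothetical helper standing in for the real model call, assumed to return the log-probabilities the LLM assigns to answering "1" or "2".

```python
import math

def preference_scores(llm_logprobs, context, summary_a, summary_b):
    """Score two candidate summaries with an AI labeler.

    llm_logprobs(prompt, options) is a hypothetical helper returning the
    model's log-probabilities for each option token ("1" or "2").
    Returns [P(A better), P(B better)], which sum to 1.
    """
    template = (
        "A good summary is concise and faithful to the text.\n"
        "Text: {context}\n"
        "Summary 1: {first}\n"
        "Summary 2: {second}\n"
        "Think step by step, then answer which summary is better: "
    )
    totals = [0.0, 0.0]
    # Query twice with the option order swapped to cancel position bias.
    for first, second, flipped in [(summary_a, summary_b, False),
                                   (summary_b, summary_a, True)]:
        prompt = template.format(context=context, first=first, second=second)
        lp1, lp2 = llm_logprobs(prompt, ["1", "2"])
        # Softmax over the two option log-probs -> scores summing to 1.
        p_first = math.exp(lp1) / (math.exp(lp1) + math.exp(lp2))
        p_a = (1.0 - p_first) if flipped else p_first
        totals[0] += p_a / 2
        totals[1] += (1.0 - p_a) / 2
    return totals
```

Note how a labeler that always favors whichever option is listed first ends up contributing 0.5/0.5 after averaging the two orderings, which is exactly the debiasing effect the swap is meant to achieve.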
With this data in hand, a reward model can be trained to predict the preference scores.
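A common way to fit a reward model to such soft labels (a sketch under assumptions, not the paper's exact objective) is to take the cross-entropy between the AI's preference distribution and the softmax over the reward model's scalar scores for the two answers:

```python
import math

def preference_loss(r_a, r_b, p_a):
    """Cross-entropy between the AI label distribution (p_a, 1 - p_a)
    and the softmax over the reward model's scores for answers A and B.

    r_a, r_b: scalar rewards the model assigns to answers A and B
    p_a: the AI labeler's preference score for answer A (in [0, 1])
    """
    q_a = math.exp(r_a) / (math.exp(r_a) + math.exp(r_b))
    return -(p_a * math.log(q_a) + (1.0 - p_a) * math.log(1.0 - q_a))

# A reward model that agrees with a confident label incurs a lower loss
# than one that disagrees.
low = preference_loss(2.0, 0.0, 0.9)
high = preference_loss(0.0, 2.0, 0.9)
assert low < high
```

Minimizing this loss pushes the score gap `r_a - r_b` to match the labeler's confidence, so the trained model can later score single answers during reinforcement learning.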
Then, using the trained reward model, the researchers run reinforcement learning on the target model.
Unlike the PPO (Proximal Policy Optimization) algorithm commonly used in other work, RLAIF uses a simpler and more effective modified version of the A2C (Advantage Actor-Critic) algorithm.
Of course, the AI-generated labels can also be used directly as the reward signal, skipping the reward model entirely.
In fact, the team found the direct approach works at least as well, but since querying the labeler throughout training is computationally expensive, they opted for the reward model.
At this point the large model's "coursework" is complete, but to "graduate" it still has to pass an "exam".
The "exam" covers three metrics:
AI Labeler Alignment: how accurate the AI's preferences are relative to human preferences
Pairwise Accuracy: how well the trained reward model matches a human preference dataset
Win Rate: which of two methods' outputs (e.g. RLAIF vs. RLHF) humans prefer in head-to-head comparisons
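All three metrics above reduce to simple counting over paired preference records. A minimal sketch, with assumed data shapes (preferences encoded as "a"/"b", reward-model scores as pairs):

```python
def ai_labeler_alignment(ai_prefs, human_prefs):
    """Fraction of examples where the AI label matches the human label."""
    return sum(a == h for a, h in zip(ai_prefs, human_prefs)) / len(ai_prefs)

def pairwise_accuracy(rm_scores, human_prefs):
    """Fraction of pairs where the reward model scores the human-preferred
    answer higher. rm_scores: list of (score_a, score_b) pairs;
    human_prefs: "a" or "b" for each pair."""
    correct = sum((s_a > s_b) == (p == "a")
                  for (s_a, s_b), p in zip(rm_scores, human_prefs))
    return correct / len(rm_scores)

def win_rate(judgments):
    """Fraction of head-to-head comparisons won by the candidate policy.
    judgments: 1 if the candidate's output was preferred, else 0."""
    return sum(judgments) / len(judgments)
```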
Only after passing these tests is the reinforcement learning considered complete.
So, how did the "students taught by AI" score?
Test results on par with RLHF
The research team recruited 1,200 people to rank answers produced by SFT (the supervised fine-tuning baseline), RLHF, RLAIF, and humans, from best to worst.
Taking SFT as the baseline, the win rates of RLHF and RLAIF both exceed 70%, meaning humans prefer the outputs of these two methods to SFT's by nearly three to one.
Although the performance of RLHF is slightly better than that of RLAIF, the gap between the two is not obvious.
With RLHF as the reference instead, RLAIF's win rate is 50%, indicating that humans have no preference between the two.
Interestingly, both RL-trained models far outperform answers written directly by humans.
Against human-written answers, RLAIF's win rate reaches 79% and RLHF's 80%, i.e. their outputs are preferred roughly four times as often.
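The "N times" phrasing above comes from converting a win rate into odds, i.e. the ratio of preferred to not-preferred. A quick sanity check of the arithmetic:

```python
def preference_odds(win_rate):
    """Convert a win rate into the ratio preferred : not preferred."""
    return win_rate / (1.0 - win_rate)

# A 79% win rate means the output is preferred about 3.8x as often,
# which the article rounds to "four times".
print(round(preference_odds(0.79), 2))  # 3.76
```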
In addition, on careful inspection of the outputs, the researchers found that the RLAIF-trained model hallucinated less often and made fewer logical and grammatical errors than the RLHF-trained model.
One More Thing
However, some netizens also spotted a catch with RLAIF:
Isn't the model used to generate the feedback itself trained with RLHF?
On the other hand, during RLHF, the possibility that some human labelers cut corners by using AI cannot be ruled out either.
Perhaps this "you are in me and I am in you" entanglement is why the two methods test so close?
Paper address: https://www.arxiv.org/abs/2309.00267