Comparable to ChatGPT! Chinese researchers at Meta propose "Shepherd": a fine-tuned 7-billion-parameter LLaMA model that critiques model generations and offers suggestions

Hayo News
August 15th, 2023
Content generated by large models also needs refinement. Shepherd, the model proposed by Meta, can critique model generations and offer suggestions for improvement.

Recently, Meta AI proposed Shepherd, a language model built specifically to critique model responses and suggest improvements.

To this end, the researchers curated a high-quality feedback dataset from community feedback and human annotation, and used it to fine-tune the roughly 7-billion-parameter model.

Paper address: https://arxiv.org/pdf/2308.04592.pdf

Under GPT-4 evaluation, Shepherd's critiques achieve an average win rate of 53-87% against competing alternatives.

In human evaluation, Shepherd strictly outperforms the other models and, on average, comes close to ChatGPT.

"Shepherd" Shepherd

Large models have become increasingly sophisticated, showing a remarkable ability to generate coherent, contextually relevant, and semantically meaningful text.

Despite these advances, large models still frequently make mistakes, producing unreliable and incoherent output.

Therefore, continually critiquing and refining model generations would be a very beneficial step toward more reliable language models.

In this study, Meta proposes Shepherd, a language model explicitly tuned to critique model-generated outputs.

When asked to critique a response, Shepherd can point out specific issues such as factual errors, logical errors, incoherence, and inconsistency, while also suggesting improvements.

More specifically, Shepherd can generate natural-language feedback that goes beyond overall judgments or generic recommendations: it draws on deep domain knowledge and provides actionable suggestions for improvement.

Shepherd Overall Framework

To fine-tune and evaluate Shepherd, the researchers created a high-quality feedback dataset combining two different sources:

(1) Community feedback, collected from online forums to gather more diverse interactions;

(2) Human-annotated feedback, collected from different types of tasks.

The paper gives examples of training data collected from Stack Exchange and from human annotation.

Shepherd model

The researchers trained Shepherd with LLaMA-7B as the base model, using AdamW as the optimizer with β1 = 0.9, β2 = 0.95, and a weight decay of 0.1.

Training used a learning rate of 1e-5, 2,000 warm-up steps, a batch size of 64, and a maximum sequence length of 2,048 tokens.
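
As a rough illustration, the sketch below sets up these reported hyperparameters in PyTorch with Hugging Face Transformers; the checkpoint path and the linear warm-up schedule are assumptions, not details from the paper.

```python
# Minimal sketch of the reported fine-tuning hyperparameters (not the authors' code).
# The checkpoint path is a placeholder and the linear warm-up schedule is an assumption.
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")  # placeholder path

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,             # learning rate 1e-5
    betas=(0.9, 0.95),   # beta1 = 0.9, beta2 = 0.95
    weight_decay=0.1,    # weight decay 0.1
)

total_steps = 3000       # checkpoints are kept every 50 steps, 3,000 steps in total
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,          # 2,000 warm-up steps
    num_training_steps=total_steps,
)

batch_size = 64          # batch size 64
max_seq_length = 2048    # maximum sequence length of 2,048 tokens
```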

All training data follows the same template, with "### {field name}" used to separate the different fields.
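
A minimal sketch of what such a template could look like; the field names used here (Context, Response, Feedback) are assumptions for illustration and may differ from the paper's exact fields.

```python
# Illustrative sketch of a "### {field name}"-style training template.
# The field names below are assumptions; the paper's exact fields may differ.
def format_example(context: str, response: str, feedback: str) -> str:
    """Join the fields of one training example with '### {field name}' separators."""
    return (
        f"### Context\n{context}\n\n"
        f"### Response\n{response}\n\n"
        f"### Feedback\n{feedback}"
    )

print(format_example(
    context="Which planet is closest to the sun?",
    response="The planet closest to the sun is Venus.",
    feedback="The response is factually incorrect: Mercury, not Venus, is the "
             "closest planet to the sun.",
))
```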

A checkpoint is saved every 50 steps over a total of 3,000 training steps.

On a held-out set of 20 examples, the researchers manually checked whether the generated feedback identified errors or offered constructive suggestions, and selected the 3 best checkpoints.

The final checkpoint was then chosen from these three using a GPT-4 evaluation protocol on the held-out example set.

Evaluation

To test Shepherd's ability to critique model generations, the researchers compared it with a range of state-of-the-art language models, including Alpaca-7B, SelFee-7B, and ChatGPT.

Both automatic evaluation, with GPT-4 as the judge, and human evaluation were performed.
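
For intuition, here is a hedged sketch of a GPT-4-as-judge pairwise comparison between two critiques; the prompt wording and decoding settings are assumptions, not the paper's exact evaluation protocol.

```python
# Hedged sketch of GPT-4-as-judge pairwise comparison between two critiques.
# The prompt wording and temperature are assumptions, not the paper's protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def compare_feedback(question: str, answer: str, feedback_a: str, feedback_b: str) -> str:
    """Ask GPT-4 which of two critiques of the same answer is better: 'A', 'B', or 'Tie'."""
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        f"Feedback A: {feedback_a}\n"
        f"Feedback B: {feedback_b}\n\n"
        "Which feedback more accurately identifies errors in the answer and gives "
        "more useful suggestions for improvement? Reply with 'A', 'B', or 'Tie'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```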

To broadly cover the NLP field, the researchers carefully selected 6 public datasets for evaluation:

  • Alpaca Farm
  • FairEval
  • Commonsense QA
  • OBQA
  • PIQA
  • Truthful QA

These 6 datasets cover a wide range of topics and reasoning skill sets, including commonsense reasoning, physical reasoning, mathematical reasoning, and more.

The researchers then sampled 50 instances from the validation/test sets of each dataset, for a total of 300 instances in the final evaluation set.
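
A small sketch of how such an evaluation set could be assembled; the loader function below is a hypothetical stand-in for however each dataset's validation/test split is actually read.

```python
# Sketch of assembling the evaluation set: 50 instances sampled per dataset,
# 6 datasets, 300 instances in total.
import random

def load_validation_or_test_split(name: str) -> list[dict]:
    """Hypothetical loader; replace with real dataset-reading code."""
    # Placeholder: return dummy instances so the sketch runs end to end.
    return [{"dataset": name, "id": i} for i in range(100)]

random.seed(0)  # any fixed seed, just to make this sketch reproducible

dataset_names = ["AlpacaFarm", "FairEval", "CommonsenseQA", "OBQA", "PIQA", "TruthfulQA"]

eval_set = []
for name in dataset_names:
    instances = load_validation_or_test_split(name)
    eval_set.extend(random.sample(instances, k=50))

assert len(eval_set) == 300  # 50 instances x 6 datasets
```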

The team first analyzed whether Shepherd could generate better feedback than competing models. The comparative results using GPT-4 and human evaluation are shown in Figure 2 and Figure 3 below, respectively.

In both evaluation settings, Shepherd significantly outperforms Alpaca and SelFee.

Note that both Shepherd and SelFee are fine-tuned LLaMA-7B models, but SelFee is fine-tuned on a dataset with 178K examples, while Shepherd is only fine-tuned on a dataset with 8K examples.

According to GPT-4 evaluation, Shepherd's performance is slightly higher than ChatGPT, while in human evaluation, Shepherd's performance is comparable to ChatGPT.

Overall, after training on a combination of datasets, Shepherd demonstrates impressive results, outperforming ChatGPT on several downstream tasks.

Careful examination of the impact of community feedback and human-annotated feedback data reveals that community data is more informative and diverse than human-annotated data, but biased towards informality.

These complementary characteristics enable Shepherd to provide feedback across different tasks.

At the same time, the researchers found that including high-quality human-annotated data for fine-tuning improved model performance.

The researchers then performed model-based evaluation (GPT-4) as well as human evaluation on the feedback generated by Shepherd and compared it with state-of-the-art baselines.

Shepherd's critiques are generally preferred over those of the other models.

For example, Alpaca tends to give positive feedback to all model responses, resulting in a large amount of incorrect feedback.

SelFee tends to give vague feedback that fails to pinpoint mistakes, and it often ignores the model's response or answers the question directly instead of critiquing the response.

ChatGPT is more stable across different evaluation settings and does a better job of providing correct judgment feedback.

About the Authors

Two of the paper's authors are introduced below.

Tianlu Wang

Tianlu Wang is a Research Scientist at Meta AI Research.

She received her Ph.D. in Computer Science from the University of Virginia, advised by Vicente Ordóñez Román. Before that, she received a bachelor's degree in computer science from Zhejiang University.

Ping Yu

Ping Yu is a FAIR Research Scientist.

He has a Ph.D. in Computing from the State University of New York at Buffalo and an MS in Computational Engineering from the University of Michigan.

References:

https://github.com/facebookresearch/Shepherd

https://huggingface.co/papers/2308.04592

Reprinted from 新智元 (author: 桃子).
