The open-source Chinese medical model HuatuoGPT is here, and real doctors rate it above ChatGPT in blind tests

Hayo News
May 31st, 2023

Professor Wang Benyou's team at the Chinese University of Hong Kong (Shenzhen) and the Shenzhen Institute of Big Data has trained and open-sourced a new medical large language model called "HuatuoGPT". The aim is to give the language model a doctor-like ability to diagnose and to provide useful information.

Trained on both doctors' replies and ChatGPT answers, the model is intended to provide accurate and substantive consultations: it behaves like a doctor, offering diagnoses and useful suggestions, and so better serves people's health.

Large language models (LLMs) have broad application prospects in the medical field. While language models like ChatGPT can generate detailed, fluent, and logically clear responses, they lack professionalism and an accurate understanding of patient input when responding to patients' descriptions of symptoms. Their responses often list multiple possibilities and take the form of high-level recommendations without engaging with the patient's specific context, which limits their practical help in individual cases.

Physician-patient interaction data reflects the complexities of medical scenarios more accurately and contains accurate diagnostic recommendations. However, due to time constraints, physicians' responses are often too short to convey sufficient information and may even appear incoherent at times. A model trained solely on this data can struggle to cope with varied instructions or conversations, and its responses can be short, unclear, and sometimes confusing to patients.

To solve this problem, Professor Wang Benyou's team, with the support of the Chinese University of Hong Kong (Shenzhen) and the Shenzhen Institute of Big Data, used instruction fine-tuning and reinforcement learning to combine ChatGPT output with doctors' responses and develop a new medical large language model, HuatuoGPT. By combining "distilled data" generated by ChatGPT with replies from real-world doctors, HuatuoGPT aims to diagnose and provide useful information as a doctor would. At the same time, it aims to keep the dialogue smooth and rich in content.

In short, HuatuoGPT on the one hand builds on the strengths of ChatGPT, extending them to the medical field and improving the ability to understand and express real-world scenarios. On the other hand, it draws on doctors' experience and professional knowledge to better address the practical challenges of applying language models in medicine and to promote the development of medical intelligence.

Introduction to HuatuoGPT

Mixed dataset fine-tuning

HuatuoGPT is fine-tuned on four different datasets:

  • Distilled Instructions from ChatGPT: This dataset extracts medical instructions from ChatGPT and adds department and role information to produce a qualified instruction dataset.
  • Real-world Instructions from Doctors: This dataset is derived from questions and answers between real doctors and patients. A model polishes the doctors' responses to improve their readability.
  • Distilled Conversations from ChatGPT: This dataset has two ChatGPT models imitate a doctor-patient dialogue, with both sharing the same dialogue background.
  • Real-world Conversations with Doctors: This dataset is derived from conversations with real doctors, with the doctors' responses polished by a model.

Together, these datasets give HuatuoGPT a unified language style, a doctor's diagnostic ability, and the ability to follow instructions.
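To make the idea of mixing four sources concrete, here is a minimal sketch of how such data could be merged into one fine-tuning set. The `Sample` record, the `<Patient>`/`<Doctor>` role tags, and the shuffle-based mixing are all assumptions made for illustration; the article does not describe the team's actual data schema, prompt template, or mixing ratios.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical record: one instruction pair or one multi-turn conversation."""
    source: str                    # which of the four datasets the sample came from
    turns: list[tuple[str, str]]   # (role, text) pairs, role is "patient" or "doctor"


def to_prompt(sample: Sample) -> str:
    """Flatten a sample into one training string (assumed role-tag template)."""
    tag = {"patient": "<Patient>", "doctor": "<Doctor>"}
    return "\n".join(f"{tag[role]}: {text}" for role, text in sample.turns)


def mix(datasets: dict[str, list[Sample]], seed: int = 0) -> list[Sample]:
    """Concatenate every source and shuffle so each batch sees mixed data."""
    merged = [s for samples in datasets.values() for s in samples]
    random.Random(seed).shuffle(merged)
    return merged
```

Because single-turn instructions and multi-turn conversations share one record type here, the same `to_prompt` function serves all four sources, which is one simple way to give the model a consistent input format.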

Reinforcement Learning with AI Feedback

To improve the quality of HuatuoGPT's output, the team adopted reinforcement learning from AI feedback (RLAIF). ChatGPT scores the content the model generates, taking its user-friendliness into account, and uses the doctor's answer as a reference so that the quality of the doctor's reply is also considered. The model is then trained with the PPO algorithm to adjust its generation preferences so that they align with both doctors and users, making its output richer, more detailed, and more correct.
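The scoring step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's implementation: the prompt wording, the 1 to 10 scale, the `judge` callable (a stand-in for a call to ChatGPT), and the mean-centering of rewards are all hypothetical choices. The resulting rewards would then be passed to a PPO trainer, which is not shown.

```python
from typing import Callable


def build_judge_prompt(question: str, doctor_reply: str, candidate: str) -> str:
    """Ask the judge to rate a candidate, using the doctor's reply as a reference."""
    return (
        "Rate the candidate answer from 1 to 10 for helpfulness to the patient "
        "and agreement with the reference reply.\n"
        f"Question: {question}\n"
        f"Reference reply: {doctor_reply}\n"
        f"Candidate: {candidate}\n"
        "Score:"
    )


def ai_feedback_rewards(
    question: str,
    doctor_reply: str,
    candidates: list[str],
    judge: Callable[[str], float],  # hypothetical: wraps the scoring model
) -> list[float]:
    """Score each sampled response and center the scores around their mean."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    scores = [judge(build_judge_prompt(question, doctor_reply, c)) for c in candidates]
    mean = sum(scores) / len(scores)
    # Assumed design choice: centered rewards push the policy toward
    # above-average responses and away from below-average ones.
    return [s - mean for s in scores]
```

Keeping the judge behind a plain callable means the same function can be exercised offline with a stub before any real scoring model is attached.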

Experimental results

To evaluate HuatuoGPT, the team cross-checked automatic evaluation against manual evaluation, in both a single-round question-answering scenario and a multi-round interactive diagnosis scenario.

Figure: Automatic evaluation results of multiple rounds of diagnostic scenarios

For the single-round question-answering scenario, the team collected 100 questions spanning intents in 10 medical domains and used GPT-4 for automatic evaluation. For each question, two models generated responses, and GPT-4 analyzed and scored each model's response. The results show that HuatuoGPT performs significantly better than the open-source Chinese medical models based on LLaMA and ChatGLM, and even exceeds GPT-3.5-turbo. This advantage is attributed to training on both ChatGPT-distilled and real-world data, and to optimization with mixed feedback from ChatGPT and professional doctors.
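A pairwise comparison of this kind is usually summarized as win, tie, and loss rates. The sketch below assumes each question yields one numeric judge score per model; the function name and the score lists in the usage note are hypothetical and are not results from the article.

```python
def pairwise_summary(scores_a: list[float], scores_b: list[float]) -> dict[str, float]:
    """Summarize per-question judge scores as win/tie/loss rates for model A vs. B."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("both models need one score per question")
    n = len(scores_a)
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    ties = sum(a == b for a, b in zip(scores_a, scores_b))
    return {"win": wins / n, "tie": ties / n, "loss": (n - wins - ties) / n}
```

For example, with made-up scores `[8, 7, 6, 9]` for model A and `[7, 7, 8, 5]` for model B, A wins two questions, ties one, and loses one.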

For the multi-round consultation scenario, the team collected 100 multi-round conversations covering 20 departments. The results show that HuatuoGPT outperforms GPT-3.5-turbo in most departments and, overall, outperforms the current open-source Chinese medical models.

For manual evaluation, the team reused the samples from the automatic evaluation and invited professional doctors to assess the models' outputs. In both the single-round and multi-round settings, the manual results are consistent with the automatic results, which supports the reliability of the evaluation.

Table: Manual evaluation results for single-round question-answering scenarios

Table: Manual evaluation results of multiple rounds of consultation scenarios

In addition to the HuatuoGPT model, the team also released the Huatuo-26M medical question-answering dataset, which contains 26 million medical question-answer pairs, all open-sourced on HuggingFace. A cleaned version of the data is available on request by email to changmiaowang@cuhk.edu.cn; the request must state your institution and promise that the data will be used only for scientific research.

In addition, HuatuoGPT not only exceeds GPT-3.5-turbo (ChatGPT), ChatGLM, and existing medical GPT models, but is also far better than fully fine-tuned medium-sized T5 and GPT models. This comparison covers three publicly available medical question-answering datasets, including Huatuo-26M.

It is worth mentioning that Huatuo is also the name of the medical GPT from the SCIR Laboratory of Harbin Institute of Technology, which has made great contributions to the open-source community. Because of the duplicate name, that laboratory has renamed its model Bencao (BenTsao).

Reference links:
Paper address: https://arxiv.org/pdf/2305.15075.pdf
Github address: https://github.com/FreedomIntelligence/HuatuoGPT
Demo address: https://www.huatuogpt.cn/

