A Tsinghua alumnus contributed! Google releases the first generalist medical model, near or above SOTA on 14 tasks
The world's first large-scale generalist medical model has been officially released:
Med-PaLM M, a multimodal generative model jointly created by Google Research and DeepMind, understands clinical language, imaging, and genomics.
Med-PaLM M approaches or exceeds the existing state of the art on all 14 test tasks, using the same set of model weights for every task.
Moreover, on 246 real chest X-rays, clinicians judged the reports generated by Med-PaLM M to be preferable to those written by professional radiologists in up to 40.50% of cases, suggesting that Med-PaLM M is not just a result "on paper" and that clinical use may be just around the corner.
In this regard, Google also gave its own evaluation:
This is a milestone in the history of general medical artificial intelligence.
So, what exactly is Med-PaLM M?
The world's first general medical model is here
Before getting into Med-PaLM M itself, let's briefly introduce MultiMedBench, a multimodal medical benchmark built by Google.
Google said that before MultiMedBench, no comparably comprehensive multimodal medical benchmark existed.
The benchmark consists of 12 open-source datasets and 14 individual tasks to measure the ability of general biomedical AI to perform various clinical tasks.
The 12 datasets cover six biomedical data modalities (text, radiology (CT, MRI, and X-ray), pathology, dermatology, mammography, and genomics), and the 14 tasks fall into five types (question answering, report generation and summarization, visual question answering, medical image classification, and genomic variant calling).
Med-PaLM M is fine-tuned on top of PaLM-E
As the "M" in its name suggests, Med-PaLM M is multimodal. Compared with Google's earlier large medical models such as Med-PaLM and Med-PaLM-2, it is a medical AI aimed at general practice, a generalist: faced with a medical problem, it can read medical images directly and also understand genomics.
Its base architecture is PaLM-E, a multimodal language model, which uses a pretrained ViT model as the visual encoder. Three combinations were implemented:
- PaLM 8B + ViT 4B (PaLM-E 12B)
- PaLM 62B + ViT 22B (PaLM-E 84B)
- PaLM 540B + ViT 22B (PaLM-E 562B)
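The naming of the three variants follows from simple addition of the language model and vision encoder sizes. The sketch below only restates the figures listed above; the dictionary layout and the helper names `total_params_b` and `vision_share` are illustrative assumptions, not anything from Google's code.

```python
# Sizes in billions of parameters, taken from the three combinations listed above.
# Layout and helper names are hypothetical, for illustration only.
VARIANTS = {
    "PaLM-E 12B": (8, 4),      # (PaLM language model, ViT vision encoder)
    "PaLM-E 84B": (62, 22),
    "PaLM-E 562B": (540, 22),
}


def total_params_b(name: str) -> int:
    """Total size in billions: language model plus vision encoder."""
    language_model, vision_encoder = VARIANTS[name]
    return language_model + vision_encoder


def vision_share(name: str) -> float:
    """Fraction of the total parameters that belongs to the vision encoder."""
    language_model, vision_encoder = VARIANTS[name]
    return vision_encoder / (language_model + vision_encoder)


if __name__ == "__main__":
    for name in VARIANTS:
        print(f"{name}: {total_params_b(name)}B total, "
              f"vision share {vision_share(name):.1%}")
```

Note that the 84B and 562B variants share the same 22B vision encoder, so the vision share falls from roughly 26% to roughly 4% as the language model grows.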
Med-PaLM M was born by fine-tuning the PaLM-E model on MultiMedBench to align it with the biomedical domain. Here are some implementation details:
(1) In terms of dataset and preprocessing, all images in MultiMedBench are resized to 224×224×3, while padding is used as needed to preserve the original aspect ratio.
(2) Since Google's goal is to train a general-purpose biomedical AI model, a single model architecture and a single set of model parameters are used to perform multiple tasks with multimodal inputs. To this end, they gave Med-PaLM M task-specific instructions together with a plain-text one-shot example.
In the chest X-ray interpretation and skin lesion classification tasks, for instance, the instructions read much like hand-written prompts, starting with "You're an awesome radiology assistant."
(3) During training, the authors fine-tuned PaLM-E end-to-end. In multimodal tasks, image tokens are interleaved with text tokens to form the multimodal context input to the PaLM-E model. For all fine-tuning tasks, the multimodal context input contains at most one image, yet Med-PaLM M is able to handle inputs with multiple images during inference.
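Points (1) and (2) can be illustrated with a minimal sketch. It assumes nearest-neighbour resampling, zero padding centred on the canvas, a hypothetical `<image>` marker for the position where image tokens would be interleaved, and an invented prompt layout; the article does not specify any of these, and the function names are illustrative only.

```python
import numpy as np

TARGET = 224             # side length stated in the article
IMAGE_SLOT = "<image>"   # hypothetical marker where image tokens would be spliced in


def resize_with_padding(image: np.ndarray, target: int = TARGET,
                        pad_value: int = 0) -> np.ndarray:
    """Scale an H x W x 3 image so its longer side equals `target`,
    then pad the shorter side so the aspect ratio is preserved."""
    height, width = image.shape[:2]
    scale = target / max(height, width)
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))

    # Nearest-neighbour resampling (an assumption): map output pixels to source pixels.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), height - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), width - 1)
    resized = image[rows[:, None], cols[None, :]]

    # Centre the resized image on a square canvas filled with the pad value.
    canvas = np.full((target, target, 3), pad_value, dtype=image.dtype)
    top = (target - new_h) // 2
    left = (target - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas


def build_prompt(instruction: str, example: tuple[str, str], question: str) -> str:
    """Combine a task instruction, one plain-text example, and the actual query.
    Only the query carries an image slot, so the context holds at most one image."""
    example_q, example_a = example
    return (
        f"Instructions: {instruction}\n"
        f"Q: {example_q}\nA: {example_a}\n"
        f"Given {IMAGE_SLOT}. Q: {question}\nA:"
    )


if __name__ == "__main__":
    xray = np.full((100, 200, 3), 255, dtype=np.uint8)  # synthetic stand-in image
    print(resize_with_padding(xray).shape)
    print(build_prompt("You're an awesome radiology assistant.",
                       ("Is there a pleural effusion?", "No."),
                       "Describe the findings in this chest X-ray."))
```

Padding rather than stretching keeps anatomical proportions intact, which is presumably why the aspect ratio is preserved; a production pipeline would likely use a higher-quality interpolation than the one assumed here.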
Near or above SOTA on 14 tasks, with reports preferred over radiologists' in about 40% of cases
In the performance evaluation, the authors mainly tested Med-PaLM M's generalist ability, its emergent capabilities, and the quality of its radiology reports (compared with those of real radiologists).
The results show that:
(1) Compared with specialist SOTA models and with a generalist model that has not been fine-tuned on the biomedical domain (PaLM-E 84B), Med-PaLM M performs strongly across all task, dataset and metric combinations on MultiMedBench (14 tasks in total), coming close to or exceeding SOTA in nearly every case.
Note that this result is achieved using the same set of model weights without any task-specific customization.
(2) In the scaling experiments, the three sizes of Med-PaLM M behave differently across tasks. Roughly speaking, for language-only tasks and multimodal tasks that require adapting the language model, the larger the model, the better; but for image classification and chest X-ray report generation, the 84B model outperforms the 562B model.
(3) Zero-shot chain-of-thought reasoning emerges. Med-PaLM M can detect tuberculosis from chest X-ray images without having been trained for that task, and its accuracy is not far behind SOTA results specially optimized for this type of dataset.
However, the reports it produced still contained specific errors, indicating that shortcomings remain.
(4) In the radiology report generation test, clinicians preferred the reports from the 84B-parameter Med-PaLM M over those of radiologists in 40.50% of cases on average, while the figures for the 12B and 562B models were 34.05% and 32.00%, respectively.
In addition, the omission and error rate tests showed that the Med-PaLM M 12B and 84B models had the lowest average omission rate per report, at 0.12, followed by the 562B model at 0.13. This result is comparable to the baseline reported for human radiologists on MIMIC-CXR.
How soon can it be put to practical use?
As the first large-scale general medical model, how soon Med-PaLM M can be put into practical use must also be a question of concern to everyone.
Although it is "self-proclaimed" as a milestone (mainly because it approaches or exceeds SOTA in various biomedical tasks with a set of model weights), Google also pointed out that there are still many limitations to be resolved.
For example, there is a lack of high-quality benchmarks. According to Google, this is a key bottleneck in the development of general biomedical artificial intelligence so far, because only high-quality benchmarks can greatly promote the development of related fields.
However, the current MultiMedBench still has problems, such as the limited size of individual datasets and the limited diversity of modalities and tasks (for example, it lacks transcriptomics and proteomics).
As another example, scaling multimodal AI models is also challenging.
In the language domain, scaling can significantly improve performance and bring out emergent capabilities. However, Google's initial experiments on Med-PaLM M show that this is not so straightforward for multimodal generalist models on biomedical tasks, because medical data is scarce.
About the authors
So far, Google has only published a paper on Med-PaLM M.
It has two co-first authors, one of whom is Tao Tu.
He graduated from Beijing Institute of Technology (2010), earned a master's degree at Tsinghua University, and received a Ph.D. from Columbia University in the United States, all in medical engineering. He has been working as a software engineer at Google for almost two years.