Using ChatGPT to "command" hundreds of models: HuggingGPT lets specialist models do specialist work
This time, ChatGPT becomes the coordinator of hundreds of models.
Over the past few months, ChatGPT and GPT-4 have taken off one after the other, showing the extraordinary capabilities of large language models (LLMs) in language understanding, generation, interaction, and reasoning. This has attracted great attention from academia and industry, and it has also shown the potential of LLMs for building artificial general intelligence (AGI) systems.
To achieve AGI, LLMs still face many challenges, including:
- Constrained by text-only input and output, current LLMs cannot process complex information such as vision and speech;
- In real-world scenarios, complex tasks usually consist of multiple subtasks and therefore require scheduling and coordinating multiple models, which is also beyond the capabilities of language models;
- On some challenging tasks, LLMs achieve excellent results in zero-shot or few-shot settings, but they remain weaker than specialized fine-tuned models.
The most important point is that achieving AGI requires solving complex AI tasks across different domains and modalities, while most existing AI models are built for specific tasks in specific domains.
Based on this, researchers from Zhejiang University and Microsoft Research Asia recently proposed a new approach: let the LLM act as a controller that manages existing AI models to solve complex AI tasks, with language serving as a general interface. The system proposed in this study, HuggingGPT, uses an LLM to connect the various AI models available in machine learning communities (such as HuggingFace) to solve complex AI tasks.
Paper address: https://arxiv.org/abs/2303.17580
Project address: https://github.com/microsoft/JARVIS
Specifically, when HuggingGPT receives a user request, it uses ChatGPT for task planning, selects models according to the model descriptions available on HuggingFace, executes each subtask with the selected AI model, and summarizes a response based on the execution results. With ChatGPT's powerful language capabilities and HuggingFace's rich collection of AI models, HuggingGPT can complete complex AI tasks covering different modalities and domains, and it has achieved impressive results on challenging tasks in language, vision, and speech. HuggingGPT opens up a new path towards artificial general intelligence.
Let's first look at examples of HuggingGPT completing tasks, including document question answering, image transformation, video generation, and audio generation:
It can also generate complex, detailed text descriptions of images:
To handle complex AI tasks, LLMs need to coordinate with external models and exploit their capabilities. The crux of the problem is therefore how to choose suitable middleware to bridge LLMs and AI models.
The study notes that each AI model can be described in language by summarizing its capabilities. It therefore proposes the concept that "language is a common interface for LLMs to connect AI models." By incorporating textual descriptions of AI models into prompts, the LLM can act as the "brain" that manages AI models, including planning, scheduling, and coordinating them.
Another challenge is that solving a large number of AI tasks requires collecting a large number of high-quality model descriptions. Here, the study notes that several public ML communities already provide a wide range of models for specific AI tasks, together with well-defined descriptions. The researchers therefore decided to connect an LLM (such as ChatGPT) with public ML communities (such as GitHub, HuggingFace, and Azure) to solve complex AI tasks through a language-based interface.
So far, HuggingGPT has integrated hundreds of HuggingFace models around ChatGPT, covering 24 tasks such as text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video. Experimental results demonstrate HuggingGPT's strong ability to handle multimodal information and complex AI tasks. HuggingGPT will also continue to add task-specific AI models, so its AI capabilities can keep growing and scaling.
Introduction to HuggingGPT
HuggingGPT is a collaborative system where a large language model (LLM) acts as a controller and numerous expert models act as collaborative executors. Its workflow is divided into four stages: task planning, model selection, task execution and response generation.
- Task planning: An LLM such as ChatGPT first analyzes the user request, decomposes it into tasks, and plans the task order and dependencies based on its knowledge;
- Model selection: The LLM assigns the parsed tasks to expert models;
- Task execution: The expert models execute the assigned tasks on inference endpoints and return the execution information and inference results to the LLM;
- Response generation: The LLM summarizes the execution logs and inference results and returns the summary to the user.
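The four stages above can be sketched as a small pipeline. This is only an illustrative outline, not the project's actual code: every function name, the `registry` and `experts` dictionaries, and the toy captioning model are hypothetical, and the two LLM-driven stages are replaced by hard-coded stubs.

```python
from typing import Callable, Dict, List

# All names below are hypothetical stand-ins, not HuggingGPT's real code.
Expert = Callable[[dict], str]


def plan_tasks(request: str) -> List[dict]:
    """Stage 1: a real controller would prompt the LLM to decompose the request."""
    return [{"id": 0, "task": "image-to-text", "args": {"image": "photo.jpg"}}]


def select_model(task: dict, registry: Dict[str, str]) -> str:
    """Stage 2: map the task type to the name of an expert model."""
    return registry[task["task"]]


def execute_task(task: dict, expert: Expert) -> str:
    """Stage 3: run inference with the chosen expert model."""
    return expert(task["args"])


def generate_response(request: str, log: List[dict]) -> str:
    """Stage 4: a real controller would ask the LLM to summarize the log."""
    lines = [f"{e['task']} via {e['model']}: {e['result']}" for e in log]
    return f"Request: {request}\n" + "\n".join(lines)


def run_pipeline(request: str, registry: Dict[str, str],
                 experts: Dict[str, Expert]) -> str:
    log = []
    for task in plan_tasks(request):
        model = select_model(task, registry)
        result = execute_task(task, experts[model])
        log.append({"task": task["task"], "model": model, "result": result})
    return generate_response(request, log)


# Toy registry and expert model, used only to make the sketch runnable.
registry = {"image-to-text": "toy-captioner"}
experts = {"toy-captioner": lambda args: f"a caption for {args['image']}"}
answer = run_pipeline("Describe photo.jpg", registry, experts)
print(answer)
```

The point of the sketch is the separation of concerns: the controller only passes structured task records and text between stages, while the expert models do the actual inference.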
Next, let's take a look at how these four stages are implemented.
In the first stage of HuggingGPT, the large language model takes the user request and breaks it down into a series of structured tasks. Complex requests often involve multiple tasks, and the large language model needs to determine the dependencies and execution order of these tasks. To help the large language model plan tasks effectively, HuggingGPT uses specification-based instructions and demonstration-based parsing in its prompt design.
By injecting several demonstrations into the prompt, HuggingGPT helps the large language model better understand the intent and criteria of task planning. The tasks currently supported by HuggingGPT are listed in Table 1, Table 2, Table 3, and Table 4. As the tables show, HuggingGPT covers NLP, CV, speech, video, and other tasks.
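A planning prompt that combines a specification with a few demonstrations might be assembled roughly as follows. This is a minimal sketch under stated assumptions: the task fields (`task`, `id`, `dep`, `args`), the three task types, the demonstration, and both helper functions are illustrative and are not taken from the article's Table 5.

```python
import json
from typing import List

# Hypothetical subset of supported task types.
SUPPORTED_TASKS = ["image-to-text", "object-detection", "text-to-speech"]

# Specification-based instruction: tells the LLM which output format to use.
SPECIFICATION = (
    "Parse the user request into a JSON list of tasks. Each task must look like "
    '{"task": <task type>, "id": <int>, "dep": [<ids of prerequisite tasks>], '
    '"args": {<input name>: <value>}}. '
    "Allowed task types: " + ", ".join(SUPPORTED_TASKS) + "."
)

# Demonstration-based parsing: worked examples injected into the prompt.
DEMONSTRATIONS = [
    (
        "Read out loud what is in cat.jpg",
        [
            {"task": "image-to-text", "id": 0, "dep": [],
             "args": {"image": "cat.jpg"}},
            {"task": "text-to-speech", "id": 1, "dep": [0],
             "args": {"text": "<output of task 0>"}},
        ],
    ),
]


def build_planning_prompt(request: str) -> str:
    """Concatenate the specification, the demonstrations, and the new request."""
    parts = [SPECIFICATION]
    for demo_request, demo_plan in DEMONSTRATIONS:
        parts.append(f"User: {demo_request}\nPlan: {json.dumps(demo_plan)}")
    parts.append(f"User: {request}\nPlan:")
    return "\n\n".join(parts)


def parse_plan(llm_output: str) -> List[dict]:
    """Parse the LLM reply and reject task types outside the supported list."""
    plan = json.loads(llm_output)
    for task in plan:
        if task["task"] not in SUPPORTED_TASKS:
            raise ValueError(f"unsupported task type: {task['task']}")
    return plan


prompt = build_planning_prompt("Detect objects in dog.jpg")
```

Validating the parsed plan against the list of supported tasks is one simple way to catch replies that drift from the specification before any model is invoked.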
After parsing the task list, HuggingGPT selects an appropriate model for each task in the list. To do this, the study first obtains descriptions of the expert models from the HuggingFace Hub (a model description roughly includes information such as the model's function, architecture, supported languages and domains, and license), and then dynamically chooses a model for each task through an in-context task-model assignment mechanism.
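In-context model selection can be illustrated by placing the descriptions of candidate models in a prompt and letting the LLM name one of them. The sketch below assumes a small in-memory catalog instead of the real HuggingFace Hub; the model IDs, descriptions, `top_k` cutoff, and fallback rule are all hypothetical.

```python
from typing import List

# Hypothetical catalog; a real system would read descriptions from the Hub.
CATALOG = [
    {"id": "toy/detector-a", "task": "object-detection",
     "description": "General-purpose detector for everyday objects."},
    {"id": "toy/detector-b", "task": "object-detection",
     "description": "Detector specialized in traffic scenes."},
    {"id": "toy/detector-c", "task": "object-detection",
     "description": "Lightweight detector for mobile devices."},
    {"id": "toy/captioner", "task": "image-to-text",
     "description": "Generates one-sentence image captions."},
]


def candidates_for(task_type: str, catalog: List[dict], top_k: int = 2) -> List[dict]:
    """Keep only models for this task type and cap the list so the prompt stays short."""
    matches = [m for m in catalog if m["task"] == task_type]
    if not matches:
        raise LookupError(f"no model available for task type: {task_type}")
    return matches[:top_k]


def build_selection_prompt(task: dict, candidates: List[dict]) -> str:
    """Put the candidate descriptions in context and ask for one model ID."""
    listing = "\n".join(f"- {m['id']}: {m['description']}" for m in candidates)
    return (
        f"Task: {task['task']} with arguments {task['args']}\n"
        f"Candidate models:\n{listing}\n"
        "Reply with the ID of the most suitable model."
    )


def parse_choice(llm_output: str, candidates: List[dict]) -> str:
    """Accept the reply only if it names a candidate; otherwise use the first one."""
    choice = llm_output.strip()
    valid = {m["id"] for m in candidates}
    return choice if choice in valid else candidates[0]["id"]


task = {"task": "object-detection", "args": {"image": "street.jpg"}}
cands = candidates_for(task["task"], CATALOG)
selection_prompt = build_selection_prompt(task, cands)
```

Filtering by task type before building the prompt keeps the context small, and checking the reply against the candidate list guards against the LLM naming a model that was never offered.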
Once a task is assigned to a particular model, the next step is to execute the task, i.e., to run model inference. For speed and computational stability, HuggingGPT runs these models on hybrid inference endpoints. Taking the task arguments as input, the model computes the inference results and then returns them to the large language model.
After all tasks are executed, HuggingGPT enters the response generation phase. At this stage, HuggingGPT integrates all the information from the first three stages (task planning, model selection, and task execution) into a concise summary, including the list of planned tasks, model selection, and inference results. The most important of these is the inference result, which is the basis for HuggingGPT to make the final decision. These inference results appear in structured formats, such as bounding boxes with detection probabilities in object detection models, answer distributions in question answering models, etc.
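As a rough illustration of this stage, the structured records from the earlier stages can be serialized into a single prompt for the LLM to summarize. The record layout, the detection values, and the function name are assumptions made for the sketch, not the prompt used in the study.

```python
import json
from typing import List


def build_response_prompt(request: str, records: List[dict]) -> str:
    """Serialize the planned tasks, chosen models, and results for the LLM."""
    summary = json.dumps(records, indent=2)
    return (
        f"User request: {request}\n"
        f"Executed tasks, selected models, and inference results:\n{summary}\n"
        "Answer the user request directly, based on the inference results."
    )


# Hypothetical record for one object-detection task.
records = [
    {
        "task": "object-detection",
        "model": "toy/detector-a",
        "result": [
            {"label": "cat", "score": 0.97, "box": [12, 30, 180, 220]},
            {"label": "sofa", "score": 0.81, "box": [0, 90, 400, 300]},
        ],
    }
]
response_prompt = build_response_prompt("What is in cat.jpg?", records)
print(response_prompt)
```

Keeping the inference results in their structured form (labels, scores, and boxes) lets the LLM decide which details are worth mentioning in the final answer.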
The study used two variants of the GPT models, gpt-3.5-turbo and text-davinci-003, as the large language models; both are publicly accessible through the OpenAI API. Table 5 gives the detailed prompt design for the task planning, model selection, and response generation stages.
Example of a HuggingGPT conversation demo: in the demo, a user enters a request that may involve multiple tasks or multimodal resources. HuggingGPT then relies on the LLM to organize the cooperation of multiple expert models and generate a response for the user.
Figure 3 shows the workflow of HuggingGPT when there are resource dependencies between tasks. In this case, HuggingGPT parses the user's abstract request into concrete tasks, including pose detection, image description, and so on. In addition, HuggingGPT correctly identifies that Task 3 depends on Tasks 1 and 2, and it injects the inference results of Tasks 1 and 2 into the input arguments of Task 3 once those tasks are completed.
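The dependency handling described for Figure 3 can be sketched with a topological ordering of the tasks. The sketch assumes Python 3.9+ (for `graphlib`); the `<GENERATED>-` placeholder convention, the task names, the file names, and the fake executor are hypothetical and chosen only to mirror the three-task example.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# Hypothetical marker meaning "use the result of the task with this ID".
PLACEHOLDER = "<GENERATED>-"


def run_with_dependencies(tasks: List[dict],
                          run_task: Callable[[str, dict], str]) -> Dict[int, str]:
    """Run tasks in dependency order, injecting earlier results into later args."""
    by_id = {t["id"]: t for t in tasks}
    graph = {t["id"]: set(t["dep"]) for t in tasks}  # node -> prerequisites
    results: Dict[int, str] = {}
    for task_id in TopologicalSorter(graph).static_order():
        task = by_id[task_id]
        args = {}
        for name, value in task["args"].items():
            if isinstance(value, str) and value.startswith(PLACEHOLDER):
                # Replace the placeholder with the prerequisite's result.
                value = results[int(value[len(PLACEHOLDER):])]
            args[name] = value
        results[task_id] = run_task(task["task"], args)
    return results


def fake_run(task_name: str, args: dict) -> str:
    """Stand-in for real model inference: echoes the task name and its inputs."""
    return f"{task_name}:{','.join(str(args[k]) for k in sorted(args))}"


tasks = [
    {"id": 1, "task": "pose-detection", "dep": [], "args": {"image": "a.jpg"}},
    {"id": 2, "task": "image-to-text", "dep": [], "args": {"image": "b.jpg"}},
    {"id": 3, "task": "pose-text-to-image", "dep": [1, 2],
     "args": {"pose": "<GENERATED>-1", "text": "<GENERATED>-2"}},
]
results = run_with_dependencies(tasks, fake_run)
print(results[3])
```

Because the executor only runs a task after all of its prerequisites have results, the placeholders in Task 3 are always resolvable by the time it is reached; a cyclic plan would make `TopologicalSorter` raise `CycleError` instead of running anything out of order.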
Figure 4 demonstrates the conversational capabilities of HuggingGPT on audio and video modalities.
Figure 5 shows HuggingGPT integrating multiple user-provided resources to perform simple reasoning.
The study also tested HuggingGPT on multimodal tasks, as shown in the figure below. With large language models and expert models working together, HuggingGPT can handle multiple modalities such as language, image, audio, and video, and can solve tasks of various forms, including detection, generation, classification, and question answering.
In addition to the above simple tasks, HuggingGPT can also complete more complex tasks. Figure 8 demonstrates the ability of HuggingGPT to handle complex tasks in multi-turn dialogue scenarios.
Figure 9 shows that for a simple request to describe an image in as much detail as possible, HuggingGPT can expand the request into five related tasks: image captioning, image classification, object detection, segmentation, and visual question answering. HuggingGPT assigns an expert model to each task, and these models provide the LLM with image-related information from different aspects. Finally, the LLM integrates this information and produces a comprehensive, detailed description.
The release of this research also led netizens to remark that AGI seems about to burst out of the open-source community.
Others compared it to a company manager, commenting: "HuggingGPT is a bit like a real-world scenario. A company has a group of super engineers with outstanding abilities in various specialties, and now there is a manager who manages them. When someone has a need, the manager analyzes the requirements, distributes them to the corresponding engineers, and finally merges the results and returns them to the user."
Others praised HuggingGPT as a revolutionary system that uses the power of language to connect and manage existing AI models from different fields and modalities, paving the way for AGI.
Reference link: https://twitter.com/search?q=HuggingGPT&src=typed_query&f=top