ChatGPT will choose the models by itself! HuggingGPT has been open-sourced, along with a new paper from Microsoft Research Asia and Zhejiang University
Hayo News
April 2nd, 2023
"Jarvis" has arrived! Microsoft Asia Research Institute and Zhejiang University launched HuggingGPT, a large-scale model collaboration system, which allows ChatGPT to coordinate HF community models and has a super ability to handle various multi-modal tasks.

The AI craze ignited by ChatGPT has spread to the financial world as well.

Recently, researchers at Bloomberg developed a GPT for the financial domain, BloombergGPT, with 50 billion parameters.

The emergence of GPT-4 has given many people a taste of the power of large language models.

However, OpenAI is not open. Many in the industry have set out to clone GPT, and many ChatGPT alternatives are built on open-source models, especially Meta's open-source LLaMA model.

Examples include Stanford's Alpaca, Vicuna from UC Berkeley together with CMU and Stanford, and Dolly from the startup Databricks.

ChatGPT-style large language models built for different tasks and applications have sprung up across the entire field.

So the question is, how do researchers choose an appropriate model, or even multiple models, to complete a complex task?

Recently, a research team from Microsoft Research Asia and Zhejiang University released HuggingGPT, a large-model collaboration system.

Paper address: https://arxiv.org/pdf/2303.17580.pdf

HuggingGPT uses ChatGPT as a controller to connect various AI models in the Hugging Face community and complete complex multimodal tasks.

This means you gain a kind of superpower: through HuggingGPT you get multimodal capabilities, with text-to-image, text-to-video, and speech all at your fingertips.

HuggingGPT as a bridge

The researchers pointed out that solving the current problems of large language models (LLMs) may be the first and crucial step towards AGI.

Because current large language model techniques still have some shortcomings, there are pressing challenges on the way to building AGI systems:

- Limited by the input and output forms of text generation, current LLMs lack the ability to process complex information (such as vision and speech);

- In practical application scenarios, some complex tasks are usually composed of multiple subtasks, so the scheduling and collaboration of multiple models is required, which is beyond the capabilities of the language model;

- For some challenging tasks, LLMs show excellent results in zero-shot or few-shot settings, but they are still weaker than some experts (such as fine-tuned models).

To handle complex AI tasks, LLMs should be able to coordinate with external models to exploit their capabilities. Therefore, the key point is how to choose the appropriate middleware to bridge LLMs and AI models.

The researchers found that each AI model can be described in language by summarizing its capabilities.

From this, they introduced a concept: language is a generic interface for LLMs (namely ChatGPT) to connect AI models.

By incorporating AI model descriptions into prompts, ChatGPT can be regarded as the brain that manages the AI model. Therefore, this method allows ChatGPT to call external models to solve practical tasks.

Simply put, HuggingGPT is a collaborative system, not a large model.

Its role is to connect ChatGPT and Hugging Face, then process inputs of different modalities and solve numerous complex AI tasks.

To that end, each AI model in the Hugging Face community has a corresponding model description in the HuggingGPT library, which is fused into the prompt to establish a connection with ChatGPT.

Subsequently, HuggingGPT uses ChatGPT as the brain to determine the answer to the question.
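As a rough illustration of this description-fusion idea, the sketch below condenses a few Hugging Face model descriptions into a selection prompt. The model names are real Hugging Face models, but the prompt wording and the helper function are hypothetical, not HuggingGPT's actual code.

```python
# Illustrative only: fuse model descriptions into a prompt so ChatGPT
# can act as the "brain" that picks the right expert model.
MODEL_DESCRIPTIONS = {
    "facebook/detr-resnet-101":
        "object detection: finds objects in an image, returns boxes and labels",
    "openai/whisper-base":
        "automatic speech recognition: transcribes audio into text",
    "runwayml/stable-diffusion-v1-5":
        "text-to-image: generates an image from a text prompt",
}

def build_selection_prompt(task: str) -> str:
    """Fuse candidate model descriptions into a model-selection prompt."""
    listing = "\n".join(f"- {name}: {desc}"
                        for name, desc in MODEL_DESCRIPTIONS.items())
    return (f"Given the task '{task}', select the most suitable model from "
            f"the candidates below and reply with its name only.\n{listing}")

print(build_selection_prompt("object-detection"))
```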

So far, HuggingGPT has integrated hundreds of Hugging Face models around ChatGPT, covering 24 tasks such as text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video.

The experimental results prove that HuggingGPT has the ability to handle multi-modal information and complex artificial intelligence tasks.

Four-Stage Workflow

The entire workflow of HuggingGPT can be divided into the following four stages (a code sketch follows the list):

- Task planning: ChatGPT parses the user request, decomposes it into multiple subtasks, and plans the task order and dependencies based on its knowledge

- Model selection: the LLM assigns the parsed tasks to expert models based on the model descriptions from Hugging Face

- Task execution: the expert models execute the assigned tasks on inference endpoints and return the execution information and inference results to the LLM

- Response generation: the LLM summarizes the execution logs and inference results and returns the summary to the user
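As a rough sketch of how these four stages fit together, here is a minimal Python outline. Every helper in it (plan_tasks, select_model, run_on_endpoint, generate_response) is a hypothetical placeholder, not HuggingGPT's real API.

```python
# Minimal sketch of the four-stage workflow; all helpers are placeholders.

def hugginggpt(user_request: str) -> str:
    # Stage 1: task planning -- ChatGPT parses the request into subtasks
    # with an execution order and dependencies.
    tasks = plan_tasks(user_request)

    results = {}
    for task in tasks:  # assume plan_tasks returns tasks in dependency order
        # Stage 2: model selection -- pick an expert model from Hugging Face
        # based on its model description.
        model = select_model(task)

        # Stage 3: task execution -- run the expert model on an inference
        # endpoint and record its execution info and inference result.
        results[task["id"]] = run_on_endpoint(model, task, results)

    # Stage 4: response generation -- ChatGPT summarizes the execution logs
    # and inference results into a reply for the user.
    return generate_response(user_request, tasks, results)
```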

Multimodal capabilities, all in hand

Experiment settings

In the experiments, the researchers adopted two variants of GPT models, gpt-3.5-turbo and text-davinci-003, as large language models (LLMs), which are publicly accessible through the OpenAI API.

To make the output of the LLM more stable, the researchers set the decoding temperature to 0.

At the same time, to make the LLM's output conform to the expected format, they set a logit_bias of 0.1 on the format constraints.
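For concreteness, these two decoding settings would look roughly like this with the 2023-era OpenAI Python client; the prompt text and the token IDs in logit_bias are placeholders, since the paper's exact values are not reproduced here.

```python
import openai

task_planning_prompt = "Parse the user request into subtasks..."  # placeholder
format_token_ids = [90, 60]  # placeholder IDs for format-constraint tokens

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": task_planning_prompt}],
    temperature=0,  # deterministic decoding for more stable output
    logit_bias={tid: 0.1 for tid in format_token_ids},  # nudge format tokens
)
print(response["choices"][0]["message"]["content"])
```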

The researchers provide the detailed prompts designed for the task planning, model selection, and response generation stages in tables in the paper, where {} indicates slots that must be filled with the corresponding text before the prompt is fed into the LLM.
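To illustrate the {} slot convention, here is a hypothetical template and fill function; the wording only paraphrases the style of the paper's task-planning prompt rather than quoting it, and {{...}} marks the slots.

```python
# Hypothetical template in the style of the paper's task-planning prompt;
# {{...}} marks slots that must be filled before the prompt is sent to the LLM.
TASK_PLANNING_TEMPLATE = (
    "The AI assistant parses user input into several subtasks, each with a "
    "task type, an id, dependency ids, and arguments. "
    "Chat history: {{chat_history}}. Current request: {{input}}."
)

def fill_slots(template: str, **slots: str) -> str:
    """Replace each {{name}} slot with its corresponding text."""
    for name, value in slots.items():
        template = template.replace("{{" + name + "}}", value)
    return template

prompt = fill_slots(TASK_PLANNING_TEMPLATE,
                    chat_history="(empty)",
                    input="Describe this image and read the caption aloud.")
print(prompt)
```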

The researchers tested HuggingGPT on a wide range of multimodal tasks.

With the cooperation of ChatGPT and the expert models, HuggingGPT can solve tasks across modalities such as language, image, audio, and video, including forms such as detection, generation, classification, and question answering.

Although these tasks seem simple, mastering these basic abilities is a prerequisite for HuggingGPT to solve complex tasks.

For example, the visual question answering task:

Text generation:

Text-to-image:

HuggingGPT can integrate multiple pieces of input content for simple reasoning. Even when a request involves multiple task resources, HuggingGPT can decompose the main task into multiple basic tasks and finally integrate the inference results of multiple models to arrive at the correct answer.

In addition, the researchers tested and evaluated the effectiveness of HuggingGPT on complex tasks, demonstrating its ability to handle multiple complex tasks at once.

A single request may contain multiple hidden tasks or implicit needs, in which case relying on one expert model is not enough to solve it.

HuggingGPT can organize the collaboration of multiple models through task planning.

A user request may explicitly contain multiple tasks:

The figure below shows the ability of HuggingGPT to deal with complex tasks in multi-round dialogue scenarios.

Users divide a complex request into several steps and reach the final goal through multiple rounds of requests. The researchers found that HuggingGPT can track the context of user requests through dialogue context management in the task-planning stage, and correctly resolve the resources and task plans that users mention.

"Jarvis" open source

At present, this project has been open sourced on GitHub, but the code has not been fully released.

Interestingly, the researchers named the project JARVIS, after the invincible AI in "Iron Man".

JARVIS: A system connecting LLMs and the ML community

Note that HuggingGPT requires OpenAI's API to run.

Netizens: The Future of Research

JARVIS/HuggingGPT, like Meta's earlier Toolformer, acts as a connector.

So do ChatGPT plugins.

One netizen said: "I strongly suspect that the first artificial general intelligence (AGI) will appear earlier than expected. It will rely on 'glue' AI that can intelligently glue together a series of narrow AIs and practical tools.

I gained access to the Wolfram plugin for ChatGPT, which turned it from a math noob into a math genius overnight. Of course, this is only a small step, but it points to the trend of future development.

I predict that within the next year or so we will see an AI assistant connected to dozens of large language models (LLMs) and similar tools, and the end user will just need to issue instructions to their assistant to have it complete tasks for them. This sci-fi moment is coming."

Some netizens said that this is the future research method.

GPT sits in front of a host of tools and knows how to use them.

References:

https://twitter.com/johnjnay/status/1641609645713129473

https://news.ycombinator.com/item?id=35390153

Reprinted from 新智元, by 桃子 and Britta.
