Mind-blowing! The online demo of HuggingGPT makes a stunning debut, and the images generated in netizens' hands-on tests are amazing
After HuggingGPT, launched by Zhejiang University & Microsoft, went viral, its demo has just opened, and netizens who couldn't wait can now try it for themselves.
The strongest combination, HuggingFace + ChatGPT = "Jarvis", is now open for demo.
Some time ago, Zhejiang University & Microsoft released HuggingGPT, a large-model collaboration system, which immediately took off.
The researchers proposed using ChatGPT as a controller that connects the various AI models in the HuggingFace community to complete complex multimodal tasks.
Throughout the whole process, all you need to do is describe your requirements in natural language.
An NVIDIA scientist commented: "This is the most interesting paper I have read this week. Its idea is very close to the 'Everything App' I described before, that is, everything is an app whose information is read directly by AI."
Hands-on experience
Now, HuggingGPT has added a Gradio demo.
Project address: https://github.com/microsoft/JARVIS
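For those who want to poke at the hosted demo programmatically rather than through the web UI, a rough sketch with the gradio_client library follows. The Space's endpoint names, argument layout, and any required API keys are assumptions here, so inspect them first with view_api():

```python
# Hedged sketch: query the hosted HuggingGPT Space via gradio_client.
# The endpoint names/arguments of this particular Space are unverified
# assumptions; view_api() prints what the Space actually exposes.
from gradio_client import Client

client = Client("microsoft/HuggingGPT")
client.view_api()  # lists the Space's real endpoints and parameters
# result = client.predict("How many people are in this picture?", ...)
```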
Some netizens rushed to try it, starting with "identify how many people are in this picture".
Based on the inference results, HuggingGPT concluded that there are two people walking on the street in the picture.
The specific process is as follows:
First, the image-to-text model nlpconnect/vit-gpt2-image-captioning is used for image description, generating the text "2 women walking on the street with a train".
Next, the object detection model facebook/detr-resnet-50 is used to count the people in the picture. The model detects 7 objects, 2 of which are people.
Then the visual question answering model dandelin/vilt-b32-finetuned-vqa is used to confirm the result. Finally, the system returns a detailed response along with information about the models used to answer the question. A sketch of invoking these three expert models directly appears below.
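For readers who want to replay these three steps outside the demo, here is a minimal sketch using the transformers pipeline API with the same three models. The input file name street.jpg is a placeholder assumption:

```python
# Minimal sketch: invoke the three expert models directly with
# transformers pipelines. "street.jpg" is a placeholder input image.
from transformers import pipeline

image = "street.jpg"

# Step 1: image captioning
caption = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
print(caption(image)[0]["generated_text"])

# Step 2: object detection, then count the "person" labels
detect = pipeline("object-detection", model="facebook/detr-resnet-50")
people = [d for d in detect(image) if d["label"] == "person"]
print(f"{len(people)} people detected")

# Step 3: visual question answering to cross-check the count
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
print(vqa(image=image, question="How many people are walking on the street?"))
```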
It was also asked to understand the sentiment of the phrase "I love you" and translate it into Tamil (Tamiḻ).
HuggingGPT invokes the following models:
First, the model dslim/bert-base-NER is used to classify the sentiment of the text "I love you", which comes out as "romantic".
Then, ChatGPT is used to translate the text into Tamil: "Nan unnai kadalikiren".
No image, audio, or video files were generated in the inference results; a rough sketch of a similar chain follows.
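Below is a rough sketch of a similar sentiment-plus-translation chain built directly on transformers pipelines. The model choices are illustrative stand-ins (the demo itself routed sentiment to dslim/bert-base-NER and did the translation with ChatGPT), and the NLLB language codes are assumptions:

```python
# Illustrative sketch, not the demo's actual routing.
from transformers import pipeline

# Sentiment of the phrase, using the pipeline's default English model.
sentiment = pipeline("sentiment-analysis")
print(sentiment("I love you"))  # e.g. [{'label': 'POSITIVE', 'score': ...}]

# English-to-Tamil translation with a multilingual model
# (assumed NLLB codes: eng_Latn -> tam_Taml).
translate = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="tam_Taml",
)
print(translate("I love you")[0]["translation_text"])
```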
HuggingGPT did fail when transcribing an MP3 file, though. The netizen said, "Not sure if this is a problem with my input file."
Next, let's look at the image generation ability.
The request: add the text "I LOVE YOU" as an overlay on top of an image of "a cat dancing".
HuggingGPT first uses the runwayml/stable-diffusion-v1-5 model to generate a picture of a dancing cat from the given text.
Then, the same model is used to generate an image of the text "I LOVE YOU".
Finally, the two pictures are merged, with the output as follows:
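To reproduce something similar locally, here is a minimal sketch with the diffusers library. It simplifies the demo's flow: instead of merging a second generated image, it draws the text overlay with PIL, and it assumes a CUDA GPU:

```python
# Simplified sketch: one diffusion pass plus a PIL text overlay,
# rather than the demo's merge of two generated images. Assumes CUDA.
import torch
from diffusers import StableDiffusionPipeline
from PIL import ImageDraw

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a cat dancing").images[0]

draw = ImageDraw.Draw(image)
draw.text((20, 20), "I LOVE YOU", fill="white")  # simple overlay
image.save("dancing_cat_love.png")
```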
Jarvis steps into reality
Within a few days of the project going public, Jarvis has gained 12.5k stars and 811 forks on GitHub.
The researchers pointed out that solving the current problems of large language models (LLMs) may be the first and crucial step towards AGI.
Because current LLM techniques still have shortcomings, there remain pressing challenges on the way to building AGI systems.
To handle complex AI tasks, LLMs should be able to coordinate with external models to exploit their capabilities.
Therefore, the key point is how to choose the appropriate middleware to bridge LLMs and AI models.
Paper address: https://arxiv.org/pdf/2303.17580.pdf
In the paper, the researchers propose language as the common interface connecting ChatGPT with the AI models. HuggingGPT's workflow is mainly divided into four steps:
The first is task planning: ChatGPT parses the user request, decomposes it into multiple tasks, and plans the task order and dependencies based on its knowledge.
The second is model selection: the LLM assigns the parsed tasks to expert models according to the model descriptions on HuggingFace.
The third is task execution: the expert models run the assigned tasks on inference endpoints and return the execution logs and inference results to the LLM.
The last is response generation: the LLM summarizes the execution logs and inference results and returns the summary to the user. A minimal skeleton of this loop is sketched below.
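To make the four stages concrete, here is a minimal skeleton of the loop, under the assumption that llm is any callable taking a prompt string and returning text (for example, a thin ChatGPT API wrapper). None of this is the project's actual code:

```python
# Hypothetical skeleton of the four-stage HuggingGPT loop; `llm` is
# assumed to be a callable(prompt: str) -> str, e.g. a ChatGPT wrapper.
import json

def plan_tasks(llm, request: str) -> list:
    # Stage 1, task planning: the LLM decomposes the request into a
    # JSON list of tasks with types, arguments, and dependencies.
    return json.loads(llm(f"Decompose this request into JSON tasks: {request}"))

def select_model(llm, task: dict, descriptions: dict) -> str:
    # Stage 2, model selection: the LLM picks an expert model based on
    # the HuggingFace model descriptions injected into the prompt.
    return llm(f"Given models {descriptions}, pick one for task: {task}").strip()

def execute_task(task: dict, model_id: str) -> dict:
    # Stage 3, task execution: run the chosen model on an inference
    # endpoint (stubbed here) and record the result.
    return {"task": task, "model": model_id, "result": "<inference output>"}

def generate_response(llm, request: str, results: list) -> str:
    # Stage 4, response generation: the LLM summarizes logs and results.
    return llm(f"Summarize for the user '{request}': {json.dumps(results)}")
```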
If such a request is given:
Please generate a picture of a girl reading a book in the same pose as the boy in example.jpg. Then please describe the new picture with your voice.
You can see how HuggingGPT decomposes it into 6 subtasks and selects models to execute them, producing the final result; a hypothetical reconstruction of such a task plan follows.
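For illustration only, the planned task list for this request might look roughly like the following hypothetical reconstruction (task names, ids, and dependencies are invented for the example, not the demo's real output):

```python
# Hypothetical task plan; names and dependencies are illustrative.
tasks = [
    {"id": 0, "task": "pose-detection", "args": {"image": "example.jpg"}},
    {"id": 1, "task": "pose-text-to-image", "dep": [0],
     "args": {"text": "a girl reading a book"}},
    {"id": 2, "task": "image-to-text", "dep": [1], "args": {}},
    {"id": 3, "task": "text-to-speech", "dep": [2], "args": {}},
]
```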
By incorporating AI model descriptions into prompts, ChatGPT can be regarded as the brain that manages these AI models. This approach therefore enables ChatGPT to invoke external models to solve practical tasks.
Simply put, HuggingGPT is a collaboration system, not a large model itself.
Its role is to connect ChatGPT and HuggingFace, process inputs of different modalities, and solve many complex AI tasks.
To that end, each AI model in the HuggingFace community has a corresponding model description in the HuggingGPT library, which is fused into the prompt to establish the connection with ChatGPT.
HuggingGPT then uses ChatGPT as the brain to determine the answer to the question, as in the mock-up below.
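For instance, a model-selection prompt might fuse descriptions along these lines (an illustrative mock-up; the wording and fields are assumptions, not the project's real template):

```python
# Mock-up of fusing model descriptions into the selection prompt.
descriptions = {
    "facebook/detr-resnet-50": "Object detection: locate and label objects in an image.",
    "dandelin/vilt-b32-finetuned-vqa": "Visual question answering over an image.",
}
prompt = (
    "You are the controller. Available models:\n"
    + "\n".join(f"- {name}: {desc}" for name, desc in descriptions.items())
    + "\nChoose the best model for the user's task and reply with its name."
)
```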
So far, HuggingGPT has integrated hundreds of models on HuggingFace around ChatGPT, covering 24 tasks such as text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video.
Experimental results demonstrate that HuggingGPT can perform well on various forms of complex tasks.
Hot comments from netizens
Some netizens said that HuggingGPT is similar to Microsoft's previously proposed Visual ChatGPT, and it seems that they have extended the original idea to a large set of pre-trained models.
Visual ChatGPT is built directly on top of ChatGPT by injecting many visual foundation models (VFMs), and that paper proposes a Prompt Manager (PM).
With the help of PM, ChatGPT can exploit these VFMs and receive their feedback in an iterative manner until the user's requirement is satisfied or the end condition is reached.
Some netizens believe this idea is indeed very similar to ChatGPT plugins. Semantic understanding and task planning centered on an LLM can keep pushing out the capability boundary of LLMs. By combining an LLM with other functional or domain experts, we can create more powerful and flexible AI systems that better adapt to various tasks and needs.
This is how I've always thought about AGI: AI models that understand complex tasks and then delegate smaller tasks to other, more specialized AI models.
Just like the brain, which has different regions for specific tasks. That sounds logical.
References:
https://twitter.com/1littlecoder/status/1644466883813408768
https://www.youtube.com/watch?v=3_5FRLYS-2A
https://huggingface.co/spaces/microsoft/HuggingGPT