One-click control of more than 100,000 AI models: HuggingFace has created an "App Store" for ChatGPT-like models
With Transformers Agents, you can control more than 100,000 Hugging Face models to complete various multimodal tasks.
From chatting to programming to supporting various plug-ins, the powerful ChatGPT has long since ceased to be a simple dialogue assistant and is steadily moving toward a "management" role in the AI world.
On March 23, OpenAI announced that ChatGPT would begin supporting third-party plug-ins, such as Wolfram Alpha, the well-known science and engineering tool. With its help, ChatGPT, which previously struggled even with the classic "chickens and rabbits in the same cage" arithmetic puzzle, has become a top student in science and engineering. Many people on Twitter commented that the launch of ChatGPT plugins looks a bit like the launch of the iPhone App Store in 2008. It also means that AI chatbots are entering a new stage of evolution: the "meta app" stage.
Then, in early April, researchers from Zhejiang University and Microsoft Research Asia proposed an important method called "HuggingGPT", which can be seen as a large-scale demonstration of this approach. HuggingGPT lets ChatGPT act as a controller (which can be understood as a management layer) that manages a large number of other AI models to solve complex AI tasks. Specifically, when it receives a user request, HuggingGPT uses ChatGPT for task planning, selects models according to the function descriptions available on HuggingFace, executes each subtask with the selected AI model, and aggregates a response from the execution results.
This approach can compensate for several shortcomings of current large models, such as the limited range of modalities they can process and the fact that in some respects they fall short of specialized models.
Although HuggingGPT dispatches HuggingFace models, it is not an official HuggingFace product. Now HuggingFace itself has finally made a move.
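The four stages described above (plan, select, execute, aggregate) can be illustrated with a small toy sketch. This is not HuggingGPT's code: the `Model` class, the two stub models, and the keyword-based planner are hypothetical stand-ins, and a real controller would ask a language model to do the planning.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Model:
    """Hypothetical stand-in for a hosted model and its function description."""
    name: str
    description: str
    run: Callable[[str], str]


MODELS = [
    Model("captioner", "image captioning: describe an image in text",
          lambda x: f"caption of {x}"),
    Model("speaker", "text to speech: read text aloud",
          lambda x: f"audio of ({x})"),
]


def plan_tasks(request: str) -> List[str]:
    """Stage 1, task planning. A keyword stub replaces the language model here."""
    text = request.lower()
    tasks = []
    if "picture" in text or "image" in text:
        tasks.append("image captioning")
    if "aloud" in text:
        tasks.append("text to speech")
    return tasks


def select_model(task: str) -> Model:
    """Stage 2, model selection: pick the first model whose description mentions the task."""
    for model in MODELS:
        if task in model.description:
            return model
    raise LookupError(f"no model available for task: {task}")


def handle(request: str, data: str) -> str:
    """Stages 3 and 4: execute subtasks in order, feeding each output forward, then aggregate."""
    outputs = []
    current = data
    for task in plan_tasks(request):
        current = select_model(task).run(current)
        outputs.append(f"{task}: {current}")
    return " | ".join(outputs)


print(handle("Describe the picture, then read it aloud", "beaver.png"))
```

The point of the sketch is only the division of labor: the controller never processes the image or audio itself, it decides which specialist to call and in what order.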
Similar in concept to HuggingGPT, the company launched a new API: HuggingFace Transformers Agents. With Transformers Agents, you can control more than 100,000 Hugging Face models to complete various multimodal tasks.
For example, suppose you want Transformers Agents to explain aloud what is depicted in a picture. It tries to understand your instruction (Read out loud the content of the image), converts it into a prompt, and selects the appropriate models and tools to complete the task.
Nvidia AI scientist Jim Fan commented that this day has finally come and that it is an important step toward the "Everything App".
However, some people say that this is not the same as AutoGPT's automatic iteration; it is more like saving the steps of writing a prompt and manually specifying tools, and it is still too early to talk about an "Everything App".
Transformers Agents address: https://huggingface.co/docs/transformers/transformers_agents
How to use Transformers Agents?
Alongside the release, HuggingFace published a Colab notebook that anyone can try out.
In short, it provides a natural language API on top of transformers: first a curated set of tools is defined, and an agent is designed to interpret natural language and use these tools.
Moreover, Transformers Agents are extensible by design.
The team has identified a set of tools that can be given to the agent. The following tools are integrated:
- Document Question Answering: Given a document in image format (e.g. PDF), answer questions about that document (Donut)
- Text Question Answering: Given a long text and a question, answer the question in the text (Flan-T5)
- Unconditional image captioning: Caption an image (BLIP)
- Image question answering: Given an image, answer a question about it (ViLT)
- Image Segmentation: Given an image and a prompt, output the segmentation mask for that prompt (CLIPSeg)
- Speech-to-Text: Given a recording of a person speaking, transcribe speech into text (Whisper)
- Text to Speech: Convert text to speech (SpeechT5)
- Zero-shot text classification: Given a text and a list of labels, determine which label the text corresponds to most (BART)
- Text Summarization: Summarize a long text in one or a few sentences (BART)
- Translation: Translate the text into a given language (NLLB)
These tools are integrated in transformers and can also be used manually:
from transformers import load_tool
tool = load_tool("text-to-speech")
audio = tool("This is a text to speech tool")
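The idea of a curated, extensible toolbox can be sketched without the library. The example below is a toy under stated assumptions: the `Tool` class, the `REGISTRY` dictionary, and the `register` and `describe_tools` helpers are hypothetical names, not transformers identifiers. It shows how a tool pairs a callable with a description, and how adding a tool extends what the agent could be told about.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    """A named callable plus the description an agent could show to its language model."""
    name: str
    description: str
    fn: Callable[[str], str]


# Hypothetical registry; the tool names are illustrative only.
REGISTRY: Dict[str, Tool] = {}


def register(tool: Tool) -> None:
    """Extending the toolbox only requires adding one more entry."""
    REGISTRY[tool.name] = tool


def describe_tools() -> str:
    """Render the tool list as text, one line per tool, in registration order."""
    return "\n".join(f"- {t.name}: {t.description}" for t in REGISTRY.values())


register(Tool("summarizer", "Summarize a long text in one sentence.",
              lambda text: text.split(".")[0] + "."))
register(Tool("upper_caser", "Convert a text to upper case.",
              lambda text: text.upper()))

print(describe_tools())
# Like the manual usage above, a tool can also be called directly.
print(REGISTRY["summarizer"].fn("Agents call tools. Tools wrap models."))
```

Keeping the description next to the callable is the design choice that matters: the description is what a language model would read when deciding which tool fits a request.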
Users can also push a tool's code to a Hugging Face Space or a model repository and then use the tool directly through the agent. Tools integrated this way include:
- Text Downloader: Download text from web URL
- Text to image: Generate an image according to the prompt, using Stable Diffusion
- Image transformation: Modify an image given an initial image and a prompt, using InstructPix2Pix Stable Diffusion
- Text to video: Generate a small video according to the prompt, using damo-vilab
To see how this works in practice, let's look at a few examples from HuggingFace:
Generate image descriptions:
agent.run("Caption the following image", image=image)
Read text aloud:
agent.run("Read the following text out loud", text=text)
Input: A beaver is swimming in the water
Output: audio clip (tts_example)
Read the file: [image]
Get started quickly
Before running agent.run, a large language model agent needs to be instantiated. OpenAI models and open source models such as BigCode and OpenAssistant are supported here.
First, install the agents extra, which pulls in all default dependencies:
pip install transformers[agents]
To use an OpenAI model, install the openai dependency and then instantiate an OpenAiAgent:
pip install openai

from transformers import OpenAiAgent
agent = OpenAiAgent(model="text-davinci-003", api_key="<your_api_key>")
To use BigCode or OpenAssistant, first log in to access the inference API:
from huggingface_hub import login
login("<YOUR_TOKEN>")
Then, instantiate the agent:
from transformers import HfAgent
# Starcoder
agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")
# StarcoderBase (alternative; uncomment to use instead)
# agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoderbase")
# OpenAssistant (alternative; uncomment to use instead)
# agent = HfAgent(url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")
If users have their own inference endpoints for this model (or another model), they can replace the URL above with their own URL endpoints.
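As a small illustration of swapping endpoints, the helper below builds the URL string that would be passed to the agent. The function name `endpoint_for` and the example custom URL are hypothetical; only the hosted base URL comes from the endpoints shown above.

```python
from typing import Optional

# Base URL taken from the hosted endpoints shown above.
INFERENCE_API = "https://api-inference.huggingface.co/models/"


def endpoint_for(model_id: str, custom_url: Optional[str] = None) -> str:
    """Return the custom endpoint if one is given, otherwise the hosted URL for model_id."""
    if custom_url:
        # Strip a trailing slash so both kinds of URL have the same shape.
        return custom_url.rstrip("/")
    return INFERENCE_API + model_id


print(endpoint_for("bigcode/starcoder"))
# Hypothetical self-hosted endpoint, used here only as a placeholder.
print(endpoint_for("bigcode/starcoder", "https://example.invalid/my-endpoint/"))
```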
Next, let's take a look at the two APIs provided by Transformers Agents:
Single execution uses the agent's run() method:
agent.run("Draw me a picture of rivers and lakes.")
The agent automatically selects the appropriate tool for the task and executes it. One or more tasks can be performed in the same instruction, although the more complex the instruction, the more likely the agent is to fail.
agent.run("Draw me a picture of the sea then transform the picture to add an island")
Each run() operation is independent, so it can be called several times in succession for different tasks. To keep state or pass non-text objects to the agent during execution, the user can specify the variables the agent should use. For example, a user could first generate an image of rivers and lakes and then ask the model to update that image to add an island:
picture = agent.run("Generate a picture of rivers and lakes.")
updated_picture = agent.run("Transform the image in picture to add an island to it.", picture=picture)
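The independence of run() calls, and the way explicitly passed variables carry a result from one call to the next, can be mimicked with a toy class. `ToyAgent` is a hypothetical stand-in that only manipulates strings; it is not the transformers agent and calls no model.

```python
from typing import Any


class ToyAgent:
    """Illustrative stand-in: every run() starts from a fresh state."""

    def run(self, task: str, **variables: Any) -> str:
        # Only what the caller passes in is visible; nothing is remembered
        # from earlier calls.
        base = variables.get("picture", "blank canvas")
        return f"{base} + [{task}]"


agent = ToyAgent()
picture = agent.run("draw rivers and lakes")
with_island = agent.run("add an island", picture=picture)  # builds on the first result
from_scratch = agent.run("add an island")                  # no variable passed

print(with_island)
print(from_scratch)
```

The second call sees the first image only because it was handed over as a keyword argument; the third call, with no argument, has nothing to build on.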
Passing variables explicitly is also helpful when the model misunderstands the user's request and mixes up tools. For example:
agent.run("Draw me the picture of a capybara swimming in the sea")
Here, the model could interpret the request in two ways:
- Have the text-to-image tool generate a capybara swimming in the sea
- Alternatively, have the text-to-image tool generate a capybara, then use the image-transformation tool to make it swim in the sea
To enforce the first interpretation, the user can pass the prompt as an argument:
agent.run("Draw me a picture of the prompt", prompt="a capybara swimming in the sea")
The agent also has a chat-based method:
agent.chat("Generate a picture of rivers and lakes")
agent.chat("Transform the picture so that there is a rock in there")
This method is useful when state should be preserved across instructions. It is well suited to experimentation but tends to work best with single instructions, whereas the run() method is better at handling complex instructions. The chat() method can also accept arguments if the user wants to pass a non-text object or a specific prompt.
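The difference between the two methods can be pictured with a toy class that returns the context each call can see. `ToyChatAgent` and its `reset` method are hypothetical names chosen for this sketch, not the library's API.

```python
from typing import List


class ToyChatAgent:
    """Illustrative stand-in that exposes what each call can 'see'."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def run(self, task: str) -> List[str]:
        """Independent execution: the context is just the current task."""
        return [task]

    def chat(self, message: str) -> List[str]:
        """Stateful execution: the context includes every earlier chat message."""
        self.history.append(message)
        return list(self.history)

    def reset(self) -> None:
        """Start a new conversation by dropping the stored history."""
        self.history.clear()


agent = ToyChatAgent()
agent.chat("Generate a picture of rivers and lakes")
context = agent.chat("Transform the picture so that there is a rock in there")
print(len(context))                      # two messages are visible to the second chat call
print(len(agent.run("Draw the sea")))    # run() sees only its own task
```

Because the second chat message can refer back to "the picture" from the first, follow-up instructions stay short, at the cost of each step depending on what came before.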