MiniGPT-4 arrives: it recognizes images, chats about them, and builds websites from sketches, with surprisingly strong results
GPT-4 was released more than a month ago, but its image recognition feature is still not available to try. Researchers at King Abdullah University of Science and Technology (KAUST) have released a similar system, MiniGPT-4, which anyone can try now.
For humans, understanding the content of a picture is trivial. People can say what a picture means almost without thinking. Take a photo of a phone plugged into the wrong kind of charger: a person spots the problem at a glance, but for AI this remains very difficult.
GPT-4 handles such cases with ease. It quickly points out what is wrong in the picture: a VGA cable is being used to charge an iPhone.
In fact, GPT-4's appeal goes well beyond that. Even more exciting, it can generate websites directly from hand-drawn sketches. Draw a rough mock-up on scratch paper, take a photo, and send it to GPT-4, which then writes the website code based on the sketch.
Unfortunately, this GPT-4 feature has not been opened to the public yet, so it cannot be tried. However, a team at KAUST has developed MiniGPT-4, a system with GPT-4-like capabilities. The team members are Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li and Mohamed H. Elhoseiny, all from KAUST's Vision-CAIR research group.
MiniGPT-4 demonstrates many GPT-4-like capabilities, such as generating detailed image descriptions and creating websites from handwritten drafts. The authors also observed other emergent capabilities, including composing stories and poems based on a given image, proposing solutions to problems shown in an image, and teaching users how to cook from a photo of food.
Talking about pictures
A few examples show how well MiniGPT-4 performs. For best results, enter your prompts in English when testing.
First, let's look at how MiniGPT-4 describes pictures. For the image on the left, MiniGPT-4's answer is roughly: "This is a cactus growing on a frozen lake. It is surrounded by huge ice crystals, and there are snow-covered mountains in the distance...". If you then ask whether such a scene is common in the real world, MiniGPT-4 explains that it is rare and gives its reasons.
Next, let's look at MiniGPT-4's ability to answer questions about a picture. Asked "What's wrong with this plant? What should I do?", MiniGPT-4 not only identifies the problem, noting that the brown spots on the leaves are likely caused by a fungal infection, but also offers steps to treat it.
These examples show that MiniGPT-4's image understanding and interaction capabilities are already strong. MiniGPT-4 can also build websites from sketches. Given the sketch on the left and asked to turn it into a webpage, MiniGPT-4 generated the corresponding HTML code, and the resulting page matched the instructions.
MiniGPT-4 also makes it easy to write advertising copy for a picture. Asked to write an ad for the cup on the left, MiniGPT-4 accurately pointed out that the cup is printed with a sleeping cat, said it is well suited to coffee and cat lovers, and described details such as the cup's material.
Finally, MiniGPT-4 can generate a recipe from a picture, making it a handy little kitchen assistant.
It can also look at a picture and write a poem:
The MiniGPT-4 demo is now open and can be used online, so you can try it yourself:
Demo address: https://0810e8582bcad31944.gradio.live/
The project attracted attention as soon as it was released. Some users asked MiniGPT-4 to explain the objects in a picture:
Others ran a variety of tests:
The paper argues that GPT-4's advanced multimodal generation capabilities stem mainly from its use of a more advanced large language model (LLM). To explore this, the authors propose MiniGPT-4, which uses a single projection layer to align a frozen visual encoder with a frozen LLM (Vicuna).
MiniGPT-4 consists of a visual encoder made up of a pretrained ViT and Q-Former, a single linear projection layer, and the Vicuna large language model. Only the linear layer needs to be trained in order to align the visual features with Vicuna.
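To make the role of the projection layer concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes the frozen visual encoder emits a small set of fixed-width token vectors and that a single linear map y = Wx + b moves each one into the LLM's embedding space. The sizes (4 tokens, width 8, LLM width 16) and all function names are toy values invented for illustration; the real model's dimensions are far larger.

```python
import random

# Toy sizes for illustration only; these are assumptions, not the real model's.
NUM_VISUAL_TOKENS = 4   # tokens produced by the frozen visual encoder
VISUAL_DIM = 8          # width of each visual token
LLM_DIM = 16            # width of the frozen LLM's input embeddings


def init_linear(in_dim, out_dim, seed=0):
    """Create weights and bias for one linear layer (the only trainable part)."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(visual_tokens, weight, bias):
    """Map each visual token into the LLM embedding space: y = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]


def trainable_parameters(weight, bias):
    """Count the parameters that would be updated during training."""
    return sum(len(row) for row in weight) + len(bias)


# Random stand-in for the output of the frozen ViT + Q-Former.
rng = random.Random(1)
visual_tokens = [[rng.gauss(0.0, 1.0) for _ in range(VISUAL_DIM)]
                 for _ in range(NUM_VISUAL_TOKENS)]

weight, bias = init_linear(VISUAL_DIM, LLM_DIM)
llm_inputs = project(visual_tokens, weight, bias)

print(len(llm_inputs), len(llm_inputs[0]))   # 4 16
print(trainable_parameters(weight, bias))    # 144
```

In this sketch the encoder output and the LLM are never modified; only `weight` and `bias` would receive gradient updates, which is why the trainable parameter count depends only on the two widths.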
MiniGPT-4 is trained in two stages. The first is a conventional pre-training stage that uses about 5 million aligned image-text pairs and takes about 10 hours on 4 A100 GPUs. After this stage, Vicuna is able to understand images, but its text generation ability degrades considerably.
To solve this problem and improve usability, the researchers propose a novel way to create high-quality image-text pairs using the model itself together with ChatGPT. With this method they built a small but high-quality dataset (3,500 pairs in total).
The second stage fine-tunes the model on this dataset using a dialogue template, which significantly improves the reliability of its generation and its overall usability. This stage is computationally efficient: it takes only about 7 minutes on a single A100 GPU.
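As a rough illustration of what fine-tuning with a dialogue template can look like, the sketch below wraps an instruction and an image placeholder into a single prompt string. The marker strings are modeled on the format described in the MiniGPT-4 paper, but they should be treated as assumptions here, and the helper names (`build_prompt`, `split_around_image`, `build_training_example`) are hypothetical.

```python
# Hypothetical marker standing in for the projected visual tokens.
IMAGE_PLACEHOLDER = "<ImageFeature>"


def build_prompt(instruction):
    """Wrap an instruction in an assumed human/assistant dialogue template."""
    return f"###Human: <Img>{IMAGE_PLACEHOLDER}</Img> {instruction} ###Assistant:"


def split_around_image(prompt):
    """Split the prompt at the placeholder so visual embeddings can be spliced in."""
    before, after = prompt.split(IMAGE_PLACEHOLDER, 1)
    return before, after


def build_training_example(instruction, description):
    """Pair a templated prompt with the target text the model should produce."""
    return {"prompt": build_prompt(instruction), "target": " " + description}


example = build_training_example(
    "Describe this image in detail.",
    "A cactus grows on a frozen lake, surrounded by ice crystals.",
)
print(example["prompt"])
print(split_around_image(example["prompt"]))
```

Under this assumption, the text on either side of the placeholder would be embedded as ordinary tokens, the projected image features would take the placeholder's position, and the training loss would be computed on the target text only.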
Demo address: https://0810e8582bcad31944.gradio.live/
Paper address: https://github.com/Vision-CAIR/MiniGPT-4/blob/main/MiniGPT_4.pdf
Paper homepage: https://minigpt-4.github.io/
Code address: https://github.com/Vision-CAIR/MiniGPT-4