Visual AI capabilities, unified: automatic image detection and segmentation plus controllable text-to-image generation, from a Chinese team
The AI community really has turned into a contest of who can ship fastest.
Meta's SAM was released only a few days ago, and programmers in China have already stacked more capabilities on top of it, combining object detection, segmentation, and generation, several of the major visual AI functions, into a single project.
For example, building on Stable Diffusion and SAM, the chair in a photo can be seamlessly replaced with a sofa:
Changing outfits and hair color is just as easy:
As soon as the project was released, many people exclaimed: that was fast!
Another commenter joked: I now have a new wedding photo with Yui Aragaki.
The effects above come from Grounded-SAM. The project has already earned 1.8k stars on GitHub.
Put simply, it is a zero-shot vision application: given only an input image, it can detect and segment objects automatically.
The work comes from the IDEA Research Institute (Digital Economy Research Institute of the Guangdong-Hong Kong-Macao Greater Bay Area), whose founder and chairman is Shen Xiangyang.
No additional training required
Grounded SAM is mainly composed of two models, Grounding DINO and SAM.
Of the two, SAM (Segment Anything) is a zero-shot segmentation model that Meta released just 4 days ago.
It can generate masks for any object in an image or video, including objects and image types that never appeared during training.
SAM is trained to return a valid mask for any prompt: even when the prompt is ambiguous or could refer to several objects, the output should be a plausible mask for at least one of them. This promptable segmentation task is used to pre-train the model, and general downstream segmentation tasks are then solved through prompting.
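To make the ambiguity handling concrete, here is a minimal sketch of choosing one mask when a segmenter returns several scored candidates for a single prompt. Meta's released `segment_anything` predictor can return multiple masks together with a quality score for each; the helper `pick_best_mask` and the tiny masks and scores below are hypothetical illustrations, not part of that package or of Grounded-SAM.

```python
from typing import List, Sequence, Tuple

Mask = List[List[bool]]  # assumption: a mask is a 2D grid of booleans


def pick_best_mask(masks: Sequence[Mask], scores: Sequence[float]) -> Tuple[Mask, float]:
    """Return the candidate mask with the highest score (hypothetical helper)."""
    if not masks or len(masks) != len(scores):
        raise ValueError("need one score per mask and at least one mask")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return masks[best], scores[best]


# Made-up candidates for one ambiguous click: part, whole object, object plus background.
candidates = [
    [[True, False], [False, False]],
    [[True, True], [False, False]],
    [[True, True], [True, True]],
]
candidate_scores = [0.62, 0.91, 0.88]  # made-up quality scores

best_mask, best_score = pick_best_mask(candidates, candidate_scores)
print(best_mask, best_score)
```

Picking the top score is only one possible policy; an application could just as well show all candidates and let the user choose.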
The model consists of an image encoder, a prompt encoder, and a fast mask decoder. Once the image embedding has been computed, SAM can produce a segmentation for any prompt in a web browser within 50 milliseconds.
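The split between a heavy image encoder and a light prompt path is what makes per-prompt responses fast: the embedding is computed once and reused for every prompt. The toy class below only illustrates that caching pattern. Its method names mirror `set_image` and `predict` on the `SamPredictor` class in Meta's `segment_anything` package, but everything else, including the "embedding" and the matching rule, is a made-up stand-in rather than the real model.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]
Mask = List[List[bool]]


class ToyPromptableSegmenter:
    """Toy stand-in: heavy work once per image, light work once per prompt."""

    def __init__(self) -> None:
        self.encoder_calls = 0
        self._embedding: Optional[Image] = None

    def set_image(self, image: Image) -> None:
        # Stand-in for the expensive image encoder; runs once per image.
        self.encoder_calls += 1
        self._embedding = [row[:] for row in image]

    def predict(self, point: Tuple[int, int]) -> Mask:
        # Stand-in for prompt encoder + mask decoder; cheap, runs once per prompt.
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        row, col = point
        target = self._embedding[row][col]
        # Toy rule: select every pixel with the same value as the clicked pixel.
        return [[value == target for value in r] for r in self._embedding]


image = [[0, 0, 1],
         [0, 1, 1],
         [2, 2, 1]]
segmenter = ToyPromptableSegmenter()
segmenter.set_image(image)            # encoder runs here, once
mask_a = segmenter.predict((0, 0))    # two prompts reuse the same embedding
mask_b = segmenter.predict((2, 0))
print(segmenter.encoder_calls)
```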
Grounding DINO is earlier work from the same research team.
It is a zero-shot detection model that generates object boxes and labels from text descriptions.
Combining the two, any object in an image can be located from a text description, and SAM's strong segmentation ability then produces a fine-grained mask for it.
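The two-stage combination can be sketched as a small function that takes a text-driven detector and a box-driven segmenter as plain callables. This is a structural sketch only: `ground_and_segment`, the `Detection` and `Segment` types, the `box_threshold` default, and the stub models are all hypothetical and do not reproduce the project's actual code or the real models' APIs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]
Box = Tuple[int, int, int, int]  # assumption: (x0, y0, x1, y1), x1/y1 exclusive
Mask = List[List[bool]]


@dataclass
class Detection:
    label: str
    box: Box
    score: float


@dataclass
class Segment:
    label: str
    box: Box
    mask: Mask


def ground_and_segment(
    image: Image,
    text_prompt: str,
    detect: Callable[[Image, str], List[Detection]],
    segment: Callable[[Image, Box], Mask],
    box_threshold: float = 0.3,  # hypothetical default
) -> List[Segment]:
    """Stage 1: text -> boxes. Stage 2: each confident box -> mask."""
    segments = []
    for det in detect(image, text_prompt):
        if det.score < box_threshold:
            continue  # drop low-confidence boxes before segmenting
        segments.append(Segment(det.label, det.box, segment(image, det.box)))
    return segments


def fake_detect(image: Image, text_prompt: str) -> List[Detection]:
    # Stub detector with hard-coded answers, standing in for a real model.
    known = {
        "chair": Detection("chair", (0, 0, 2, 2), 0.9),
        "lamp": Detection("lamp", (2, 2, 4, 4), 0.1),
    }
    words = [w.strip() for w in text_prompt.split(".")]
    return [known[w] for w in words if w in known]


def fake_segment(image: Image, box: Box) -> Mask:
    # Stub segmenter: simply fills the box, standing in for a real model.
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(len(image[0]))]
            for y in range(len(image))]


image = [[0] * 4 for _ in range(4)]
found = ground_and_segment(image, "chair . lamp", fake_detect, fake_segment)
print([s.label for s in found])
```

Passing the models in as callables keeps the sketch independent of any particular library; real detector and segmenter wrappers would replace the two stubs.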
On top of these capabilities, the team also added Stable Diffusion, which yields the controllable image generation shown at the beginning.
It is worth noting that Stable Diffusion could already do something similar: paint over the image element you want to replace and enter a text prompt.
This time, Grounded SAM removes the manual selection step; the edit is controlled directly by a text description.
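The role the mask plays in such an edit can be illustrated with a deliberately simplified compositing step: pixels inside the mask come from the generated image, and everything else is kept from the original. This is a conceptual sketch only. Real inpainting models do considerably more than this pixel swap, and the function `composite` and the tiny example data are hypothetical.

```python
from typing import List

Image = List[List[int]]
Mask = List[List[bool]]


def composite(original: Image, generated: Image, mask: Mask) -> Image:
    """Take generated pixels where mask is True, original pixels elsewhere."""
    if not (len(original) == len(generated) == len(mask)):
        raise ValueError("original, generated and mask must have the same height")
    result = []
    for o_row, g_row, m_row in zip(original, generated, mask):
        if not (len(o_row) == len(g_row) == len(m_row)):
            raise ValueError("original, generated and mask must have the same width")
        result.append([g if m else o for o, g, m in zip(o_row, g_row, m_row)])
    return result


original = [[1, 1, 1],
            [1, 1, 1]]
generated = [[9, 9, 9],
             [9, 9, 9]]
# In a Grounded SAM style workflow this mask would come from a text prompt,
# not from the user painting over the region by hand.
mask = [[False, True, True],
        [False, False, True]]

edited = composite(original, generated, mask)
print(edited)
```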
In addition, it incorporates BLIP (Bootstrapping Language-Image Pre-training) to generate image captions, extract tags from them, and then generate object boxes and masks.
Currently, many more interesting features are in development.
For example, some extensions for editing people: changing clothes, hair color, skin color, and so on.
Usage instructions are also provided on GitHub. The project requires Python 3.8 or higher, PyTorch 1.7 or higher, and torchvision 0.8 or higher, and the related dependencies must be installed. See the GitHub project page for details.
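Before installing, the stated minimum versions can be checked with a short script. The sketch below is not part of the project; the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the minimums are simply the ones quoted above.

```python
import importlib
import re
import sys
from typing import Dict, Tuple

# Minimum versions as quoted in this article.
MINIMUMS = {"torch": "1.7", "torchvision": "0.8"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> Dict[str, bool]:
    """Report whether Python and each listed package meet the minimums."""
    report = {"python": sys.version_info >= (3, 8)}
    for name, minimum in MINIMUMS.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = False  # not installed
            continue
        report[name] = meets_minimum(str(getattr(module, "__version__", "0")), minimum)
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Comparing parsed integer tuples rather than raw strings matters here: as strings, "1.10" would sort below "1.7".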
The research team is from the IDEA Research Institute (Digital Economy Research Institute of the Guangdong-Hong Kong-Macao Greater Bay Area).
According to public information, the institute is an international research institution focused on artificial intelligence, the digital economy, and cutting-edge technology. Dr. Shen Xiangyang, former chief scientist of Microsoft Research Asia and former global executive vice president of Microsoft, is its founder and chairman.
One More Thing
For future work on Grounded SAM, the team lists several directions:
- Automatically generate images to build new datasets
- A powerful foundation model with segmentation pre-training
- Work with (Chat-)GPT
- Build a pipeline that automatically generates image labels, boxes, and masks, and can also generate new images.
It is worth mentioning that many members of the project team are active answerers on AI topics on Zhihu. They have also answered questions about Grounded SAM there, so interested readers can leave them a question.
Reference links:
- https://zhuanlan.zhihu.com/p/620271321
- https://github.com/IDEA-Research/Grounded-Segment-Anything
- https://segment-anything.com/