Zuckerberg personally announced Meta's new vision model! Self-supervised learning with no fine-tuning needed, and it beats OpenCLIP across multiple tasks
No text labels needed: Meta's fully self-supervised vision model is here!
Zuckerberg announced it in person, and it drew plenty of attention on release:
on tasks such as semantic segmentation, instance segmentation, depth estimation, and image retrieval, this large vision model, named DINOv2, achieves very strong results.
It even surpasses OpenCLIP, currently the best open-source vision model.
Meta had previously released DINO, a large self-supervised vision model, but this time the model's ability to recognize image features is clearly improved, and it segments the subject of a video accurately:
And DINOv2 does more than learn image segmentation through self-supervision. It can accurately locate where the head, body, and limbs of the same kind of object (a dog) are, across photos from different categories and different scenes:
In other words, DINOv2 learns to discover image features on its own.
Meta has not only released the open-source code but also put up a web demo to try out. Some netizens took the chance to throw shade:
Now this is open source. LLaMA, SAM, DINOv2: that's what open source looks like!
Let's take a look at how DINOv2 performs.
Accurately recognizing the same object across different artistic styles
In fact, DINOv2 is a large vision model built on the previous-generation DINO.
The model has on the order of 1 billion parameters and still uses a Vision Transformer (ViT) architecture, but unlike DINO, DINOv2 takes great care in selecting its training data.
Specifically, DINOv2 builds a data filtering pipeline that selects images with similar content while screening out duplicates:
Although the training images ultimately fed to DINOv2 carry no text labels, their visual features are genuinely similar.
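The dedup step of such a pipeline can be sketched with a simple greedy pass over image embeddings. This is a minimal illustration of the idea, not Meta's actual pipeline (which operates at web scale with clustering and retrieval against curated seed sets); the function name and threshold are our own assumptions.

```python
import numpy as np

def curate_images(embeddings: np.ndarray, dedup_threshold: float = 0.98) -> list[int]:
    """Greedy near-duplicate removal on image embeddings.

    Keeps an image only if its cosine similarity to every image kept
    so far stays below `dedup_threshold`. Returns the kept indices.
    Hypothetical sketch, not the real DINOv2 curation code.
    """
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < dedup_threshold:
            kept.append(i)
    return kept

# Toy example: three embeddings, the second nearly duplicating the first.
embs = np.array([
    [1.0, 0.0],
    [0.999, 0.01],   # near-duplicate of the first
    [0.0, 1.0],      # a distinct image
])
print(curate_images(embs))  # → [0, 2]: the near-duplicate is dropped
```

The greedy pass is quadratic in the worst case; a real pipeline would use approximate nearest-neighbor search instead, but the similarity-threshold idea is the same.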
So how well does a vision model trained on this kind of data perform?
Here is DINOv2's performance across 8 vision tasks, including semantic segmentation, classification, and depth estimation, where orange shows self-supervised methods and dark pink shows weakly supervised methods.
As the chart shows, the self-supervised model performs on par with weakly supervised models.
The real-world results are also good. Even when the same object is rendered in very different artistic styles across a series of images, DINOv2 can accurately identify their shared features and group them together.
For example: the bird and the winged airplane in groups (a) and (b), the elephant and the elephant sculpture in group (c), the car and the toy car model in group (d), and the horse and the doodled horse in group (e):
And judging from the PCA (principal component analysis) visualizations, DINOv2 not only classifies accurately but also marks corresponding parts in the same colors: elephant trunks are all green, wheels are all red, horse tails are all yellow, and so on.
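This kind of visualization can be reproduced by running PCA over a model's per-patch features and mapping the top three components to RGB channels. The following is a minimal sketch under our own assumptions (random features standing in for real ViT patch tokens); matching parts across images end up with similar colors because they project similarly onto the leading components.

```python
import numpy as np

def patch_features_to_rgb(features: np.ndarray) -> np.ndarray:
    """Project per-patch features onto the top 3 principal components
    and rescale each component to [0, 1] so it can be shown as RGB.

    `features` has shape (num_patches, feature_dim), e.g. the patch
    tokens a ViT backbone outputs for one image.
    """
    centered = features - features.mean(axis=0)
    # SVD of the centered matrix gives the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                     # (num_patches, 3)
    mins, maxs = proj.min(axis=0), proj.max(axis=0)
    return (proj - mins) / (maxs - mins + 1e-8)    # map to [0, 1]

# Toy example: 16 random "patch features" of dimension 8.
rng = np.random.default_rng(0)
rgb = patch_features_to_rgb(rng.normal(size=(16, 8)))
print(rgb.shape)  # (16, 3): one RGB color per patch
```

Reshaping the output back to the image's patch grid and upsampling produces the colored part maps shown in the figures.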
In other words, DINOv2 understands the similarities in these images, much as a human might say an airplane "looks like a bird."
A demo of DINOv2 is now available, and we tried it out ourselves.
The demo can be tried directly
The official site offers demos of three main functions: semantic segmentation, image retrieval, and depth estimation.
According to Meta, DINOv2 surpasses OpenCLIP, the best-performing open-source vision model, on most benchmarks for these tasks.
Let's look at depth estimation first.
It's worth mentioning that besides performing better, DINOv2 also runs faster than iBOT: on the same hardware it needs only a third of the memory and runs more than twice as fast as iBOT.
Here is how Meta's paper compares it with OpenCLIP on a practical example:
We tried it with the muscle-man version of the "Xinbaodao" meme video. The results look quite good, and it estimates depth well even on heavily blurred frames:
Next is semantic segmentation. Here is the data comparison from Meta's paper:
Below is another comparison between OpenCLIP and DINOv2: the middle image shows OpenCLIP's result, and the right shows DINOv2's segmentation:
We also tried it on a photo of an office. DINOv2 segments people and objects fairly accurately, though there is some noise in the details:
Finally, image retrieval .
The example on the official site works quite well: input a photo of the Eiffel Tower, and it retrieves many artworks featuring the tower:
We tried it too: feeding in a frame from the "Huaqiang buys a melon" meme, most of the retrieved artworks were related to watermelons:
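Under the hood, this kind of retrieval amounts to ranking a gallery of precomputed image embeddings by cosine similarity to the query embedding. Here is a minimal sketch under our own assumptions (toy 3-dimensional vectors in place of real DINOv2 features; the function name is ours):

```python
import numpy as np

def retrieve(query: np.ndarray, gallery: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Return indices of the `top_k` gallery embeddings most similar
    to the query, ranked by cosine similarity (highest first)."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity per gallery item
    return np.argsort(-sims)[:top_k]  # best matches first

# Toy gallery of four embeddings; the query is closest to index 2.
gallery = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.55, 0.83, 0.0])
print(retrieve(query, gallery, top_k=2))  # → [2 1]
```

Because the features are learned without text labels, nearest neighbors in this space are images that *look* alike, which is why a melon photo pulls up watermelon artwork.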
So, where can such a large self-supervised vision model be used?
Judging from the video Meta released, the current applications lean environmental, such as estimating the height of trees around the world:
In addition, Zuckerberg said DINOv2 can also be used to improve medical imaging, monitor crop growth, and more. And of course, he made a point of adding:
It can be used to build a more immersive metaverse.
Well, it looks like Meta is staying the course on the metaverse...
Trial demo address: https://dinov2.metademolab.com/demos
Project address: https://github.com/facebookresearch/dinov2
Reference link: https://www.facebook.com/zuck/posts/pfbid02f3chCYQphfYnzRaDXeJxsT5EmyhbrFsjqLaU31KuTG63Ca4yMXFcDXQcukYPbWUMl