One step closer to visual unification: after "segmenting everything", Meta has open-sourced a set of general-purpose vision models
DINOv2 can be used for a variety of vision tasks without fine-tuning.
After open-sourcing the "Segment Anything" model (SAM), Meta has taken another step down the road toward visual foundation models.
This time, they have open-sourced a set of models called DINOv2. These models produce high-performance visual representations that can be used, without fine-tuning, for downstream tasks such as classification, segmentation, image retrieval, and depth estimation.
This set of models has the following characteristics:
- Trained with self-supervision, requiring no large amounts of labeled data;
- Serves as a backbone for almost all CV tasks without fine-tuning, such as image classification, segmentation, image retrieval, and depth estimation;
- Learns features directly from images, without relying on text descriptions, which lets the model better capture local information;
- Can be trained on any collection of images;
- Pretrained versions of DINOv2 are already available and are competitive with CLIP and OpenCLIP across a range of tasks.
Learning task-agnostic pretrained representations has become the standard in natural language processing. These features can be used "as is" (without fine-tuning), and they significantly outperform task-specific models on downstream tasks. This success is due to pretraining on large amounts of raw text with pretext objectives, such as language modeling or word embeddings, that require no supervision.
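The "use as is" idea can be illustrated with a toy nearest-class-centroid classifier over frozen feature vectors. This is only a sketch of the paradigm, not anything from the paper: the 3-d vectors and labels below are made up for illustration, standing in for embeddings a frozen backbone would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(xs) / n for xs in zip(*vectors)]

def nearest_centroid_predict(train, query):
    """train: dict mapping label -> list of frozen feature vectors.
    Returns the label whose class centroid is most similar to `query`.
    No weights are updated anywhere: the features are used 'as is'."""
    centroids = {label: centroid(vecs) for label, vecs in train.items()}
    return max(centroids, key=lambda lbl: cosine(centroids[lbl], query))

# Toy 3-d "features" -- in practice these would come from a frozen backbone.
train = {
    "cat": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.1]],
}
print(nearest_centroid_predict(train, [0.85, 0.15, 0.05]))  # -> cat
```

If the frozen features separate classes well, even a classifier this simple works; that is the property the article claims for DINOv2's representations.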
Following this paradigm shift in NLP, similar "foundation" models can be expected to emerge in computer vision. These models should generate visual features that work "out of the box" on any task, whether at the image level (e.g. image classification) or the pixel level (e.g. segmentation).
The most promising efforts toward such foundation models have focused on text-guided pretraining, i.e. using a form of textual supervision to guide the training of features. But text-guided pretraining limits what information about an image can be retained: captions only approximate the rich information in images, and finer, complex pixel-level information may never surface under this form of supervision. Furthermore, these image encoders require aligned text-image corpora and so lack the flexibility of their text counterparts, which can learn from raw data alone.
An alternative to text-guided pretraining is self-supervised learning, where features are learned from images alone. These methods are conceptually closer to pretext tasks such as language modeling, and they can capture information at both the image and pixel level. However, despite their potential to learn general-purpose features, most advances in self-supervised learning have been made in the context of pretraining on the small curated dataset ImageNet-1k. Some researchers have attempted to extend these methods beyond ImageNet-1k, but they focused on uncurated datasets, which typically led to a significant drop in feature quality. The cause is a lack of control over data quality and diversity, both of which are critical to producing good results.
In this work, the researchers explored whether self-supervised learning can produce general-purpose visual features if pretrained on a large amount of curated data. They revisit existing discriminative self-supervised methods that learn features at both the image and patch level, such as iBOT, and reconsider some of their design choices on larger datasets. Most of their technical contributions are tailored to stabilizing and accelerating discriminative self-supervised learning as model and data sizes scale. These improvements make the method roughly 2x faster and cut memory use to a third of that of similar discriminative self-supervised methods, allowing longer training runs and larger batch sizes.
For the pretraining data, they built an automated pipeline that filters and rebalances a dataset drawn from a large uncurated collection of images. It is inspired by pipelines used in NLP, where data similarity is used instead of external metadata and no manual annotation is required. A major difficulty when working with images is rebalancing concepts while avoiding overfitting on a few dominant modes. In this work a naive clustering approach works well for the problem, and the researchers curated a small but diverse corpus of 142M images to validate their approach.
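The rebalancing idea — cluster the image embeddings, then cap how many images any one cluster contributes — can be sketched in a few lines. This toy version, with randomly seeded centroids and 2-d vectors, only illustrates the concept; it is not the paper's actual pipeline, which operates on real embeddings at a far larger scale.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def rebalance(embeddings, k=2, cap=2, seed=0):
    """Naive rebalancing sketch: pick k random centroids, assign every
    image embedding to its nearest centroid, then keep at most `cap`
    images per cluster so a dominant concept cannot swamp the corpus.
    Returns the indices of the retained images."""
    rng = random.Random(seed)
    centroids = rng.sample(embeddings, k)
    clusters = {i: [] for i in range(k)}
    for idx, e in enumerate(embeddings):
        best = max(range(k), key=lambda i: cosine(centroids[i], e))
        clusters[best].append(idx)
    kept = []
    for members in clusters.values():
        kept.extend(members[:cap])  # cap each cluster's contribution
    return sorted(kept)

# Four near-duplicate "concepts" along one axis, two along the other.
embs = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.05], [0.97, 0.2],
        [0.0, 1.0], [0.05, 0.95]]
print(rebalance(embs, k=2, cap=2))
```

The cap is what performs the rebalancing: without it, the dominant mode (the first four vectors) would contribute twice as many images as the rest of the data combined.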
Finally, the researchers provide a variety of pretrained vision models, called DINOv2, trained on their data with different Vision Transformer (ViT) architectures. They release all the models and the code to retrain DINOv2 on any data. At scale, they validate the quality of DINOv2 on a variety of computer vision benchmarks at both the image and pixel level, as shown in Figure 2. They conclude that self-supervised pretraining alone is a good candidate for learning transferable frozen features that are comparable to the best publicly available weakly-supervised models.
The researchers assembled their curated LVD-142M dataset by retrieving, from a large pool of uncurated data, images close to those in several curated datasets. The paper describes the main components of the data pipeline, including the curated/uncurated data sources, the image deduplication step, and the retrieval system. The entire pipeline requires no metadata or text and works directly on images, as shown in Figure 3. Readers are referred to Appendix A for further details on the methodology.
Figure 3: Overview of the data processing pipeline. Images from curated and uncurated data sources are first mapped to embeddings. The uncurated images are then deduplicated before being matched against the curated images. The resulting combination enriches the initial dataset through a self-supervised retrieval system.
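The two embedding-space steps in the caption, deduplication and retrieval, can be sketched as follows. The cosine-similarity threshold and the top-k value here are arbitrary illustrative choices, not those used to build LVD-142M, and the 2-d vectors stand in for real image embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def deduplicate(embeddings, threshold=0.95):
    """Greedy near-duplicate removal: keep an embedding only if it is
    not too similar to any already-kept one. Returns kept indices."""
    kept = []
    for idx, e in enumerate(embeddings):
        if all(cosine(e, embeddings[j]) < threshold for j in kept):
            kept.append(idx)
    return kept

def retrieve(uncurated, curated, top=1):
    """For each curated embedding, return the indices of its nearest
    uncurated neighbours -- the images that enrich the dataset."""
    selected = set()
    for c in curated:
        ranked = sorted(range(len(uncurated)),
                        key=lambda i: cosine(uncurated[i], c),
                        reverse=True)
        selected.update(ranked[:top])
    return sorted(selected)

# Mirror the pipeline order: deduplicate, then match against curated data.
uncurated = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.1, 0.9]]
curated = [[1.0, 0.1]]
kept = deduplicate(uncurated)
pool = [uncurated[i] for i in kept]
matches = retrieve(pool, curated)
print([kept[i] for i in matches])  # original indices of retrieved images
```

At the scale of the actual pipeline, the brute-force loops here would be replaced by an approximate nearest-neighbour index, but the logic is the same: similarity in embedding space drives both steps, with no metadata or text involved.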
Discriminative self-supervised pre-training
The researchers learn their features with a discriminative self-supervised method that can be seen as a combination of the DINO and iBOT losses with the centering of SwAV. They also add a regularizer to spread features, plus a short high-resolution training phase.
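A DINO-style distillation term of the kind being combined here can be sketched in scalar form. The temperatures below are illustrative defaults only, and the sketch deliberately omits the EMA teacher update, multi-crop augmentation, and the iBOT masked-token variant of the same loss.

```python
import math

def softmax(logits, temp):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / temp) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Sketch of a DINO-style distillation term: the teacher's logits
    are centered (to prevent collapse onto one dimension) and sharpened
    with a low temperature, then the student is trained to match that
    distribution via cross-entropy."""
    t = softmax([x - c for x, c in zip(teacher_logits, center)],
                teacher_temp)
    s = softmax(student_logits, student_temp)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

# A student that agrees with the teacher incurs a lower loss
# than one that puts its mass on a different dimension.
print(dino_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
print(dino_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```

The centering term plays the role attributed to SwAV above: subtracting a running mean from the teacher's logits stops every output from collapsing onto the same dominant dimension.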
They considered several improvements to train the model at larger scale. The model was trained on A100 GPUs using PyTorch 2.0, and the released code can also be used with the pretrained models for feature extraction. Model details are given in Table 17 of the Appendix. On the same hardware, the DINOv2 code uses only a third of the memory of the iBOT implementation and runs up to twice as fast.
In this section, the researchers present empirical evaluations of the new model on a number of image understanding tasks. They evaluate both global and local image representations, covering category- and instance-level recognition, semantic segmentation, monocular depth prediction, and action recognition.