Segment Anything Model (SAM): a new AI model from Meta AI that can “cut out” any object, in any image, with a single click
You can try the model in a free interactive demo.
SAM is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training.
Previously, to solve any kind of segmentation problem, there were two classes of approaches. The first, interactive segmentation, allowed for segmenting any class of object but required a person to guide the method by iteratively refining a mask. The second, automatic segmentation, allowed for segmentation of specific object categories defined ahead of time (e.g., cats or chairs) but required substantial amounts of manually annotated objects to train (e.g., thousands or even tens of thousands of examples of segmented cats), along with the compute resources and technical expertise to train the segmentation model. Neither approach provided a general, fully automatic approach to segmentation.
SAM is a generalization of these two classes of approaches. It is a single model that can easily perform both interactive segmentation and automatic segmentation. The model’s promptable interface (described shortly) allows it to be used in flexible ways that make a wide range of segmentation tasks possible simply by engineering the right prompt for the model (clicks, boxes, text, and so on). Moreover, SAM is trained on a diverse, high-quality dataset of over 1 billion masks (collected as part of this project), which enables it to generalize to new types of objects and images beyond what it observed during training. This ability to generalize means that, by and large, practitioners will no longer need to collect their own segmentation data and fine-tune a model for their use case.
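To make the promptable interface concrete, here is a minimal sketch of prompting the model with a foreground click and with a rough bounding box. It assumes the publicly released segment-anything Python package and a downloaded model checkpoint; the checkpoint path, image file, and coordinates below are placeholders, not values specified in this post.

```python
# Minimal sketch, assuming the released segment-anything package and a downloaded checkpoint.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

# Prompt 1: a single foreground click (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# Prompt 2: a rough bounding box around the object of interest (XYXY format).
box_masks, box_scores, _ = predictor.predict(
    box=np.array([425, 600, 700, 875]),
    multimask_output=False,
)
```

Because clicks are labeled as foreground or background, a user can refine a mask interactively simply by adding more labeled points to the same prompt.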
Taken together, these capabilities enable SAM to generalize both to new tasks and to new domains. This flexibility is the first of its kind for image segmentation.

- SAM allows users to segment objects with just a click or by interactively clicking points to include and exclude from the object. The model can also be prompted with a bounding box.
- SAM can output multiple valid masks when faced with ambiguity about the object being segmented, an important and necessary capability for solving segmentation in the real world.
- SAM can automatically find and mask all objects in an image (see the sketch after this list).
- SAM can generate a segmentation mask for any prompt in real time after precomputing the image embedding, allowing for real-time interaction with the model.
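For the fully automatic mode in the list above, a sketch along these lines (again assuming the released segment-anything package; the checkpoint path and image are placeholders) generates masks for everything the model can find in an image:

```python
# Sketch of automatic, whole-image mask generation; package and checkpoint are assumptions.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
masks = mask_generator.generate(image)  # one dict per detected mask

for m in masks[:5]:
    # Each entry carries a binary mask plus metadata such as area and a predicted IoU score.
    print(m["area"], m["predicted_iou"], m["segmentation"].shape)
```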
In natural language processing and, more recently, computer vision, one of the most exciting developments is the emergence of foundation models that can perform zero-shot and few-shot learning for new datasets and tasks using “prompting” techniques. We took inspiration from this line of work.
We trained SAM to return a valid segmentation mask for any prompt, where a prompt can be foreground/background points, a rough box or mask, freeform text, or, in general, any information indicating what to segment in an image. The requirement of a valid mask simply means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for one of those objects. This task is used to pretrain the model and to solve general downstream segmentation tasks via prompting.
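This “valid mask under ambiguity” behavior is exposed directly when several candidate masks are requested for a single click. The sketch below assumes the same segment-anything package as the earlier examples; the click coordinates are illustrative only.

```python
# Ambiguous single-click prompt: ask the model for multiple candidate masks.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)
predictor.set_image(cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB))

masks, scores, _ = predictor.predict(
    point_coords=np.array([[310, 220]]),   # e.g., a click landing on a shirt
    point_labels=np.array([1]),
    multimask_output=True,                 # return several valid interpretations
)

# Several candidates come back (e.g., shirt / upper body / whole person), each with a quality score.
best = masks[np.argmax(scores)]
print(masks.shape, scores)
```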
We observed that the pretraining task and interactive data collection imposed specific constraints on the model design. In particular, the model needs to run in real time on a CPU in a web browser to allow our annotators to use SAM interactively in real time to annotate efficiently. While the runtime constraint implies a trade-off between quality and runtime, we find that a simple design yields good results in practice.
Under the hood, an image encoder produces a one-time embedding for the image, while a lightweight encoder converts any prompt into an embedding vector in real time. These two information sources are then combined in a lightweight decoder that predicts segmentation masks. After the image embedding is computed, SAM can produce a segment in just 50 milliseconds given any prompt in a web browser.
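That decomposition, with a heavy image encoder run once and a lightweight prompt encoder plus mask decoder run per prompt, is visible in how a predictor is typically used: the embedding is computed a single time, and each new prompt then only pays for the cheap decoding step. A rough sketch, assuming the released segment-anything package; measured times depend on hardware and are not the 50 ms in-browser figure quoted above.

```python
import time
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image

# Heavy, one-time step: the image encoder produces the image embedding.
predictor.set_image(image)

# Light, repeated step: every new prompt reuses the cached embedding,
# so only the prompt encoder and mask decoder run per interaction.
for click in [(120, 340), (500, 220), (640, 410)]:
    t0 = time.perf_counter()
    masks, _, _ = predictor.predict(
        point_coords=np.array([click]),
        point_labels=np.array([1]),
        multimask_output=False,
    )
    print(f"prompt {click}: {(time.perf_counter() - t0) * 1000:.1f} ms")
```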