An Introduction to the Foundational Model of Image Segmentation — Segment Anything (SAM)

xiao sean
Apr 21, 2023 · 11 min read


Kirillov et al., "Segment Anything," arXiv preprint arXiv:2304.02643, 2023.

Project page: https://segment-anything.com/

Paper link: https://arxiv.org/abs/2304.02643

Demo link: https://segment-anything.com/demo

Facebook's AI research team, FAIR, introduces Segment Anything, releasing a foundation model for image segmentation called the Segment Anything Model (SAM), along with the corresponding training dataset (SA-1B). SAM accepts various prompts to produce different types of image segmentation (Interactive Segmentation, Semantic Segmentation, Instance Segmentation, Foreground Segmentation, etc.). Experiments demonstrate that SAM can adapt in a zero-shot manner to tasks not seen in the training set (such as Edge Detection, Object Proposal Generation, and Instance Segmentation).

Because large-scale segmentation datasets are scarce on the internet, the paper devotes a significant portion to how such a dataset was collected. Although the paper has only just been released, the community has already built many applications on top of SAM, such as combining SAM's predicted masks with the image generation model Stable Diffusion, which makes it possible to change the category or color of generated objects based on a mask. It is highly recommended to first visit the project page, Segment Anything, and try the interactive demo to see what SAM can do.

Concept

Foundational Model:

The concept of a Foundational Model revolves around training on large-scale datasets and adapting to various datasets/tasks. Although similar to the notion of pre-trained models in recent years, Foundational Models emphasize the size of the data collection. Moreover, the models aim to possess generalization capabilities, more specifically, zero-shot abilities. Even if a model has not learned a task during training, it can still achieve corresponding results by providing appropriate prompts. Designing a Foundational Model involves examining three aspects (tasks, models, and data):

  1. What tasks should be learned to enable the model’s zero-shot generalization?
  2. What model architecture should be chosen?
  3. How can diverse data be collected?

Prompt:

Existing Foundational Models offer prompts, allowing users to input different types of prompts to execute various downstream tasks. Prompts are not limited to textual information (Text) but can also incorporate spatial information (Bounding box, Point, Mask, Layout, etc.).

Compositionality:

Recent models tend to develop high compositionality, such as image generation models like DALL·E and Stable Diffusion that integrate the CLIP model. The paper also mentions the hope that SAM can serve as a plugin to expand multiple models in the future. Examples include reconstructing 3D models of unseen objects based on RGB-D images or combining wearable devices to capture objects in the user’s line of sight and display the corresponding information.

Model Design Concept

The design of the model in this paper revolves around the following three questions:

  1. What tasks should be learned to enable the model’s zero-shot generalization?
  • An appropriate objective function is needed to enable the model to adapt to a wide range of downstream tasks.
  • Designing promptable segmentation allows for zero-shot and few-shot adaptation to new datasets and tasks through prompts.
  • Prompts should support both textual and spatial information.

2. What model architecture should be chosen?

  • The architecture must support flexible prompts, and in order to allow users to adjust prompts in real-time via the UI, it should be capable of instantaneously outputting segmented images.
  • The model consists of three modules: an image encoder for feature extraction, a prompt encoder for the input prompts, and a mask decoder for predicting the segmentation result. Because users continuously adjust prompts during interactive use, the features extracted by the image encoder can be computed once and reused. The quality of the model depends mainly on a powerful image encoder, so the paper suggests using a heavy image encoder and offloading its computation to cloud servers.
  • The prompt encoder and mask decoder should be lightweight enough to run directly in the browser on a CPU, keeping computation time below 50 ms so the user experience stays responsive (a short sketch of this split follows below).
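
Below is a minimal sketch of this split in practice, using the officially released segment-anything Python package. The checkpoint filename and image path are assumptions for illustration: the expensive image encoder runs once per image, and each subsequent prompt only touches the lightweight prompt encoder and mask decoder.

```python
# Minimal sketch with the official segment-anything package (pip install segment-anything).
# The checkpoint filename and image path are assumptions for illustration.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")  # the heavy image encoder is the part worth running on a GPU/server

predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Expensive step: run the image encoder once and cache the image embeddings.
predictor.set_image(image)

# Cheap step: every new prompt reuses the cached embeddings.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),  # a single (x, y) click
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # returns 3 candidate masks
)
```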

3. How to collect diverse data?

There is no large-scale image segmentation dataset available online, mainly because annotating pixel-level data is very time-consuming. Therefore, this paper proposes a Data Engine for data collection. The concept is to use existing models to predict image segmentation results first, and then have humans fine-tune them. Through continuous iteration, more data will be collected, and the model’s accuracy will improve.

The Data Engine has three stages: assisted manual annotation, semi-automatic annotation, and fully automatic annotation. In the first stage, existing models make predictions and interactive tools help annotators refine the labels. In the second stage, the model trained in the first stage automatically annotates some objects, so annotators only need to label the remaining ones. In the third stage, SAM annotates images fully automatically using point prompts: the image is covered with a regular 32x32 grid of points, and each point is fed to the model as a prompt to generate a corresponding mask. On average, about 100 high-quality masks are generated automatically per image at this stage (a sketch of this stage with the released code is shown below).
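
The released code exposes this grid-prompting behavior as SamAutomaticMaskGenerator. A minimal sketch (the checkpoint and image filenames are assumptions; the thresholds shown are the library defaults):

```python
# Sketch of fully automatic mask generation with a 32x32 grid of point prompts.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # the 32x32 grid of point prompts described above
    pred_iou_thresh=0.88,          # keep masks the IoU head itself scores highly
    stability_score_thresh=0.95,   # keep masks that are stable under threshold changes
)

masks = mask_generator.generate(image)  # list of dicts: "segmentation", "area", "predicted_iou", ...
print(f"{len(masks)} masks generated")
```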

Through continuous model optimization and automatic generation of high-quality training masks, a data collection loop is created. The result is the largest image segmentation dataset to date, SA-1B, which contains over 1 billion masks across 11 million licensed images, with an average image resolution of 3,300 x 4,950 pixels.

It was shown that the accuracy of the model trained with approximately 1 million (1M) images approaches the accuracy of a model trained with 11 million images.

In recent years, foundation models have been able to adapt quickly to various downstream tasks and be deployed in many fields. As a result, most companies conduct a Responsible AI (RAI) analysis before releasing their models, to avoid unfair, erroneous, or biased predictions. To check data diversity, the distribution of images across countries was analyzed: most countries are represented by over 1,000 images, and compared to the COCO dataset, a larger proportion of SA-1B comes from middle-income and low-income countries, whereas the majority of COCO's data comes from high-income countries.

Model Architecture

  • Image encoder: This stage extracts image features (embeddings) using a ViT (Vision Transformer) pre-trained with MAE (Masked Autoencoders). The output features of this module are called image embeddings.
  • Prompt encoder: The prompt encoder supports both text and spatial information. Spatial prompts come in two types: sparse prompts, such as points and boxes, and dense prompts, such as masks. To keep computation time low for a good user experience, the design for converting the various prompts into embeddings is deliberately simple.

For dense prompts, masks are passed through convolution and downsampling layers to become mask embeddings, which are then added element-wise to the image embeddings.

For sparse prompts, the spatial coordinates are converted into positional embeddings and summed with their respective learned type embeddings (e.g., box-corner embedding, foreground/background point embedding); the resulting prompt tokens are then passed to the mask decoder together with the image embeddings.
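
As a rough illustration only (a simplified toy, not SAM's actual implementation), the two prompt paths might look something like this: sparse prompts become tokens built from a positional encoding plus a learned type embedding, while a dense mask prompt is downsampled by convolutions into an embedding grid that can be summed with the image embedding.

```python
# Toy sketch of the two prompt-encoding paths (simplified; not SAM's exact code).
import math
import torch
import torch.nn as nn

DIM = 256  # embedding dimension (SAM also uses 256)

class ToyPromptEncoder(nn.Module):
    def __init__(self, dim: int = DIM):
        super().__init__()
        # Random Fourier frequencies for encoding (x, y) coordinates.
        self.register_buffer("freqs", torch.randn(2, dim // 2))
        # Learned type embeddings for sparse point prompts (0 = background, 1 = foreground).
        self.point_type = nn.Embedding(2, dim)
        # Dense mask prompt: downsample with convolutions to a dim-channel grid.
        self.mask_down = nn.Sequential(
            nn.Conv2d(1, dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(dim // 4, dim, kernel_size=2, stride=2),
        )

    def forward(self, points_xy, point_labels, mask):
        # Sparse path: positional encoding + type embedding -> prompt tokens for the decoder.
        proj = 2 * math.pi * points_xy @ self.freqs                  # (B, N, dim/2)
        pos = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (B, N, dim)
        sparse_tokens = pos + self.point_type(point_labels)
        # Dense path: conv-downsampled mask embedding, summed with the image embedding later.
        dense_embed = self.mask_down(mask)
        return sparse_tokens, dense_embed

enc = ToyPromptEncoder()
tokens, dense = enc(
    torch.rand(1, 2, 2),            # two (x, y) points in normalized coordinates
    torch.tensor([[1, 0]]),         # one foreground and one background click
    torch.rand(1, 1, 256, 256),     # a low-resolution mask prompt
)
print(tokens.shape, dense.shape)    # torch.Size([1, 2, 256]) torch.Size([1, 256, 64, 64])
```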

  • Mask decoder: Referring to the Mask Classification architecture of DETR and MaskFormer, this stage can simultaneously predict Semantic and Instance-level image segmentation results. The overall training method is simple, using Focal Loss + Dice Loss for training.

A detail worth noting is that, in addition to encoding the different types of prompt embeddings (box embedding, point embedding), the mask decoder also learns additional output tokens (e.g., iou_token, mask_tokens), which play a role analogous to the class token in ViT ("An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale").

Exploring the difficulties of the segmentation foundational model

To accommodate various downstream applications, the model needs to support different types of prompts, such as text, bounding boxes, points, masks, etc. It should even be able to handle multiple types of prompts at once. For instance, in an Interactive Segmentation task, users can click on areas of interest (green points) and areas of disinterest (red points) to extract the corresponding masks.

  • Prompts can be ambiguous cues or hints that guide the model in generating an output, so the model must decide which interpretation the user intended.

In the example below, it can be seen that the prompt provided is relatively ambiguous. For instance, in the third column of the image, when clicking on a person’s arm, it can be interpreted in different ways — the model could either select the entire person, just the two hands, or only the left hand. There are many real-life examples of such ambiguity, like when a point is on a shirt, it could represent the shirt or the person wearing it. Therefore, it’s crucial to design the model training in a way that it can handle these ambiguous situations and ensure that it produces the correct mask for different prompts.

In practical use, text prompts are often used as input, but sometimes relying solely on text might not yield the correct results. However, by combining different types of prompts, the model can achieve accurate results.

To let the model cope with this ambiguity, it generates multiple different masks for a single prompt. During training, only the predicted mask with the minimum loss against the ground-truth mask is used for backpropagation. Experiments show that predicting 3 masks is enough, as the three masks often correspond to the whole, a part, and a subpart of the object. So that the model can decide which output to return, a small head predicts the IoU (Intersection over Union) of each mask, and the final mask is chosen based on these predicted scores. Therefore, when training the mask decoder, an additional MSE loss is added to regress the IoU between each predicted mask and the ground truth (a sketch of this selection is shown below).
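
The selection logic might look roughly like this (an illustrative sketch, not the official training code; a plain BCE term stands in for the focal + dice mask loss described later):

```python
# Illustrative sketch (not the official training code): back-propagate only the best of
# the 3 candidate masks, and train a small IoU head with MSE against the true IoU.
import torch
import torch.nn.functional as F

def multimask_loss(pred_masks, pred_ious, gt_mask):
    """pred_masks: (3, H, W) logits; pred_ious: (3,); gt_mask: (H, W) float in {0, 1}."""
    per_mask_loss = torch.stack([
        F.binary_cross_entropy_with_logits(m, gt_mask)   # stand-in for focal + dice
        for m in pred_masks
    ])
    best = per_mask_loss.argmin()                        # only the best mask gets gradients

    # Actual IoU of each candidate, used as the regression target for the IoU head.
    with torch.no_grad():
        bin_masks = (pred_masks > 0).float()
        inter = (bin_masks * gt_mask).sum(dim=(-2, -1))
        union = ((bin_masks + gt_mask) > 0).float().sum(dim=(-2, -1)).clamp(min=1)
        iou_target = inter / union

    return per_mask_loss[best] + F.mse_loss(pred_ious, iou_target)
```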

Training Process:

Training was conducted on 256 GPUs with 1,024 x 1,024 images, following an interactive segmentation training flow that randomly selects either a point or a bounding box as the prompt. In practice, an object is selected from the ground truth, and either one or more points are randomly sampled from the mask's pixels, or the object's bounding box is enlarged by about 10% as input, with some noise added so the model tolerates the imprecise prompts users will give in real use. Since a single prediction may not be perfect, the process then sequentially adds positive/negative points based on the previously predicted mask, mimicking how users will interact with the model later (a rough sketch of the prompt sampling follows).
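
A rough sketch of how such prompts could be sampled from a ground-truth mask (my reading of the description above; the exact jitter and noise values are assumptions):

```python
# Sketch of sampling training prompts from a ground-truth mask (noise values are assumptions).
import numpy as np

rng = np.random.default_rng(0)

def sample_point_prompt(gt_mask: np.ndarray) -> np.ndarray:
    """Pick a random foreground pixel as a simulated user click, returned as (1, 2) (x, y)."""
    ys, xs = np.nonzero(gt_mask)
    i = rng.integers(len(xs))
    return np.array([[xs[i], ys[i]]])

def sample_box_prompt(gt_mask: np.ndarray) -> np.ndarray:
    """Tight box around the mask, enlarged by ~10% per side with a little Gaussian noise."""
    ys, xs = np.nonzero(gt_mask)
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    w, h = x1 - x0, y1 - y0
    noise = lambda s: rng.normal(0, 0.01 * s)
    return np.array([
        x0 - 0.1 * w + noise(w), y0 - 0.1 * h + noise(h),
        x1 + 0.1 * w + noise(w), y1 + 0.1 * h + noise(h),
    ])

gt_mask = np.zeros((1024, 1024), dtype=bool)
gt_mask[300:700, 200:600] = True                    # dummy object for illustration
print(sample_point_prompt(gt_mask), sample_box_prompt(gt_mask))
```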

The loss used during training is focal + dice (for the masks) + MSE (for the IoU predictions). Experiments showed that auxiliary deep supervision did not improve the decoder's training, and no data augmentation was used, mainly because the training dataset is already large enough. For more details, please refer to the paper's appendix (a sketch of the mask loss is given below).
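
For reference, a hedged sketch of a focal + dice mask loss in the usual formulations (the exact weighting and hyper-parameters used in the paper may differ):

```python
# Hedged sketch of focal + dice losses for mask logits of shape (B, H, W);
# the paper's exact weights/hyper-parameters are not reproduced here.
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = prob * target + (1 - prob) * (1 - target)          # probability of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    prob = torch.sigmoid(logits).flatten(1)
    target = target.flatten(1)
    inter = (prob * target).sum(-1)
    return (1 - (2 * inter + eps) / (prob.sum(-1) + target.sum(-1) + eps)).mean()

def mask_loss(logits, target):
    # Total mask loss; the IoU head adds a separate MSE term (see the sketch above).
    return focal_loss(logits, target) + dice_loss(logits, target)
```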

Although the official release only includes the prediction code and not the training code, you can refer to the blog post “How To Fine-Tune Segment Anything,” which demonstrates fine-tuning using MSE loss.

Zero-shot Transfer

The SAM model can obtain corresponding image segmentation results by combining different Prompts. For example, after using an object bounding box to select a cat’s body, the model returns the cat’s Mask, which can be considered an Instance Segmentation result.
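
Continuing the SamPredictor sketch from earlier (the box coordinates here are made up for illustration), a box prompt returns the mask of the boxed object:

```python
# Box prompt with the predictor from the earlier sketch (set_image already called).
import numpy as np

box = np.array([100, 150, 420, 500])     # hypothetical x_min, y_min, x_max, y_max

masks, scores, _ = predictor.predict(
    box=box,
    multimask_output=False,              # a box is usually unambiguous; one mask suffices
)
print(masks.shape)                       # (1, H, W) boolean mask of the boxed object
```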

Experiments also show that even untrained tasks can be predicted by inputting appropriate Prompts.

  • Low-level visual tasks: Edge Detection
  • Mid-level visual tasks: Object Proposal Generation
  • High-level visual tasks: Instance Segmentation

For interactive segmentation, center points extracted from ground-truth masks were used as point prompts, and SAM was compared against RITM across 23 different segmentation datasets. The results show that SAM outperforms RITM on most of them.
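
One reasonable way to derive such a "center" click from a ground-truth mask is to take the interior point farthest from the mask boundary via a distance transform (a sketch under that assumption; the paper's exact procedure may differ in detail):

```python
# Sketch: pick the interior point farthest from the mask boundary as the prompt click.
import numpy as np
from scipy import ndimage

def center_point(gt_mask: np.ndarray) -> np.ndarray:
    dist = ndimage.distance_transform_edt(gt_mask)   # distance to the nearest background pixel
    y, x = np.unravel_index(np.argmax(dist), dist.shape)
    return np.array([[x, y]])                        # (x, y), ready to pass as point_coords

gt_mask = np.zeros((480, 640), dtype=bool)
gt_mask[100:300, 200:400] = True                     # dummy mask for illustration
print(center_point(gt_mask))                         # a point deep inside the dummy mask
```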

Recent Applications of SAM:

There have been numerous applications that combine Segment-Anything with image-generation AI to create different objects in specific locations.

IDEA-Research has released Grounded-Segment-Anything, which combines SAM with Grounding DINO (open-set object detection) so users can quickly see which objects are in an image along with their corresponding masks. It further integrates image generation AI to replace objects based on their masks or to change the color of specific items, such as clothing.

  • Inpaint Anything ("Inpaint Anything: Segment Anything Meets Image Inpainting") uses SAM to analyze and extract masks, which are then combined with Stable Diffusion for image generation.

Reference

Project page: Segment Anything

Introducing Segment Anything: Working toward the first foundation model for image segmentation

How To Fine-Tune Segment Anything

Inpaint Anything

Grounded-Segment-Anything

Published on Xiaosean website.


xiao sean

I am a Ph.D. candidate in Computer Science at the National Taiwan University of Science and Technology.