AI Computer Vision Research

Segment Anything Model (SAM): a new AI model from Meta AI that can "cut out" any object, in any image, with a single click

SAM is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training.
The research

SAM uses a variety of input prompts

Prompts specifying what to segment in an image allow for a wide range of segmentation tasks without the need for additional training.
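
As a concrete sketch of what prompting looks like with the released segment_anything Python package (the checkpoint path, image file, and pixel coordinates below are placeholders):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the released SAM weights (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h_checkpoint.pth")
predictor = SamPredictor(sam)

# SAM expects an HxWx3 uint8 RGB array.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground click at pixel (x, y) = (500, 375).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),   # 1 = foreground, 0 = background
    multimask_output=True,        # return several candidate masks for an ambiguous click
)

# The same predictor accepts a rough bounding box instead (XYXY pixel coordinates).
box_masks, _, _ = predictor.predict(box=np.array([100, 50, 400, 300]))
```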

SAM's promptable design enables flexible integration with other systems

Extensible outputs

Output masks can be used as inputs to other AI systems. For example, object masks can be tracked in videos, used in image editing applications, lifted to 3D, or used for creative tasks like collaging.
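
For illustration, here is one minimal downstream use of a SAM mask for an editing task. The helper below is ours, not part of SAM; `image` and `mask` are assumed to be the original RGB array and one boolean mask returned by the model.

```python
import numpy as np
from PIL import Image

def cut_out(image: np.ndarray, mask: np.ndarray) -> Image.Image:
    """Turn one SAM output mask into a transparent PNG cut-out for editing or collaging."""
    rgba = np.zeros((*mask.shape, 4), dtype=np.uint8)
    rgba[..., :3] = image                        # copy the RGB pixels
    rgba[..., 3] = mask.astype(np.uint8) * 255   # alpha channel comes from the mask
    return Image.fromarray(rgba, mode="RGBA")

# e.g. cut_out(image, masks[0]).save("object.png")
```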

Zero-shot generalization

SAM has learned a general notion of what objects are; this understanding enables zero-shot generalization to unfamiliar objects and images without requiring additional training.
Training the model

SAM’s data engine

SAM's advanced capabilities are the result of its training on millions of images and masks collected through the use of a model-in-the-loop "data engine." Researchers used SAM and its data to interactively annotate images and update the model. This cycle was repeated many times over to improve both the model and the dataset.

11M images, 1B+ masks

After annotating enough masks with SAM’s help, we were able to leverage SAM’s sophisticated ambiguity-aware design to annotate new images fully automatically. To do this, we present SAM with a grid of points on an image and ask SAM to segment everything at each point. Our final dataset includes more than 1.1 billion segmentation masks collected on ~11 million licensed and privacy-preserving images.
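
A sketch of this fully automatic mode using the released SamAutomaticMaskGenerator (checkpoint and image paths are placeholders; a 32×32 point grid is the default):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h_checkpoint.pth")

# Prompts SAM with a regular grid of points and keeps the resulting masks
# after filtering for quality, stability, and duplicates.
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each entry holds a binary mask plus metadata (area, bounding box,
# predicted mask quality, the prompt point that produced it, ...).
print(len(masks), sorted(masks[0].keys()))
```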

Efficient & flexible model design

SAM is designed to be efficient enough to power its data engine. We decoupled the model into 1) a one-time image encoder and 2) a lightweight mask decoder that can run in a web browser in just a few milliseconds per prompt.
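
This decoupling is visible in the public SamPredictor API; the sketch below (paths and click coordinates are placeholders) runs the encoder once and then reuses its cached embedding for every prompt.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h_checkpoint.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Step 1 (one-time, heavy): the ViT image encoder produces a single embedding.
predictor.set_image(image)
embedding = predictor.get_image_embedding()   # torch tensor, shape (1, 256, 64, 64)

# Step 2 (per prompt, light): each click reuses that cached embedding, so only
# the small prompt encoder + mask decoder run per prompt.
for x, y in [(120, 80), (300, 220), (410, 95)]:
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
    )
```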

Frequently asked questions

What type of prompts are supported?
  • Foreground/background points
  • Bounding box
  • Mask
  • Text prompts are explored in our paper, but this capability is not released
What is the structure of the model?
  • A ViT-H image encoder that runs once per image and outputs an image embedding
  • A prompt encoder that embeds input prompts such as clicks or boxes
  • A lightweight transformer-based mask decoder that predicts object masks from the image embedding and prompt embeddings
What platforms does the model use?
  • The image encoder is implemented in PyTorch and requires a GPU for efficient inference.
  • The prompt encoder and mask decoder can run directly with PyTorch or be converted to ONNX and run efficiently on CPU or GPU across a variety of platforms that support ONNX Runtime (see the sketch below)
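
A sketch of that ONNX path follows. The export flags and input names are taken from the example script and notebook shipped with the released repository and may differ across versions; paths are placeholders.

```python
# Export the prompt encoder + mask decoder to ONNX (run from the repo root):
#   python scripts/export_onnx_model.py --checkpoint path/to/sam_vit_h_checkpoint.pth \
#          --model-type vit_h --output sam_decoder.onnx

import cv2
import numpy as np
import onnxruntime
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h_checkpoint.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# The image embedding is computed once in PyTorch ...
embedding = predictor.get_image_embedding().cpu().numpy()

# ... and the exported decoder consumes it together with the encoded prompt.
coords = np.array([[[500.0, 375.0], [0.0, 0.0]]])    # one click plus a padding point
coords = predictor.transform.apply_coords(coords, image.shape[:2]).astype(np.float32)
labels = np.array([[1, -1]], dtype=np.float32)        # 1 = foreground, -1 = padding

session = onnxruntime.InferenceSession("sam_decoder.onnx")
masks, iou_preds, low_res = session.run(None, {
    "image_embeddings": embedding,
    "point_coords": coords,
    "point_labels": labels,
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.zeros(1, dtype=np.float32),
    "orig_im_size": np.array(image.shape[:2], dtype=np.float32),
})
masks = masks > predictor.model.mask_threshold        # logits -> boolean masks
```
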
How big is the model?
  • The image encoder has 632M parameters.
  • The prompt encoder and mask decoder have 4M parameters.
How long does inference take?
  • The image encoder takes ~0.15 seconds on an NVIDIA A100 GPU.
  • The prompt encoder and mask decoder take ~50ms on CPU in the browser using multithreaded SIMD execution.
What data was the model trained on?
  • The model was trained on the SA-1B dataset: the ~11 million licensed, privacy-preserving images and more than 1.1 billion masks described above.
How long does it take to train the model?
  • The model was trained for 3-5 days on 256 A100 GPUs.
Does the model produce mask labels?
  • No, the model predicts object masks only and does not generate labels.
Does the model work on videos?
  • Currently the model only supports images or individual frames from videos.
Where can I find the code?
  • The code and model weights are open source and available at github.com/facebookresearch/segment-anything.
