AI Computer Vision Research

Segment Anything Model (SAM): a new AI model from Meta AI that can "cut out" any object, in any image, with a single click

SAM is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training.
The research

SAM uses a variety of input prompts

Prompts specifying what to segment in an image allow for a wide range of segmentation tasks without the need for additional training.
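
As a concrete sketch of what prompting looks like with the released segment_anything Python package (the checkpoint path, image file, and pixel coordinates below are placeholders):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the released SAM weights (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h_checkpoint.pth")
predictor = SamPredictor(sam)

# SAM expects an HxWx3 uint8 RGB array.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground click at pixel (x, y) = (500, 375).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),   # 1 = foreground, 0 = background
    multimask_output=True,        # return several candidate masks for an ambiguous click
)

# The same predictor accepts a rough bounding box instead (XYXY pixel coordinates).
box_masks, _, _ = predictor.predict(box=np.array([100, 50, 400, 300]))
```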

SAM's promptable design enables flexible integration with other systems

Extensible outputs

Output masks can be used as inputs to other AI systems. For example, object masks can be tracked in videos, used in image editing applications, lifted to 3D, or used for creative tasks like collaging.
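
For illustration, here is one minimal downstream use of a SAM mask for an editing task. The helper below is ours, not part of SAM; `image` and `mask` are assumed to be the original RGB array and one boolean mask returned by the model.

```python
import numpy as np
from PIL import Image

def cut_out(image: np.ndarray, mask: np.ndarray) -> Image.Image:
    """Turn one SAM output mask into a transparent PNG cut-out for editing or collaging."""
    rgba = np.zeros((*mask.shape, 4), dtype=np.uint8)
    rgba[..., :3] = image                        # copy the RGB pixels
    rgba[..., 3] = mask.astype(np.uint8) * 255   # alpha channel comes from the mask
    return Image.fromarray(rgba, mode="RGBA")

# e.g. cut_out(image, masks[0]).save("object.png")
```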

Zero-shot generalization

SAM has learned a general notion of what objects are; this understanding enables zero-shot generalization to unfamiliar objects and images without requiring additional training.
Training the model

SAM’s data engine

SAM's advanced capabilities are the result of its training on millions of images and masks collected through the use of a model-in-the-loop "data engine." Researchers used SAM and its data to interactively annotate images and update the model. This cycle was repeated many times over to improve both the model and the dataset.

11M images, 1B+ masks

After annotating enough masks with SAM’s help, we were able to leverage SAM’s sophisticated ambiguity-aware design to annotate new images fully automatically. To do this, we present SAM with a grid of points on an image and ask SAM to segment everything at each point. Our final dataset includes more than 1.1 billion segmentation masks collected on ~11 million licensed and privacy-preserving images.
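
A sketch of this fully automatic mode using the released SamAutomaticMaskGenerator (checkpoint and image paths are placeholders; a 32×32 point grid is the default):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h_checkpoint.pth")

# Prompts SAM with a regular grid of points and keeps the resulting masks
# after filtering for quality, stability, and duplicates.
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each entry holds a binary mask plus metadata (area, bounding box,
# predicted mask quality, the prompt point that produced it, ...).
print(len(masks), sorted(masks[0].keys()))
```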

Efficient & flexible model design

SAM is designed to be efficient enough to power its data engine. We decoupled the model into 1) a one-time image encoder and 2) a lightweight mask decoder that can run in a web browser in just a few milliseconds per prompt.
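
This decoupling is visible in the public SamPredictor API; the sketch below (paths and click coordinates are placeholders) runs the encoder once and then reuses its cached embedding for every prompt.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h_checkpoint.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Step 1 (one-time, heavy): the ViT image encoder produces a single embedding.
predictor.set_image(image)
embedding = predictor.get_image_embedding()   # torch tensor, shape (1, 256, 64, 64)

# Step 2 (per prompt, light): each click reuses that cached embedding, so only
# the small prompt encoder + mask decoder run per prompt.
for x, y in [(120, 80), (300, 220), (410, 95)]:
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
    )
```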

Frequently asked questions

What type of prompts are supported?
  • Foreground/background points
  • Bounding box
  • Mask
  • Text prompts are explored in our paper, but this capability is not released
What is the structure of the model?
  • A ViT-H image encoder that runs once per image and outputs an image embedding
  • A prompt encoder that embeds input prompts such as clicks or boxes
  • A lightweight transformer-based mask decoder that predicts object masks from the image embedding and prompt embeddings
What platforms does the model use?
  • The image encoder is implemented in PyTorch and requires a GPU for efficient inference.
  • The prompt encoder and mask decoder can run directly with PyTorch or be converted to ONNX and run efficiently on CPU or GPU across a variety of platforms that support ONNX Runtime (see the sketch below)
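
A sketch of that ONNX path follows. The export flags and input names are taken from the example script and notebook shipped with the released repository and may differ across versions; paths are placeholders.

```python
# Export the prompt encoder + mask decoder to ONNX (run from the repo root):
#   python scripts/export_onnx_model.py --checkpoint path/to/sam_vit_h_checkpoint.pth \
#          --model-type vit_h --output sam_decoder.onnx

import cv2
import numpy as np
import onnxruntime
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h_checkpoint.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# The image embedding is computed once in PyTorch ...
embedding = predictor.get_image_embedding().cpu().numpy()

# ... and the exported decoder consumes it together with the encoded prompt.
coords = np.array([[[500.0, 375.0], [0.0, 0.0]]])    # one click plus a padding point
coords = predictor.transform.apply_coords(coords, image.shape[:2]).astype(np.float32)
labels = np.array([[1, -1]], dtype=np.float32)        # 1 = foreground, -1 = padding

session = onnxruntime.InferenceSession("sam_decoder.onnx")
masks, iou_preds, low_res = session.run(None, {
    "image_embeddings": embedding,
    "point_coords": coords,
    "point_labels": labels,
    "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
    "has_mask_input": np.zeros(1, dtype=np.float32),
    "orig_im_size": np.array(image.shape[:2], dtype=np.float32),
})
masks = masks > predictor.model.mask_threshold        # logits -> boolean masks
```
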
How big is the model?
  • The image encoder has 632M parameters.
  • The prompt encoder and mask decoder have 4M parameters.
How long does inference take?
  • The image encoder takes ~0.15 seconds on an NVIDIA A100 GPU.
  • The prompt encoder and mask decoder take ~50ms on CPU in the browser using multithreaded SIMD execution.
What data was the model trained on?
  • The model was trained on the SA-1B dataset: the ~11 million licensed, privacy-preserving images and more than 1.1 billion masks described above.
How long does it take to train the model?
  • The model was trained for 3-5 days on 256 A100 GPUs.
Does the model produce mask labels?
  • No, the model predicts object masks only and does not generate labels.
Does the model work on videos?
  • Currently the model only supports images or individual frames from videos.
Where can I find the code?
  • The code and model weights are open source and available at github.com/facebookresearch/segment-anything.
