SAM 3: Meta's Segment Anything Now Understands Text
Meta’s Segment Anything Model changed computer vision when it launched: click on an object, get a mask. SAM 2 added video tracking. Now SAM 3 adds the capability everyone wanted: text prompts.
Instead of clicking on objects, you can now describe what you want to segment in plain English.
What’s New
SAM 3 (“Segment Anything with Concepts”) introduces:
Text prompts — Describe objects with short noun phrases:
- “all the cars”
- “red flowers”
- “people wearing hats”
Exemplar prompts — Draw a box around one example, and SAM 3 finds all similar objects across the image or video (a sketch follows this list).
Massive vocabulary — Trained on 4 million unique concepts; the accompanying SA-Co benchmark covers 270K concepts, 50x more than existing segmentation datasets.
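To make the exemplar idea concrete, here's a minimal sketch reusing the Python API from the Getting Started section below. The exemplar_boxes argument and its (x1, y1, x2, y2) pixel format are my assumptions for illustration, not confirmed parts of the SAM 3 API:
from sam3 import SAM3Predictor

predictor = SAM3Predictor.from_pretrained("sam3-large")
image = predictor.load_image("street.jpg")

# One box around a single example object (assumed x1, y1, x2, y2 pixel coordinates);
# the model should return masks for every similar object it finds in the image.
masks = predictor.predict(
    image,
    exemplar_boxes=[(120, 340, 260, 480)],  # hypothetical argument name
)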
The Numbers
SAM 3 achieves 75-80% of human performance on Meta’s new SA-Co benchmark for open-vocabulary segmentation. That’s remarkable for a task that previously required either:
- Fixed label sets (COCO’s 80 classes)
- Manual point/box prompts for each object
Getting Started
pip install sam3
Basic text-prompted segmentation:
from sam3 import SAM3Predictor
predictor = SAM3Predictor.from_pretrained("sam3-large")
# Load image
image = predictor.load_image("photo.jpg")
# Segment by text
masks = predictor.predict(
    image,
    text_prompt="dogs"
)
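A hedged sketch of working with the result, assuming predict returns one boolean NumPy-style array per detected instance (the exact return format may differ):
import numpy as np

# Count detected instances and estimate how much of the image each one covers
# (assumes each entry in masks is a boolean array over the image's pixels).
print(f"found {len(masks)} dogs")
for i, mask in enumerate(masks):
    coverage = np.asarray(mask).mean()
    print(f"  dog {i}: {coverage:.1%} of the image")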
For video:
# Track all instances of a concept through video
video_masks = predictor.predict_video(
    video_path="clip.mp4",
    text_prompt="bicycles"
)
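And a hedged sketch of walking the video result, assuming it maps frame indices to per-object masks (the actual return structure may differ):
# Assumed structure: {frame_index: {object_id: mask}} -- a guess, not documented behavior.
for frame_idx, objects in video_masks.items():
    print(f"frame {frame_idx}: {len(objects)} bicycles tracked")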
Architecture
SAM 3 unifies:
- Image segmentation (like SAM 1)
- Video tracking (like SAM 2)
- Open-vocabulary detection (like CLIP-based detectors)
All in one model that handles text, points, boxes, and masks as prompts interchangeably.
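To make "interchangeably" concrete, here's a sketch of the same query issued three ways. Only text_prompt appears in the snippets above; the point_prompt and box_prompt keyword names are hypothetical, used purely for illustration:
from sam3 import SAM3Predictor

predictor = SAM3Predictor.from_pretrained("sam3-large")
image = predictor.load_image("street.jpg")

# Same predictor, same target, three prompt types.
masks_from_text  = predictor.predict(image, text_prompt="traffic lights")
masks_from_point = predictor.predict(image, point_prompt=(960, 120))          # hypothetical kwarg
masks_from_box   = predictor.predict(image, box_prompt=(900, 80, 1020, 220))  # hypothetical kwarg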
Why This Matters
Before SAM 3, open-vocabulary segmentation required stitching together multiple models—a detector, a segmenter, maybe a tracker. Results were inconsistent and slow.
SAM 3 is end-to-end: describe what you want, get masks. This enables:
- Zero-shot labeling — Annotate datasets by describing objects (see the sketch after this list)
- Natural language video editing — “Remove all the logos”
- Accessible CV tools — Non-experts can segment objects by describing them, no point-and-box prompting required
- Agent vision — AI agents can now “see” any concept they can describe
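As a concrete example of the zero-shot labeling case, here's a minimal auto-labeling sketch built on the earlier snippets. The folder layout, the .npy output format, and the assumption that the returned masks stack cleanly into an array are my own choices, not anything the model dictates:
from pathlib import Path

import numpy as np
from sam3 import SAM3Predictor

predictor = SAM3Predictor.from_pretrained("sam3-large")

# Walk a folder of unlabeled images and save instance masks for one described concept.
for path in Path("unlabeled_images").glob("*.jpg"):
    image = predictor.load_image(str(path))
    masks = predictor.predict(image, text_prompt="shipping containers")
    # One array per detected instance, saved next to the source image (mask format assumed).
    np.save(path.with_name(path.stem + "_masks.npy"), np.array(masks))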
My Take
This is the vision model I’ve been waiting for. SAM 1 was impressive but required knowing where objects were. SAM 2 added temporal consistency. SAM 3 finally closes the loop—you can describe what you want in natural language.
For anyone building vision applications, this should be your default starting point. The combination of text understanding + video tracking + massive concept vocabulary makes it genuinely general-purpose.
Links: