SAM3 for Video Masking using Text
Create video masks using SAM 3 and text prompts only.
SAM3
Video2Video
Video Masking
SAM 3 lets you create video masks just by describing what you want to segment (for example “red car”, “main person”, “blue backpack”), then tracks all matching instances across frames.
Overview
SAM 3 is a promptable concept segmentation model: you give it short noun‑phrase text prompts and it detects, segments, and tracks every instance of that concept in a video. The video predictor streams through frames with a memory mechanism, so the same objects keep consistent IDs over time, even with occlusion or re‑appearance.
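To make the consistent-ID behavior concrete, here is a purely illustrative sketch. The dict-of-sets shape and the helper function are assumptions for illustration, not the actual SAM 3 output format; the point is only that the same instance keeps the same ID across frames, even through occlusion.

```python
# Illustrative sketch only, NOT the real SAM 3 API: per-frame tracking
# results keyed by a stable instance ID. Instance 2 is occluded in
# frame 1 but re-appears in frame 2 under the same ID, which is what
# lets downstream code keep instances linked across the video.
per_frame_ids = [
    {1, 2},  # frame 0: two instances of the prompted concept
    {1},     # frame 1: instance 2 is occluded
    {1, 2},  # frame 2: instance 2 re-appears with its original ID
]

def frames_containing(obj_id, per_frame_ids):
    """Frame indices in which a tracked instance is visible."""
    return [i for i, ids in enumerate(per_frame_ids) if obj_id in ids]

visible = frames_containing(2, per_frame_ids)  # frames where instance 2 appears
```

Because the ID is stable, effects applied to "instance 2" automatically skip the frames where it is hidden and resume when it re-appears.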
How text‑based video masking works
You call a SAM 3 video predictor with a video source and a list of text prompts, for example ["person", "bicycle"]. For each frame, SAM 3 returns:
Segmentation masks for all instances matching each text concept.
Tracking IDs so instances stay linked across frames.
You can then convert these masks into binary alpha mattes or colored overlays and export them as per‑frame masks or a mask video for compositing.
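The mask-to-matte conversion described above can be sketched with NumPy. The helper name and its arguments are hypothetical, and the tiny frame and masks stand in for a real video frame and SAM 3's per-instance output:

```python
import numpy as np

def masks_to_matte_and_overlay(frame, masks, colors, alpha=0.5):
    """Combine per-instance boolean masks into a single binary alpha
    matte (union of all instances) and a colored overlay for compositing.

    frame:  (H, W, 3) uint8 RGB image
    masks:  list of (H, W) boolean arrays, one per tracked instance
    colors: list of (R, G, B) uint8 colors, one per instance
    """
    h, w = frame.shape[:2]
    matte = np.zeros((h, w), dtype=np.uint8)
    overlay = frame.astype(np.float32)
    for mask, color in zip(masks, colors):
        matte[mask] = 255  # union of all matching instances
        # blend the instance color onto the frame where the mask is set
        overlay[mask] = (1 - alpha) * overlay[mask] + alpha * np.asarray(color, np.float32)
    return matte, overlay.astype(np.uint8)

# Example: two hypothetical "person" instances on a tiny 4x4 black frame
frame = np.zeros((4, 4, 3), dtype=np.uint8)
m1 = np.zeros((4, 4), dtype=bool); m1[0, 0] = True
m2 = np.zeros((4, 4), dtype=bool); m2[3, 3] = True
matte, overlay = masks_to_matte_and_overlay(frame, [m1, m2], [(255, 0, 0), (0, 255, 0)])
```

Writing each per-frame matte to disk (or stacking them into a mask video) then gives a standard input for compositing tools.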
Why this is useful for video masking
Open‑vocabulary: You don’t need a fixed label set—any short phrase like “yellow school bus” or “striped umbrella” can be used as a concept.
All instances, not one: Unlike earlier models, SAM 3 segments every object that matches your text, not just a single instance.
Less manual work: You avoid drawing boxes on every object or every frame; text prompts plus occasional point refinements are usually enough.
Typical use cases
Creating masks for all people, cars, or specific props in a scene to apply localized effects or color grading.
Automatically masking branded items (“logos”, “bottles”) for protection, replacement, or analytics.
Pre‑masking objects for downstream tools (virtual try‑on, character edits, background swaps) without manual rotoscoping.
Read more
Nodes & Models
VHS_LoadVideo
VHS_VideoInfo
VHS_VideoCombine
WorkflowGraphics
ImageBlend
MaskPreview
MaskToImage
LoadSAM3Model
SAM3VideoSegmentation
SAM3Propagate
SAM3VideoOutput