SAM3 for Video Masking using Points
Create a video mask using SAM3 and point prompts only.
SAM3
Video2Video
Video Masking
SAM 3 with point prompts lets you build precise, interactive video masks by clicking on objects, then having the model track and segment them through the whole clip.
Overview
SAM 3 is a unified segmentation model that supports both concept prompts (text) and visual prompts (points, boxes, masks). When you use point prompts, it behaves like an advanced “click‑to‑segment” tool: foreground clicks say “include this,” background clicks say “exclude this,” and the model refines the mask accordingly. For video, SAM 3 then propagates that mask across frames with tracking, so the same object stays masked over time.
How point‑based video masking works
You load a video and select one or more frames (often the first frame or a key frame) where you click on the target object.
You pass those point coordinates and labels (1 = foreground, 0 = background) to SAM 3’s video predictor.
The model generates masks for the clicked object on that frame, and then tracks and updates those masks across the rest of the video, producing a mask (and ID) per frame.
You threshold or directly export these masks as per‑frame alpha mattes for compositing, background edits, or feeding into other video models.
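The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SAM 3's actual API: `PointPrompt`, `build_prompt`, and `logits_to_alpha` are hypothetical names, and each per-frame mask is assumed to arrive as a 2D grid of logits where values above the threshold mean foreground.

```python
from dataclasses import dataclass
from typing import List, Tuple

FOREGROUND = 1  # label for "include this" clicks
BACKGROUND = 0  # label for "exclude this" clicks


@dataclass
class PointPrompt:
    """Hypothetical container for the clicks placed on one frame."""
    frame_index: int
    coords: List[Tuple[int, int]]  # (x, y) pixel positions
    labels: List[int]              # 1 = foreground, 0 = background


def build_prompt(frame_index, clicks):
    """Split (x, y, label) clicks into parallel coordinate and label lists."""
    coords, labels = [], []
    for x, y, label in clicks:
        if label not in (FOREGROUND, BACKGROUND):
            raise ValueError(f"label must be 0 or 1, got {label}")
        coords.append((x, y))
        labels.append(label)
    return PointPrompt(frame_index, coords, labels)


def logits_to_alpha(mask_logits, threshold=0.0):
    """Threshold one frame of mask logits into an 8-bit alpha matte."""
    return [[255 if value > threshold else 0 for value in row]
            for row in mask_logits]


# One foreground click on the subject, one background click to exclude clutter.
prompt = build_prompt(0, [(120, 80, FOREGROUND), (30, 40, BACKGROUND)])

# Assumed model output for a tiny 2x2 frame (illustrative values only).
alpha = logits_to_alpha([[-2.5, 0.7], [1.3, -0.1]])
```

In the workflow itself, the point collection and propagation are handled by the SAM3PointCollector, SAM3VideoSegmentation, and SAM3Propagate nodes; the sketch only shows the shape of the data that flows between those stages.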
Why use points instead of only text
Points let you choose exactly which instance to track when text like “car” or “person” matches multiple objects in the scene.
Positive and negative clicks let you quickly refine the mask (add missing regions, remove stray areas) without rewriting prompts or re‑running the whole model.
For difficult or unusual objects where text is ambiguous, a single click can be more reliable than open‑vocabulary detection.
Typical use cases
Isolating a specific character, prop, or vehicle in a crowded scene by clicking on it and tracking it through the clip.
Creating clean masks for VFX tasks like background replacement, localized color grading, or stylizing only the subject.
Combining text and points: use text to find all “people,” then point‑click to refine or pick just one to mask and track.
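The combined text-and-points case can be illustrated with a small selection routine. This is a sketch under assumptions, not how SAM 3 resolves prompts internally: it assumes a text prompt has already produced one binary mask per candidate instance, keyed by object ID, and the helpers `mask_contains` and `pick_instance` are hypothetical.

```python
def mask_contains(mask, point):
    """Return True if the (x, y) point lies on a foreground pixel of the mask."""
    x, y = point
    return 0 <= y < len(mask) and 0 <= x < len(mask[0]) and bool(mask[y][x])


def pick_instance(masks_by_id, positive_points, negative_points=()):
    """Return the first object ID whose mask covers every positive click
    and none of the negative clicks, or None if no candidate qualifies."""
    if not positive_points:
        raise ValueError("at least one positive point is required")
    for obj_id, mask in masks_by_id.items():
        covers_all = all(mask_contains(mask, p) for p in positive_points)
        covers_none = not any(mask_contains(mask, p) for p in negative_points)
        if covers_all and covers_none:
            return obj_id
    return None


# Two candidate "people" on a 4x2 frame: ID 1 on the left, ID 2 on the right.
candidates = {
    1: [[1, 1, 0, 0],
        [1, 1, 0, 0]],
    2: [[0, 0, 1, 1],
        [0, 0, 1, 1]],
}

# Click on the right-hand person, and mark the left-hand one as background.
selected = pick_instance(candidates, positive_points=[(3, 0)],
                         negative_points=[(0, 1)])
```

Here `selected` is the ID of the right-hand instance, which would then be the only object passed on for masking and tracking.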
Nodes & Models
WorkflowGraphics
MaskPreview
MaskToImage
LoadSAM3Model
SAM3PointCollector
SAM3VideoSegmentation
SAM3Propagate
SAM3VideoOutput
VHS_LoadVideo
VHS_VideoInfo
VHS_VideoCombine




