floyo
Powered by ThinkDiffusion

SAM3 for Video Masking using Points

Create video masks using SAM 3 and point prompts only.


SAM 3 with point prompts lets you build precise, interactive video masks by clicking on objects, then having the model track and segment them through the whole clip.

Overview

SAM 3 is a unified segmentation model that supports both concept prompts (text) and visual prompts (points, boxes, masks). When you use point prompts, it behaves like an advanced “click‑to‑segment” tool: foreground clicks say “include this,” background clicks say “exclude this,” and the model refines the mask accordingly. For video, SAM 3 then propagates that mask across frames with tracking, so the same object stays masked over time.
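The include/exclude semantics of clicks can be shown with a small sketch. This is plain Python, not the SAM 3 API; the helper name and coordinates are illustrative:

```python
# Each click is an (x, y) pixel coordinate plus a label:
# 1 = foreground ("include this"), 0 = background ("exclude this").
def split_clicks(points, labels):
    """Partition clicks into include/exclude sets, mirroring how
    SAM-style predictors interpret labels. Illustrative helper only."""
    if len(points) != len(labels):
        raise ValueError("each point needs exactly one label")
    include = [p for p, lbl in zip(points, labels) if lbl == 1]
    exclude = [p for p, lbl in zip(points, labels) if lbl == 0]
    return include, exclude

# Two positive clicks on the target object, one negative click on the background.
inc, exc = split_clicks([(210, 140), (225, 160), (40, 50)], [1, 1, 0])
```

Adding more clicks to either list is how you iteratively refine the mask without touching any text prompt.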

How point‑based video masking works

  • You load a video and select one or more frames (often the first frame or a key frame) where you click on the target object.

  • You pass those point coordinates and labels (1 = foreground, 0 = background) to SAM 3’s video predictor.

  • The model generates masks for the clicked object on that frame, and then tracks and updates those masks across the rest of the video, producing a mask (and ID) per frame.

  • You threshold or directly export these masks as per‑frame alpha mattes for compositing, background edits, or feeding into other video models.
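The final export step above can be sketched in plain Python. This is not the SAM 3 API: it only shows the thresholding idea, and a real pipeline would apply it (typically with numpy) to the per-frame soft masks the video predictor returns:

```python
def mask_to_alpha(soft_mask, threshold=0.5):
    """Threshold one frame's soft mask (values in [0, 1]) into an
    8-bit alpha matte: 255 inside the object, 0 outside."""
    return [[255 if v > threshold else 0 for v in row] for row in soft_mask]

# A toy 2x2 soft mask for a single frame.
frame_mask = [[0.9, 0.2],
              [0.7, 0.4]]
alpha = mask_to_alpha(frame_mask)  # [[255, 0], [255, 0]]
```

Repeating this per frame yields the per-frame alpha mattes used for compositing or background edits.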

Why use points instead of only text

  • Points give pixel‑accurate control over which instance to track when text like “car” or “person” matches multiple objects in the scene.

  • Positive and negative clicks let you quickly refine the mask (add missing regions, remove stray areas) without rewriting prompts or re‑running the whole model.

  • For difficult or unusual objects where text is ambiguous, a single click can be more reliable than open‑vocabulary detection.

Typical use cases

  • Isolating a specific character, prop, or vehicle in a crowded scene by clicking on it and tracking it through the clip.

  • Creating clean masks for VFX tasks like background replacement, localized color grading, or stylizing only the subject.

  • Combining text and points: use text to find all “people,” then point‑click to refine or pick just one to mask and track.
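For the background-replacement use case, the per-frame alpha mattes plug into a standard over-composite. A grayscale toy sketch (illustrative, not part of the workflow's nodes):

```python
def composite(fg, bg, alpha):
    """Per-pixel over-composite for one grayscale frame: keep the
    subject where alpha is 255, show the new background where it is 0."""
    out = []
    for fg_row, bg_row, a_row in zip(fg, bg, alpha):
        out.append([(a / 255) * f + (1 - a / 255) * b
                    for f, b, a in zip(fg_row, bg_row, a_row)])
    return out

subject = [[200, 200], [200, 200]]   # original frame
new_bg  = [[10, 10], [10, 10]]       # replacement background
alpha   = [[255, 0], [0, 255]]       # matte from the tracked mask
result = composite(subject, new_bg, alpha)  # [[200.0, 10.0], [10.0, 200.0]]
```

In the actual workflow this role is played by downstream compositing nodes fed with the SAM 3 masks.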


Nodes & Models

WorkflowGraphics
MaskPreview
MaskToImage
LoadSAM3Model
SAM3PointCollector
SAM3VideoSegmentation
SAM3Propagate
SAM3VideoOutput
VHS_LoadVideo
VHS_VideoInfo
VHS_VideoCombine
