InfiniteTalk - Lip Sync Any Video to Any Audio
vid2vid
wan
Nodes & Models
AudioEncoderLoader
wav2vec2-chinese-base_fp16.safetensors
LoadAudio
ModelPatchLoader
Wan2_1-InfiniTetalk-Single_fp16.safetensors
WorkflowGraphics
LoraLoaderModelOnly
Wan21_I2V_14B_lightx2v_cfg_step_distill_lora_rank64-wan-2-2-image-to-video-0ar6Pfb4.safetensors
CLIPTextEncode
PainterAudioCut
ImageFromBatch
PathchSageAttentionKJ
ConditioningZeroOut
AudioEncoderEncode
CLIPVisionEncode
PainterAV2V
KSamplerAdvanced
VAEDecode
ImageConcatMulti
VHS_LoadVideo
VHS_VideoCombine
VHS_LoadVideo
VHS_VideoCombine
VHS_LoadVideo
VHS_VideoCombine
easy cleanGpuUsed
InfiniteTalk takes a silent video and an audio track and syncs the speaker to the audio.
Upload any video where a person is talking (or not). Drop in the audio you want them to speak. InfiniteTalk rewrites the mouth, head motion, and facial expressions to match the new audio while keeping the person's identity, wardrobe, and background stable. It works for dubbing, language swaps, ADR, and turning silent footage into a talking clip.
Built on Wan 2.1 I2V as the visual backbone, with InfiniteTalk driving the lip and body sync. Outputs at 832x480 by default.
How do you use InfiniteTalk for audio-driven lip sync?
Load your source video into VHS_LoadVideo. Load your audio file into LoadAudio. Keep the prompt as "person talking" or describe the scene. Hit Run. InfiniteTalk reads the audio, regenerates the speaker's mouth shapes and head motion to match, and preserves identity and background from the source video.
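If you're running ComfyUI locally rather than in the browser, you can also queue this workflow from a script through ComfyUI's HTTP API. A minimal sketch, assuming you've exported the graph with Save (API Format) as infinitetalk_api.json and the server is on the default port; the node IDs in the comments are hypothetical, so look up the real ones in your export:

```python
import json
import urllib.request

# Assumes a local ComfyUI server on the default port and a workflow
# exported via "Save (API Format)". Node IDs below are hypothetical --
# find the real ones in your exported JSON.
with open("infinitetalk_api.json") as f:
    workflow = json.load(f)

# Example overrides (hypothetical node IDs "12" and "31"):
# workflow["12"]["inputs"]["video"] = "my_source.mp4"   # VHS_LoadVideo
# workflow["31"]["inputs"]["audio"] = "my_dub.wav"      # LoadAudio

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # the server returns the queued prompt_id
```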
Input video The silent or spoken source clip. The speaker's face and body should be visible and reasonably well-lit. Default settings run at 832x480 with a 121-frame cap at 25fps, which is around 5 seconds. Bump frame_load_cap higher for longer clips. The load settings feed directly into what the sampler processes, so changing resolution here also changes generation resolution.
Input audio Drop your audio file (mp3 or wav) into LoadAudio. This drives the lip sync. Clean audio with clear speech gives the best results. Background noise and music can confuse the sync. The audio length also sets how long your final video will be, so a 10-second audio clip produces a 10-second generation (assuming frame_load_cap is set high enough).
Positive prompt "person talking" is the default and works for most cases. Describe the scene or add style cues if you want to steer the generation. The prompt has less impact here than in a pure text-to-video workflow, because InfiniteTalk is driven primarily by the video and audio inputs. Keep it short.
Resolution (PainterAV2V) 832x480 is the default. Want sharper output? Push to 720p (1280x720). The catch: higher resolution means slower renders and more VRAM. Match the aspect ratio to your source video or you'll get letterboxing.
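If you'd rather compute matching dimensions than eyeball them, here is a minimal sketch. fit_dims is a hypothetical helper, not part of the workflow, and snapping to multiples of 16 is an assumption (a common latent-size constraint for video models):

```python
# Pick PainterAV2V dimensions that match the source clip's aspect ratio.
# Snapping to multiples of 16 is an assumption, not a documented requirement.
def fit_dims(src_w: int, src_h: int, target_h: int = 480, snap: int = 16) -> tuple[int, int]:
    w = round(src_w / src_h * target_h / snap) * snap
    h = round(target_h / snap) * snap
    return w, h

print(fit_dims(1920, 1080))        # -> (848, 480), close to the 832x480 default
print(fit_dims(1920, 1080, 720))   # -> (1280, 720) for the 720p option
```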
Frame count Default is 121 frames at 25fps, around 5 seconds. For longer dubs, raise frame_load_cap in the video loader. InfiniteTalk is designed for unlimited length generation through its sparse-frame dubbing architecture, so long clips work fine. They take proportionally longer to render.
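The cap you need is just audio seconds times fps, rounded up. A quick sketch using the standard-library wave module (WAV input assumed; MP3 would need a decoder):

```python
import math
import wave

# Reads a WAV file's duration and returns the frame_load_cap needed
# to cover it at the workflow's frame rate (25 fps here).
def frames_needed(wav_path: str, fps: int = 25) -> int:
    with wave.open(wav_path, "rb") as w:
        seconds = w.getnframes() / w.getframerate()
    return math.ceil(seconds * fps)

# A 10-second dub needs a frame_load_cap of at least 250 at 25 fps.
print(frames_needed("my_dub.wav"))
```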
Sampling (KSamplerAdvanced) 3 steps with dpmpp_sde on the normal scheduler. Defaults are fast because the lightx2v CFG step distill LoRA cuts step count significantly. Don't raise steps unless you see artifacts. The LoRA is strength 1 by default and expects this low step count.
Seed Set to randomize by default. Got a sync you like and want to iterate on the prompt? Note the seed and flip randomize off.
What is InfiniteTalk good for?
InfiniteTalk is built for audio-driven video editing: dubbing a video into a new language, replacing dialogue in post without a reshoot (ADR), making a silent portrait talk, or retiming existing speech. It fits anywhere you need a person in a video to match new audio with natural lip, head, and body sync.
Good fit: language dubs for international releases, ADR replacement in indie film production, turning a portrait photo or silent B-roll into a talking avatar, creator content where you want the same face speaking multiple scripts, podcast visuals with an animated host. The sparse-frame architecture means you can run long clips without the talking head drifting.
Less good fit: videos where the speaker's face is heavily occluded or off-frame, clips with multiple speakers that need individual lip sync (the single-person variant used here handles one speaker), or music performances where you need accurate tongue and cheek movement matching complex vocals. For heavy singing content, expect the lip sync to be good but not perfect.
The trade-off: InfiniteTalk preserves the original video's body motion and background. If the source footage has the speaker turning away from camera or moving out of frame, the sync in those moments will be approximate. Pick source video where the face stays reasonably visible.
FAQ
What does InfiniteTalk do? InfiniteTalk is an audio-driven video generation model that syncs the speaker in a video to a new audio track. It rewrites lip shapes, head motion, and facial expression to match the audio while keeping the person's identity, wardrobe, and background stable. Useful for dubbing, ADR, and turning silent footage into talking clips.
What audio formats work with InfiniteTalk? MP3 and WAV both work. Clean speech audio with low background noise gives the best sync. Music with vocals works for singing content but gets less precise than spoken dialogue. The audio length determines the output video length, so make sure your frame_load_cap is set high enough to cover it.
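If your replacement audio is embedded in a video or another container, a common prep step is extracting it as mono 16 kHz WAV with ffmpeg. wav2vec2-family encoders are trained on 16 kHz audio; whether this workflow resamples internally is an assumption on my part, so pre-converting is the safe route:

```python
import subprocess

# Extracts the audio track from a source file and converts it to
# mono 16 kHz WAV -- the sample rate wav2vec2-style encoders expect.
# Requires ffmpeg on PATH.
subprocess.run(
    ["ffmpeg", "-y", "-i", "interview.mp4",
     "-vn",            # drop the video stream
     "-ac", "1",       # mono
     "-ar", "16000",   # 16 kHz
     "my_dub.wav"],
    check=True,
)
```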
How long can an InfiniteTalk video be? The architecture supports unlimited-length generation through sparse-frame dubbing, which chunks the audio and maintains identity across segments. In practice, the frame_load_cap in this workflow defaults to 121 frames, about 5 seconds at 25fps. Raise it to match your audio duration. Longer clips take proportionally longer to render.
What resolution does InfiniteTalk output? Default is 832x480. The workflow supports 720p (1280x720) by changing the PainterAV2V dimensions. Higher resolution means longer render times and more VRAM. Match the aspect ratio to your source video to avoid letterboxing.
Why does this InfiniteTalk workflow use only 3 sampling steps? The workflow includes the lightx2v CFG step distill LoRA for Wan 2.1 I2V, which lets the model converge in around 3 steps instead of the usual 20 to 30. This cuts render time dramatically. Raising the step count won't improve output with this LoRA loaded and may cause artifacts. Leave it at 3 unless you have a reason.
How do you run InfiniteTalk online? You can run InfiniteTalk online through Floyo with no installation and no setup. Open the workflow in your browser, upload your inputs, and hit Run.