LTX 2.3 - Audio to Video
Generate video from an audio track and a reference image with LTX 2.3
animation
image to video
lipsync
ltx 2
video generation
Nodes & Models
LTX23AudioToVideo_floyo
VideoToFrames
LoadAudio
LoadImage
WorkflowGraphics
VHS_VideoCombine
LTX 2.3 Audio to Video takes an audio file and a reference image and generates a video where the visual content moves in response to the sound.
Upload an audio track and a still image. LTX 2.3 animates the image in sync with the audio, using the rhythm, pace, and energy of the sound to drive the motion. Add a short prompt to describe the subject or scene. Adjust the guidance scale to control how closely the output follows your prompt versus the audio signal.
Audio in. Animated video out.
How do you use LTX 2.3 audio to video?
Upload an audio file and a reference image. Write a short prompt describing the subject. LTX 2.3 generates a video that animates the image in sync with the audio track. Use the guidance scale to balance prompt influence against audio-driven motion.
Audio: The sound that drives the animation. LTX 2.3 reads the rhythm, energy, and pacing of the track to generate motion that responds to it. Music with clear beats, speech with natural cadence, and ambient audio with distinct texture all give the model usable motion cues. The cleaner and more structured the audio, the more coherent the motion sync.
Reference image: The visual starting point. LTX 2.3 animates this image in response to the audio. A portrait, a person, a scene, or an object all work. The image sets the appearance and character of the video. The audio sets how it moves.
Prompt: Describe the subject and what they are doing. Keep it short and visual: "Elderly man teaching mathematics", "Singer performing on stage", "Dancer moving to music." The prompt anchors the model to a subject and activity. It does not need to describe the motion in detail. That comes from the audio.
Guidance scale: Controls how closely the output follows your prompt versus the audio. Higher values push the output toward the prompt description. Lower values give the audio more influence over what happens visually. Default is 5. Try 3 to 4 for more audio-reactive motion, 6 to 7 if the subject needs to stay more recognizably on-prompt.
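If you keep your settings in a script or notes file, the four inputs and the guidance scale ranges above could be organized along these lines. This is only a minimal sketch: the class, function, and preset names are hypothetical and are not part of LTX 2.3 or of any node in this workflow. The default of 5 comes from the text above; 3.5 and 6.5 are simply assumed midpoints of the suggested 3 to 4 and 6 to 7 ranges.

```python
from dataclasses import dataclass

# Hypothetical preset names. Only the default (5) is stated in the text;
# 3.5 and 6.5 are assumed midpoints of the suggested 3-4 and 6-7 ranges.
GUIDANCE_PRESETS = {
    "audio_reactive": 3.5,  # lower value: audio has more influence on the motion
    "default": 5.0,         # stated default
    "on_prompt": 6.5,       # higher value: output stays closer to the prompt
}


@dataclass
class AudioToVideoInputs:
    """Hypothetical container for the four inputs described above."""
    audio_path: str
    image_path: str
    prompt: str
    guidance_scale: float = GUIDANCE_PRESETS["default"]


def build_inputs(audio_path, image_path, prompt, preset="default"):
    """Bundle the inputs, picking a guidance scale from a named preset."""
    if preset not in GUIDANCE_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    if not prompt.strip():
        raise ValueError("prompt should briefly describe the subject")
    return AudioToVideoInputs(
        audio_path=audio_path,
        image_path=image_path,
        prompt=prompt.strip(),
        guidance_scale=GUIDANCE_PRESETS[preset],
    )


# Example: a short, visual prompt with more audio-reactive motion.
# The file names are placeholders.
settings = build_inputs(
    "voiceover.wav", "portrait.png", "Singer performing on stage",
    preset="audio_reactive",
)
```

The values would still need to be entered into the workflow's own audio, image, prompt, and guidance scale fields. The sketch only records which settings were chosen and why.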
What is LTX 2.3 audio to video good for?
Animating portraits and scenes to music or speech, generating audio-reactive visual content, and creating talking-head video from a still image and a voiceover track.
The most direct use is animating a still image of a person against a speech or music track. Upload a portrait, add the audio, describe the subject, and get a short clip where the person appears to move in time with the sound.
Music visualizers, talking-head animations, and scene animations tied to voiceover or ambient audio are all within range. The model picks up motion cues from natural audio more reliably than from highly synthetic or heavily processed sound.
For precise lip-sync where phoneme accuracy matters, a dedicated lip-sync workflow will give you more control. LTX 2.3 Audio to Video generates audio-reactive motion at a broader level: rhythm, energy, and expressiveness, rather than frame-accurate mouth matching.