LTX-2.3 Pro - Image to Video with Lipsync
Animate a still image with LTX-2.3 Pro. Build a prompt from four parts (scene, dialog, ambient, style) and get a clip with lip-synced dialog and audio.
Nodes & Models
LTX23ProImageToVideo_floyo
MarkdownNote
LoadImage
StringConstantMultiline
JoinStrings
PreviewAny
Turn a still photo into a short cinematic video with LTX-2.3 Pro. Upload a first frame, write your prompt in four parts (scene, dialog, ambient sound, style), and pick a duration. You get back a 3-, 6-, or 9-second clip with synthesized audio and lip-synced dialog if you want it. An optional end frame lets you interpolate from image A to image B.
How do you prompt LTX-2.3 Pro for image-to-video with lipsync?
Break your prompt into four fragments: CORE (scene, subject, action, camera, lighting), DIALOG (exact spoken line plus voice description), AMBIENT (non-dialog sound like breeze or traffic), and STYLE (look, grain, negative constraints). They get joined into one structured prompt. Set generate_audio to True and LTX-2.3 Pro synthesizes matching audio with lip-synced mouth motion.
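A minimal sketch of that assembly, mirroring the StringConstantMultiline and JoinStrings nodes in this workflow. The fragment text is illustrative, not a required template; a bypassed fragment (for example DIALOG on a silent clip) is simply left out of the join.

```python
# Assemble the four prompt fragments into one structured prompt.
# The fragment text below is only an example.
core = (
    "Scene: outdoor selfie video. Subject: maintain exact appearance. "
    "Action: smiles, leans closer, adjusts hair. "
    "Camera: handheld, micro-zooms. Lighting: bright natural sun."
)
dialog = 'saying: "Hey, you made it!" Voice: warm, mid register, relaxed pace.'
ambient = "Ambient: light breeze, distant street noise."
style = "Style: natural skin texture, film grain, no text overlays, no CGI look."

# Bypassed fragments are left out of the list entirely.
fragments = [core, dialog, ambient, style]
prompt = "\n".join(f for f in fragments if f)
print(prompt)
```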
First frame (required) The image you want to animate. Anything works: portraits, products, characters, scenes. Clear framing and good lighting in the reference carry through to the video.
End frame (optional) Enable with Ctrl+B to do A to B interpolation. Want a subject to turn their head, change expression, or move across a room? Give a second image as the end state and LTX-2.3 Pro fills in the motion between.
CORE prompt The backbone. Describe scene, subject identity, action beats, camera movement, and lighting. LTX-2 was trained on structured, verbose prompts. Short poetic prompts underperform. Write it like a shot list: "Scene: outdoor selfie video. Subject: maintain exact appearance. Action: smiles, leans closer, adjusts hair. Camera: handheld, micro-zooms. Lighting: bright natural sun."
DIALOG (enables lipsync) Write the line verbatim in quotes: saying: "Your line here." Match dialog length to clip duration. Roughly 15 syllables per 3 seconds at normal pace. Describe voice register, pace, and emotion. Mention mouth behavior if precision matters ("jaw opens on vowels, lips round on O sounds"). Bypass this fragment with Ctrl+B if you want a silent clip.
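If you want a quick sanity check that a dialog line fits the clip, here is a rough sketch based on the ~15 syllables per 3 seconds guideline above. The vowel-group syllable counter is a crude heuristic, not part of the workflow.

```python
import re

def estimate_syllables(line: str) -> int:
    # Crude heuristic: count each run of consecutive vowels as one syllable.
    return len(re.findall(r"[aeiouy]+", line.lower()))

def fits_duration(line: str, duration_s: int) -> bool:
    # ~15 syllables per 3 seconds at a normal speaking pace.
    budget = 5 * duration_s
    return estimate_syllables(line) <= budget

line = "Hey, you made it! Come take a look at this."
print(estimate_syllables(line), fits_duration(line, 3))
```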
AMBIENT audio Room tone, wind, traffic, UI beeps, light music. Keep it to one or two elements. Stacking multiple audio layers garbles the output. If DIALOG is doing heavy lifting, keep AMBIENT minimal.
STYLE fragment Look-and-feel anchors plus negative constraints. "Natural skin texture, pore detail, film grain, no text overlays, no CGI look." This is where the things you do not want to see go. Bypass it if your CORE already nails the aesthetic.
Duration (default 6) Options are 3, 6, or 9 seconds. Start with 3 for iteration. A 3-second render has an API turnaround of around 30 seconds, so you can A/B test prompts fast. Move to 6 or 9 once the short version looks right.
Resolution (default 1080p) 1080p is the sweet spot for social and web. Drop lower if you are testing motion and want faster runs.
Aspect ratio (default auto) Auto inherits from your first frame. Override if you want to force vertical, square, or widescreen regardless of the input image.
FPS (default 25) 25 reads as cinematic. 30 reads as standard video. Leave it on 25 unless you have a specific delivery spec.
generate_audio (default True) On for final renders and lipsync. Off while you dial in motion. Off is faster and cheaper, so use it for every iteration pass and flip it back on at the end.
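Putting the settings together, here is a hypothetical parameter sketch for an iteration pass. The field names are assumptions for illustration; the values mirror the defaults and advice above (duration, resolution, aspect ratio, fps, generate_audio).

```python
import json

payload = {
    "prompt": "Scene: ... Dialog: ... Ambient: ... Style: ...",  # the joined fragments
    "first_frame": "portrait.png",   # required reference image
    "end_frame": None,               # optional second image for A-to-B interpolation
    "duration": 3,                   # 3 for iteration, 6 or 9 for the final render
    "resolution": "1080p",
    "aspect_ratio": "auto",          # inherit from the first frame
    "fps": 25,
    "generate_audio": False,         # off while dialing in motion, on at the end
}
print(json.dumps(payload, indent=2))
```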
What is LTX-2.3 Pro image-to-video good for?
Social reels from a photo, talking-head clips with synthesized dialog, product shots that come to life, character animation tests, and any project where you need a still to move with sound. Strongest on 3 to 9 second clips with clear motion beats and a defined subject. Structured prompts beat clever ones.
Best for creators making influencer-style reels where a photo needs to speak and react. Marketers who want product stills animated with ambient sound. Pre-viz work where you need a character performance before booking a shoot. A to B interpolation scenes where a subject moves between two posed references.
Less useful for long-form narrative video, complex multi-character dialog, or anything that needs precise frame-level control. If your subject identity drifts, strengthen the "maintain appearance exactly as shown in reference" line in CORE. If the face goes frozen, add "subtle breathing, occasional blink, micro-expressions" to the same fragment.
FAQ
How long can LTX-2.3 Pro image-to-video clips be? Duration options are 3, 6, or 9 seconds per run. For iteration, stick with 3 seconds because the API turnaround is around 30 seconds. Once motion and prompt are dialed in, bump to 6 or 9 for the final render. For longer sequences, generate multiple clips and stitch them in an editor.
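If you would rather stitch on the command line than in an editor, a minimal sketch with ffmpeg's concat demuxer works, assuming ffmpeg is installed and the clips share the same codec, resolution, and fps (for example 1080p at 25 fps from this workflow). The filenames are hypothetical.

```python
import subprocess
from pathlib import Path

clips = ["clip_01.mp4", "clip_02.mp4", "clip_03.mp4"]  # generated clips, in order
Path("clips.txt").write_text("".join(f"file '{c}'\n" for c in clips))

# Concatenate without re-encoding; requires matching codecs across clips.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "clips.txt",
     "-c", "copy", "stitched.mp4"],
    check=True,
)
```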
Does LTX-2.3 Pro support lipsync? Yes. Write your dialog line in quotes inside the DIALOG fragment, describe voice qualities (register, pace, emotion), and set generate_audio to True. LTX-2.3 Pro synthesizes matching audio with lip-synced mouth motion. There is no separate audio input slot, so dialog is driven entirely by the prompt description.
How do I stop the face from looking frozen in LTX-2.3 Pro? Add subtle motion cues to your CORE prompt: "subtle breathing, occasional blink, micro-expressions, slight head tilt." Frozen faces usually come from prompts that describe a pose without describing life. Name the small movements you expect and the model delivers them.
How do I get A to B interpolation with LTX-2.3 Pro? Enable the optional end frame node with Ctrl+B and load a second image. LTX-2.3 Pro interpolates motion between the first frame and the end frame. Use this for head turns, expression changes, or any scene where you have two posed references and want the motion between them.
How do you run LTX-2.3 Pro online? You can run LTX-2.3 Pro online through Floyo. No installation, no setup. Open the workflow in your browser, upload your first frame, fill in the prompt fragments, hit run.