
LTX 2.3 Text to Video and Image to Video

Generate video and audio together with LTX 2.3 22B. Switch between text-to-video and image-to-video with one toggle. Separate multimodal guidance keeps video and audio quality tuned independently.


Generates in about -- secs

Nodes & Models

RandomNoise
ManualSigmas
KSamplerSelect
GuiderParameters
LTXAVTextEncoderLoader
gemma_3_12B_it.safetensors
ltx-2.3/ltx-2.3-22b-dev.safetensors
LTXVAudioVAELoader
ltx-2.3/ltx-2.3-22b-dev.safetensors
CheckpointLoaderSimple
ltx-2.3/ltx-2.3-22b-dev.safetensors
PrimitiveBoolean
LoadImage
PrimitiveFloat
PrimitiveInt
WorkflowGraphics
CLIPTextEncode
LoraLoaderModelOnly
ltx-2.3-22b-distilled-lora-384.safetensors
ImageResizeKJv2
EmptyLTXVLatentVideo
LTXVConditioning
LTXVPreprocess
LTXVEmptyLatentAudio
CFGGuider
MultimodalGuider
LTXVImgToVideoConditionOnly
LTXVConcatAVLatent
SamplerCustomAdvanced
LTXVScheduler
LTXVSeparateAVLatent
VAEDecodeTiled
LTXVAudioVAEDecode
CreateVideo
SaveVideo
ClownSampler_Beta
CM_FloatToInt

Upload an image, or skip it and write a prompt alone: LTX 2.3 22B generates video and audio in a single run. A boolean toggle switches between image-to-video and text-to-video mode. No separate audio step is needed.

The pipeline uses a multimodal guider that tunes video and audio quality independently. CFG, spatial guidance, and rescaling are set per modality, so you're not making tradeoffs between the two. Default output is 121 frames at 24fps, 960×544.
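
For quick reference, the defaults described on this page can be collected in one place. The sketch below is an illustrative Python dict; the key names are descriptive placeholders, not actual ComfyUI node fields, but every value is a documented default from this workflow.

```python
# Illustrative summary of this workflow's defaults. Key names are
# descriptive placeholders, not ComfyUI node fields; the values are
# the documented defaults from this page.
LTX_23_DEFAULTS = {
    "text_to_video": True,         # mode toggle: False = image-to-video
    "frames": 121,                 # ~5 s at 24 fps
    "fps": 24,
    "resolution": (960, 544),
    "steps": 15,                   # via the LTX scheduler
    "video_cfg": 3.0,              # prompt adherence for the video stream
    "audio_cfg": 7.0,              # deliberately stronger audio guidance
    "seed": 42,                    # fixed by default
    "image_strength": 0.7,         # image-to-video mode only
    "lora_strengths": (0.2, 0.5),  # distilled LoRA across the two passes
}
```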

How do you use LTX 2.3 with multimodal audio and video generation?

Set the mode toggle, upload an image if you're doing image-to-video, write your prompt, and run. The multimodal guider handles video and audio guidance separately. Most users only need to touch the prompt, the image input, the mode toggle, and the seed.

Mode toggle (bypass) True = text-to-video. False = image-to-video. Default is True. Flip it to False and upload an image to anchor the first frame.
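
A minimal sketch of what the toggle decides, using a hypothetical `select_mode` helper; in the workflow itself this is wired through PrimitiveBoolean and node bypassing.

```python
# Hypothetical helper mirroring the bypass toggle's semantics.
def select_mode(text_to_video: bool = True, image_path: str | None = None) -> str:
    if text_to_video:
        # LoadImage and the image-conditioning path are bypassed.
        return "text-to-video: generating from the prompt alone"
    if image_path is None:
        raise ValueError("image-to-video mode needs an uploaded image")
    return f"image-to-video: first frame anchored to {image_path}"
```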

Input image Only active in image-to-video mode. Upload a still and the model animates from it. Image strength is set to 0.7 — the generated motion stays close to the source frame while allowing natural movement.
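
As a conceptual illustration of why a strength of 0.7 stays close to the source, consider a generic latent interpolation. This is not LTXV's actual conditioning math, only a sketch of the intuition.

```python
import numpy as np

# Generic interpolation sketch, NOT LTXV's actual conditioning math:
# a higher strength weights the anchored frame toward the encoded
# source image, so generated motion departs from it less.
def anchor_first_frame(source_latent: np.ndarray,
                       free_latent: np.ndarray,
                       strength: float = 0.7) -> np.ndarray:
    return strength * source_latent + (1.0 - strength) * free_latent
```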

Positive prompt Describe the scene, action, and any audio you want. The example in the workflow describes a Japanese tea ceremony with specific sounds — bamboo whisk, iron kettle, koto music. The more specific you are about both visual action and sound, the better the multimodal output tracks your intent.

Negative prompt Defaults to excluding video game aesthetics, cartoons, and poor quality. Add anything else you want to keep out of the output.

Steps 15 by default via the LTX scheduler. Enough for clean output with the distilled LoRA active. Increasing steps adds detail at the cost of generation time.

Video CFG 3 by default. Controls how closely the video follows the prompt. Higher values push the model to follow your description more strictly — useful for complex scenes. Too high and motion can get rigid.

Audio CFG 7 by default. Set higher than video deliberately — audio benefits from stronger guidance to produce coherent sound that matches the described scene. Adjust down if audio feels over-produced.
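
Both values feed the standard classifier-free guidance formula, applied once per modality. The sketch below shows the idea; the MultimodalGuider node's real implementation may differ in detail.

```python
import numpy as np

# Standard classifier-free guidance, applied independently per modality.
# Two separate scales let audio guidance run hotter (7.0) without
# stiffening video motion (3.0).
def cfg(uncond: np.ndarray, cond: np.ndarray, scale: float) -> np.ndarray:
    return uncond + scale * (cond - uncond)

def guide(video_uncond, video_cond, audio_uncond, audio_cond,
          video_scale: float = 3.0, audio_scale: float = 7.0):
    return (cfg(video_uncond, video_cond, video_scale),
            cfg(audio_uncond, audio_cond, audio_scale))
```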

Seed 42 by default, fixed. Change it to explore different motion and audio variations. Fix it again once you find something worth refining.

Frame length 121 frames at 24fps — about 5 seconds. Increase for longer clips. Generation time scales with frame count.
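
The frame-count arithmetic is simple. One assumption worth flagging: LTXV latent video commonly expects counts of the form 8n + 1 (121 = 8 × 15 + 1), so verify the node's accepted values before choosing arbitrary lengths.

```python
# Clip length arithmetic for the defaults: 121 frames / 24 fps ≈ 5.04 s.
def clip_seconds(frames: int = 121, fps: int = 24) -> float:
    return frames / fps

# Assumes frame counts of the form 8n + 1 (121 = 8 * 15 + 1), which is
# common for LTXV latents; check the node before relying on this.
def frames_for(seconds: float, fps: int = 24) -> int:
    n = round((seconds * fps - 1) / 8)
    return 8 * n + 1

print(clip_seconds())    # 5.041666...
print(frames_for(10.0))  # 241
```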

What is LTX 2.3 multimodal video generation good for?

LTX 2.3 with multimodal guidance is for scenes where sound is part of the brief, not an afterthought. The separate video and audio guiders mean you can push audio quality without destabilizing the video, or vice versa. Image-to-video mode adds a starting frame anchor for scenes where composition matters.

Good scenarios: scenes with specific diegetic sound — music, ambient environment, dialogue, rhythmic action. Image-to-video for product or character shots where the first frame needs to be exact. Text-to-video for scene generation where you're describing both what's seen and what's heard.

The distilled LoRA runs at different strengths across the two passes (0.2 and 0.5), which balances speed and output quality. If you're iterating on prompts, the lower-strength pass gets you results faster. Once the motion feels right, the pipeline uses the higher-strength pass to refine.
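
A sketch of that two-pass structure, with `run_pass` as a hypothetical stand-in for the whole sampling pipeline; only the strengths 0.2 and 0.5 come from the workflow.

```python
# Hypothetical sketch of the two-pass flow described above. `run_pass`
# stands in for the sampling pipeline; only the LoRA strengths
# (0.2, then 0.5) are taken from the workflow itself.
def generate(prompt: str, run_pass):
    draft = run_pass(prompt, lora_strength=0.2)             # fast iteration pass
    return run_pass(prompt, lora_strength=0.5, init=draft)  # refinement pass
```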

FAQ

How do I switch between text-to-video and image-to-video in this workflow? There's a boolean toggle. True runs text-to-video — no image needed. False activates image-to-video and uses your uploaded image as the starting frame. Everything else in the workflow stays the same.

Why are video and audio CFG set to different values? They govern different things. Video CFG at 3 keeps motion natural and avoids rigidity. Audio CFG at 7 pushes the model harder to generate coherent, scene-matched sound. Multimodal guidance lets you tune them independently so one doesn't compromise the other.

What kind of prompts work best for audio generation in LTX 2.3? Describe sound the same way you describe visuals — specifically. Name instruments, materials, environments, and actions that produce sound. "Bamboo whisk tapping against ceramic" gives the model something concrete. "Ambient music" does not.

Does LTX 2.3 image-to-video preserve the composition of the input image? Yes. Image strength is set to 0.7, which keeps the generated motion anchored close to the source frame. Lower it toward 0.5 if you want the model to move further from the starting composition.

How do you run LTX 2.3 text to video and image to video online? You can run it online through Floyo. No installation, no setup. Open the workflow in your browser, write your prompt, and hit run.
