Step Audio EditX for Voice Cloning

Upload a voice sample, transcribe it automatically with Whisper, then use Step-Audio EditX to clone that voice speaking your custom script. No trigger word needed.

Audio2Audio

Audio Editing

Step Audio EditX

Voice Cloning

Generates in about 56 secs

floyoofficial

Nodes & Models

ComfyUI Official: WorkflowGraphics, LoadAudio, SaveAudioMP3

dspy_nodes: ShowText|pysssss

ComfyUI-Custom-Scripts: ShowText|pysssss

ComfyUI-Step_Audio_EditX_TTS: StepAudio_VoiceClone

Description:

Clone a voice from a single audio clip and make it say anything you type.

Upload a reference audio file of the voice you want to clone. The workflow transcribes it with Whisper, then feeds the transcription and audio into Step Audio EditX to generate new speech in that same voice. You get an MP3 back, ready to use.

No fine-tuning. No training data. One clip in, new speech out.

How do you clone a voice with Step Audio EditX?

Upload a short audio clip of the voice you want to clone. The workflow auto-transcribes it, pairs the transcription with the audio as a voice reference, then generates new speech from your text prompt using Step Audio EditX. The output is an MP3 file.
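The order of operations described above can be sketched in a few lines. This only illustrates the data flow and is not the workflow's actual code: `transcribe` and `synthesize` are hypothetical callables standing in for the Whisper step and the StepAudio_VoiceClone node, and their signatures are assumptions made for this example.

```python
from typing import Callable


def clone_voice(
    reference_audio: bytes,
    script: str,
    transcribe: Callable[[bytes], str],
    synthesize: Callable[[bytes, str, str], bytes],
) -> bytes:
    """Run the stages in order and return the generated audio bytes."""
    # 1. Transcribe the reference clip (Whisper in the real workflow).
    transcript = transcribe(reference_audio)
    # 2. Pair the transcript with the audio as the voice reference,
    # 3. then generate new speech for the script.
    return synthesize(reference_audio, transcript, script)


# Stand-in stages (hypothetical) so the sketch runs without any models.
def fake_transcribe(audio: bytes) -> str:
    return "hello there"


def fake_synthesize(audio: bytes, transcript: str, script: str) -> bytes:
    return f"[voice:{transcript}] {script}".encode("utf-8")


result = clone_voice(b"\x00\x01", "Read this line.", fake_transcribe, fake_synthesize)
```

Passing the stages in as callables keeps the sketch independent of any particular model API, which the page does not document.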

Reference audio (prompt_audio): This is the voice sample Step Audio EditX uses as its voice reference. Use a clean, clear recording, because background noise and music reduce quality. 5 to 15 seconds works well. Longer clips don't help much and slow things down.
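If you want to check a clip against the 5 to 15 second guideline before uploading, a small script can do it. This is a minimal sketch that uses only Python's standard `wave` module, so it handles uncompressed WAV files only (an MP3 would need a third-party decoder). The helper names are made up for this example, and `write_silent_wav` exists only to produce a test file.

```python
import wave


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def in_recommended_range(seconds: float, low: float = 5.0, high: float = 15.0) -> bool:
    """True when the clip length falls inside the 5 to 15 second guideline."""
    return low <= seconds <= high


def write_silent_wav(path: str, seconds: float, rate: int = 16000) -> None:
    """Write a mono 16-bit silent WAV, used here only as a demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
```

A duration check says nothing about noise or room tone, so it complements listening to the clip rather than replacing it.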

Text prompt (prompt_text): Type what you want the cloned voice to say. The default example is a frustrated customer service call. Replace it with whatever you need. Punctuation and sentence structure affect pacing and tone, so write naturally.

Temperature: Controls how much variation the model adds to its output. Default is 0.7. Want predictable, steady delivery? Drop it to 0.3 or 0.4. Want more expressive, dynamic reads? Push it toward 0.9. Going above 1.0 adds randomness that can sound unnatural.
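To see why lower temperatures sound steadier, it helps to look at how temperature is commonly applied in token-sampling models: scores are divided by the temperature before being turned into probabilities. The sketch below assumes Step Audio EditX follows that common scheme, which this page does not confirm. The scores are invented for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
steady = softmax_with_temperature(logits, 0.3)
default = softmax_with_temperature(logits, 0.7)
lively = softmax_with_temperature(logits, 1.0)
```

At 0.3 nearly all of the probability sits on the top-scoring token, so output varies little between runs. At 1.0 the lower-scoring tokens get picked more often, which is where the extra expressiveness and the risk of artifacts both come from.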

Max new tokens (max_new_tokens): Sets the maximum length of generated audio. The default of 2048 handles most short-to-medium scripts. If your output is getting cut off, raise this. If you want shorter clips and faster generation, lower it.
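If you know roughly how many tokens the model spends per second of audio, you can turn the token cap into a time budget. The page does not state that rate, so the value below is a made-up placeholder, not a documented figure. Measure it from your own runs before relying on the numbers.

```python
import math

# Hypothetical placeholder: tokens consumed per second of generated audio.
ASSUMED_TOKENS_PER_SECOND = 25.0


def estimate_seconds(max_new_tokens: int, tokens_per_second: float) -> float:
    """Upper bound on audio length for a given token cap."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


def tokens_needed(target_seconds: float, tokens_per_second: float) -> int:
    """Smallest token cap that covers the target duration."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return math.ceil(target_seconds * tokens_per_second)


default_budget = estimate_seconds(2048, ASSUMED_TOKENS_PER_SECOND)
two_minute_cap = tokens_needed(120, ASSUMED_TOKENS_PER_SECOND)
```

The useful part is the proportionality: doubling max_new_tokens doubles the longest clip you can get, whatever the real rate turns out to be.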

Seed: Set to randomize by default, so each run sounds slightly different. Lock the seed to a specific number when you want to compare settings without voice variation between runs.
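The effect of locking the seed is easy to demonstrate with any seeded random generator. This sketch uses Python's `random` module as a stand-in for the model's sampler. It shows the general principle only, not how the StepAudio_VoiceClone node handles its seed internally.

```python
import random


def sample_tokens(weights, count, seed=None):
    """Draw `count` token indices; seed=None mimics the randomize setting."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=count)


weights = [0.6, 0.3, 0.1]

# Same seed, same draws: differences between runs now come from
# whatever setting you changed, not from chance.
locked_a = sample_tokens(weights, 20, seed=1234)
locked_b = sample_tokens(weights, 20, seed=1234)

# No seed: the generator is seeded from the system, so draws can differ.
free_run = sample_tokens(weights, 20)
```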

What is Step Audio EditX voice cloning good for?

Step Audio EditX voice cloning works best when you need consistent voice output from a single reference clip without any model training. It handles dialogue, narration, and short-form audio where matching a specific voice matters more than producing long-form content.

Voiceover prototyping is the sweet spot. You have a voice you like, and you need to hear how a script sounds in that voice before committing to a full recording session. Upload the sample, paste your script, and get a preview in about a minute.

Character voice work for games, animations, or podcasts is another strong use. You can generate multiple takes with different temperatures to find the right delivery.
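One way to organize those takes is to build a list of settings up front, varying only the temperature and holding the seed fixed so the comparison stays fair. The dictionary keys below mirror the parameter names on this page, but treating them as the exact input names of the workflow is an assumption. `build_takes` is a hypothetical helper.

```python
def build_takes(script, temperatures, seed=42, max_new_tokens=2048):
    """Return one settings dict per temperature, all sharing one seed."""
    return [
        {
            "prompt_text": script,
            "temperature": temperature,
            "seed": seed,  # fixed so only temperature changes between takes
            "max_new_tokens": max_new_tokens,
        }
        for temperature in temperatures
    ]


takes = build_takes("Welcome back, traveler.", [0.4, 0.7, 0.9])
```

Each dictionary corresponds to one run of the workflow. Keeping them in a list makes it easy to note which take came from which temperature when you listen back.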

This workflow is less suited for long-form audiobook narration or cases where you need perfect emotional range across many paragraphs. For those, a dedicated TTS pipeline with more control over prosody will serve you better.

FAQ

What audio format works best for Step Audio EditX voice cloning?

MP3 and WAV both work. Use a clean recording with minimal background noise. The model picks up on room tone and artifacts, so studio-quality or close-mic recordings give the best clones. 5 to 15 seconds of speech is enough.

How long can the generated speech be with Step Audio EditX?

The max_new_tokens setting controls output length. At the default of 2048, you can generate short to medium clips. For longer scripts, increase this value, but generation time goes up with it.

Does Step Audio EditX need training data to clone a voice?

No. It works from a single reference audio clip. No fine-tuning, no dataset, no training loop. Upload one sample and start generating.

What temperature should I use for Step Audio EditX voice cloning?

Start at the default of 0.7 for balanced results. Lower values (0.3 to 0.5) give more monotone, predictable delivery. Higher values (0.8 to 1.0) add expressiveness but can introduce artifacts if pushed too far.

How do I run Step Audio EditX voice cloning online?

You can run Step Audio EditX voice cloning online through Floyo. No installation, no setup. Open the workflow in your browser, upload your reference audio, type your script, and hit run. Free to try.
