Step Audio EditX for Voice Cloning
Upload a voice sample, transcribe it automatically with Whisper, then use Step Audio EditX to clone that voice speaking your custom script. No trigger word needed.
Audio2Audio
Audio Editing
Step Audio EditX
Voice Cloning
Nodes & Models
WorkflowGraphics
LoadAudio
SaveAudioMP3
ShowText|pysssss
ShowText|pysssss
StepAudio_VoiceClone
Description:
Clone a voice from a single audio clip and make it say anything you type.
Upload a reference audio file of the voice you want to clone. The workflow transcribes it with Whisper, then feeds the transcription and audio into Step Audio EditX to generate new speech in that same voice. You get an MP3 back, ready to use.
No fine-tuning. No training data. One clip in, new speech out.
How do you clone a voice with Step Audio EditX?
Upload a short audio clip of the voice you want to clone. The workflow auto-transcribes it, pairs the transcription with the audio as a voice reference, then generates new speech from your text prompt using Step Audio EditX. The output is an MP3 file.
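The data flow described above can be sketched in Python. This is a minimal illustration, not the workflow's actual node code: `clone_voice`, `transcribe`, and `clone` are hypothetical names, and the two callables stand in for the Whisper step and the StepAudio_VoiceClone node, whose real interfaces are not shown here.

```python
from typing import Callable

# Hypothetical stand-ins. Neither signature is taken from Whisper or Step Audio EditX.
Transcriber = Callable[[str], str]          # audio path -> transcript
Cloner = Callable[[str, str, str], bytes]   # (audio path, transcript, script) -> audio bytes


def clone_voice(prompt_audio: str, script: str,
                transcribe: Transcriber, clone: Cloner) -> bytes:
    """Mirror the workflow's order of operations: transcribe, pair, generate."""
    # Step 1: auto-transcribe the reference clip.
    reference_text = transcribe(prompt_audio).strip()
    if not reference_text:
        raise ValueError("reference clip produced an empty transcript")
    # Step 2: pair the transcript with the audio as the voice reference,
    # then generate new speech for the script.
    return clone(prompt_audio, reference_text, script)
```

Passing the two stages in as callables keeps the sketch independent of any particular runtime. In the hosted workflow the nodes are already wired together, so none of this needs to be written by hand.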
Reference audio (prompt_audio): This is the voice sample Step Audio EditX uses as its reference. Use a clean, clear recording. Background noise and music reduce quality. Clips of 5 to 15 seconds work well. Longer clips don't help much and slow things down.
Text prompt (prompt_text): Type what you want the cloned voice to say. The default example is a frustrated customer service call. Replace it with whatever you need. Punctuation and sentence structure affect pacing and tone, so write naturally.
Temperature: Controls how much variation the model adds to its output. Default is 0.7. Want predictable, steady delivery? Drop it to 0.3 or 0.4. Want more expressive, dynamic reads? Push it toward 0.9. Going above 1.0 adds randomness that can sound unnatural.
Max new tokens (default 2048): Caps how many tokens the model generates, which in turn limits the length of the audio. The default handles most short-to-medium scripts. If your output is getting cut off, raise this. If you want shorter clips and faster generation, lower it.
Seed: Set to randomize by default, so each run sounds slightly different. Lock the seed to a specific number when you want to compare settings without run-to-run voice variation.
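For reference, the settings above can be collected in one place. The sketch below is only an illustration: `VoiceCloneSettings` and its `warnings` method are hypothetical and not part of the workflow or of Step Audio EditX. The defaults and the 1.0 temperature threshold are the ones stated above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VoiceCloneSettings:
    """Illustrative container; field names follow the labels above."""
    prompt_audio: str                 # path to the reference clip
    prompt_text: str                  # what the cloned voice should say
    temperature: float = 0.7          # default from the workflow
    max_new_tokens: int = 2048        # default from the workflow
    seed: Optional[int] = None        # None stands for "randomize"

    def warnings(self) -> List[str]:
        """Return human-readable notes for settings the text advises against."""
        notes = []
        if not self.prompt_text.strip():
            notes.append("prompt_text is empty")
        if self.temperature > 1.0:
            notes.append("temperature above 1.0 can sound unnatural")
        if self.max_new_tokens <= 0:
            notes.append("max_new_tokens must be positive")
        return notes
```

A settings object built with the defaults returns no warnings, while one with `temperature=1.2` returns a single note about unnatural output.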
What is Step Audio EditX voice cloning good for?
Step Audio EditX voice cloning works best when you need consistent voice output from a single reference clip without any model training. It handles dialogue, narration, and short-form audio where matching a specific voice matters more than producing long-form content.
Voiceover prototyping is the sweet spot. You have a voice you like, and you need to hear how a script sounds in that voice before committing to a full recording session. Upload the sample, paste your script, and get a quick preview.
Character voice work for games, animations, or podcasts is another strong use. You can generate multiple takes with different temperatures to find the right delivery.
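One way to organize several takes is to vary only the temperature and keep the seed locked, so differences between takes come from the setting under test. The helper below is a hypothetical sketch: `temperature_takes` is not a workflow function, and the dictionary keys simply echo the setting names used on this page.

```python
from typing import Dict, List, Sequence, Union

Take = Dict[str, Union[int, float]]


def temperature_takes(temperatures: Sequence[float], seed: int,
                      max_new_tokens: int = 2048) -> List[Take]:
    """Build one settings dict per temperature, all sharing the same seed."""
    return [
        {"temperature": t, "seed": seed, "max_new_tokens": max_new_tokens}
        for t in temperatures
    ]


# Three takes: steady, default, and more expressive.
takes = temperature_takes([0.3, 0.7, 0.9], seed=1234)
```

Each dictionary would then be entered into the workflow's controls for one run.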
This workflow is less suited for long-form audiobook narration or cases where you need perfect emotional range across many paragraphs. For those, a dedicated TTS pipeline with more control over prosody will serve you better.
FAQ
What audio format works best for Step Audio EditX voice cloning?
MP3 and WAV both work. Use a clean recording with minimal background noise. The model picks up on room tone and artifacts, so studio-quality or close-mic recordings give the best clones. 5 to 15 seconds of speech is enough.
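If you want to check a clip's length before uploading, a WAV file can be measured with Python's standard library. This is a small sketch under stated assumptions: the helper names are hypothetical, the 5 to 15 second range is the guideline from this page rather than a limit enforced by the model, and the `wave` module reads WAV files only, so an MP3 would need converting first.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the length of a WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def in_recommended_range(seconds: float, low: float = 5.0, high: float = 15.0) -> bool:
    """True if the clip length falls inside the suggested 5 to 15 second window."""
    return low <= seconds <= high
```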
How long can the generated speech be with Step Audio EditX?
The max_new_tokens setting controls output length. At the default of 2048, you can generate short to medium clips. For longer scripts, increase this value, but generation time goes up with it.
Does Step Audio EditX need training data to clone a voice?
No. It works from a single reference audio clip. No fine-tuning, no dataset, no training loop. Upload one sample and start generating.
What temperature should I use for Step Audio EditX voice cloning?
Start at the default of 0.7 for balanced results. Lower values (0.3 to 0.5) give more monotone, predictable delivery. Higher values (0.8 to 1.0) add expressiveness but can introduce artifacts if pushed too far.
How do I run Step Audio EditX voice cloning online?
You can run Step Audio EditX voice cloning online through Floyo. No installation, no setup. Open the workflow in your browser, upload your reference audio, type your script, and hit run. Free to try.
