floyo logo
Powered by
ThinkDiffusion

Step Audio EditX for Voice Editing

Edit existing voice recordings with Step-Audio-EditX. Change emotion, dialect, or style. Whisper transcribes your audio so you describe the edit, not the source.


Nodes & Models

WorkflowGraphics
LoadAudio
SaveAudioMP3
ShowText|pysssss
ShowText|pysssss
StepAudio_AudioEdit

Description:

Change how a voice sounds without re-recording it. Upload an audio clip, and Step-Audio-EditX rewrites its emotion, speaking style, speed, or paralinguistic details (breathing, laughter, sighs) while keeping the original voice intact.

This workflow auto-transcribes your audio using Whisper, so you don't need to type the transcript yourself. Pick an edit type, choose the target emotion or style, and run it. The output saves as a 320 kbps MP3.

How do you edit voice emotion with Step-Audio-EditX?

Upload your audio, pick an edit type (emotion, style, speed, paralinguistic, or denoise), choose a target value like "happy" or "whisper," set your iterations, and run. The workflow auto-transcribes your clip with Whisper, so no manual transcript is needed. More iterations mean a stronger effect.

Edit Type: This is the main control. Options include emotion, style, speed, paralinguistic, and denoise. Each one unlocks a different set of target values. Start with emotion if you want to change how expressive the voice sounds.

Emotion / Style / Speed / Paralinguistic / Denoise Value: After picking your edit type, set the matching target. For emotion: happy, sad, angry, and others. For style: whisper, serious, and more. For paralinguistic: add breathing, laughter, or sighs. For denoise: clean up background noise. Leave all other values on "none."
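The "one active value, everything else on none" rule can be sketched in a few lines of Python. This is only an illustration: the function name and the dictionary keys are hypothetical and are not the node's actual input names, and the valid target values come from the dropdowns in the workflow itself.

```python
# Hypothetical sketch: the key names below are illustrative, not the
# real input names of the StepAudio_AudioEdit node.
EDIT_TYPES = ("emotion", "style", "speed", "paralinguistic", "denoise")


def build_edit_settings(edit_type: str, target: str) -> dict:
    """Set the value matching the chosen edit type; leave the rest on 'none'."""
    if edit_type not in EDIT_TYPES:
        raise ValueError(f"unknown edit type: {edit_type!r}")
    settings = {"edit_type": edit_type}
    for name in EDIT_TYPES:
        # Only the field that matches the edit type carries a target value.
        settings[name] = target if name == edit_type else "none"
    return settings


settings = build_edit_settings("emotion", "happy")
print(settings)
```

Keeping the unused fields on "none" means a leftover value from an earlier run cannot be mistaken for the active target.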

Iterations: Controls how many editing passes the model runs on your audio. Default is 2. One pass gives a subtle shift. Two or three passes make the effect more obvious. Want a strong emotion change? Try 3. Want something light? Set it to 1.
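Conceptually, iterations behave like a loop in which each pass works on the result of the previous one. The sketch below assumes that chaining behaviour; `run_passes` and `edit_once` are hypothetical names, and `edit_once` stands in for a single editing pass rather than any real API.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_passes(audio: T, edit_once: Callable[[T], T], iterations: int = 2) -> T:
    """Apply a single-pass edit repeatedly, feeding each output into the next pass."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    for _ in range(iterations):
        audio = edit_once(audio)
    return audio


# Stand-in "edit" that only counts passes, to show the chaining.
passes_applied = run_passes(0, lambda x: x + 1, iterations=3)
print(passes_applied)
```

Because every pass builds on the last, any artifacts also compound, which is one way to read the advice to keep the count low and compare results by ear.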

Temperature: Default is 0.7. Lower values (0.3 to 0.5) make the output more predictable and closer to the original. Higher values (0.8 to 1.0) give more variation but can drift from the source voice. 0.7 is a good starting point for most edits.
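To see why lower temperatures are more predictable, here is a minimal sketch of temperature-scaled softmax. It assumes the setting works like the usual sampling temperature in token-based generative models; the logits are made-up numbers for illustration, not values from Step-Audio-EditX.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # illustrative scores for three candidate tokens
for temperature in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a lower temperature the top candidate takes a larger share of the probability, so sampling picks it more often. At a higher temperature the shares even out and the output varies more.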

Max New Tokens: Default is 2048. This sets the ceiling for output length. For clips under 15 seconds, the default works. For longer clips, you may need to increase it, but keep audio under 30 seconds per run for best results.

What is Step-Audio-EditX voice editing good for?

Step-Audio-EditX is built for changing how a voice sounds without changing what it says. It works best for short clips (under 30 seconds) where you need a specific emotion, style, or pacing shift. Iterative editing lets you dial in the effect across multiple passes.

Voice actors and content creators can take a flat read and add emotion after the fact. Record once, then run it through with "happy," "angry," or "whisper" to get multiple versions from the same take.

Podcast and video editors can clean up audio with the denoise option, or add natural-sounding paralinguistic details like breathing and laughter to make narration feel less robotic.

Localization teams can adjust pacing and tone for different markets without re-recording. The model supports Mandarin, English, Cantonese, and Sichuanese.

One thing to know: this is an editing workflow, not a voice cloning workflow. You're transforming existing audio, not generating new speech from text. For zero-shot TTS, you'd need the cloning variant of Step-Audio-EditX.

FAQ

What emotions can Step-Audio-EditX edit?

The model supports emotions like happy, sad, angry, and more. You pick the target emotion from a dropdown, set your iterations, and run. Two to three iterations give a noticeable shift without distorting the voice.

How long should my audio clip be for Step-Audio-EditX?

Keep clips under 30 seconds per run for the best results. The workflow handles shorter clips more reliably. If you have longer audio, split it into segments first and run each one separately.
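If your source is a WAV file, the splitting step can be done with Python's standard `wave` module. This is a minimal sketch, not part of the workflow: the function name, the output naming scheme, and the 29-second default are assumptions. It cuts at fixed time boundaries, so a cut can land mid-word, and you may want to adjust the segment boundaries by ear.

```python
import wave


def split_wav(path, out_prefix, max_seconds=29.0):
    """Split a WAV file into consecutive chunks of at most max_seconds each."""
    out_paths = []
    with wave.open(path, "rb") as src:
        frames_per_chunk = int(src.getframerate() * max_seconds)
        if frames_per_chunk < 1:
            raise ValueError("max_seconds is too small for this sample rate")
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source file
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy the audio format so every chunk plays like the source.
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Calling `split_wav("narration.wav", "narration_part")` would write `narration_part_000.wav`, `narration_part_001.wav`, and so on, each of which can then be uploaded and run separately.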

Can Step-Audio-EditX remove background noise?

Yes. Set the edit type to "denoise" and run it. One iteration handles light noise. For heavier noise, try two iterations. It works best on speech, not music.

How many iterations should I use in Step-Audio-EditX?

Start with 2 (the default). One iteration gives a subtle change. Two makes it clear. Three makes it strong. Going beyond three can start to degrade audio quality, so listen and compare as you go.

How do you run Step-Audio-EditX voice editing online?

You can run Step-Audio-EditX voice editing online through Floyo. No installation, no setup. Open the workflow in your browser, upload your audio, and hit run. Free to try.
