Step Audio EditX for Voice Editing

Edit existing voice recordings with Step-Audio-EditX. Change emotion, dialect, or style. Whisper transcribes your audio, so you describe the edit, not the source.

Audio2Audio

Step Audio EditX

Voice Editing

floyoofficial

Nodes & Models

ComfyUI Official

WorkflowGraphics

LoadAudio

SaveAudioMP3

dspy_nodes

ShowText|pysssss

ComfyUI-Custom-Scripts

ShowText|pysssss

ComfyUI-Step_Audio_EditX_TTS

StepAudio_AudioEdit

Description:

Change how a voice sounds without re-recording it. Upload an audio clip, and Step-Audio-EditX rewrites its emotion, speaking style, speed, or paralinguistic details (breathing, laughter, sighs) while keeping the original voice intact.

This workflow auto-transcribes your audio using Whisper, so you don't need to type the transcript yourself. Pick an edit type, choose the target emotion or style, and run it. The output saves as a 320 kbps MP3.

How do you edit voice emotion with Step-Audio-EditX?

Upload your audio, pick an edit type (emotion, style, speed, paralinguistic, or denoise), choose a target value like "happy" or "whisper," set your iterations, and run. The workflow auto-transcribes your clip with Whisper, so no manual transcript is needed. More iterations produce a stronger effect.

Edit Type: This is the main control. Options include emotion, style, speed, paralinguistic, and denoise. Each one unlocks a different set of target values. Start with emotion if you want to change how expressive the voice sounds.

Emotion / Style / Speed / Paralinguistic / Denoise Value: After picking your edit type, set the matching target. For emotion: happy, sad, angry, and others. For style: whisper, serious, and more. For paralinguistic: add breathing, laughter, or sighs. For denoise: clean up background noise. Leave all other values on "none."

Iterations: Controls how many editing passes the model runs on your audio. Default is 2. One pass gives a subtle shift. Two or three passes make the effect more obvious. Want a strong emotion change? Try 3. Want something light? Set it to 1.

Temperature: Default is 0.7. Lower values (0.3 to 0.5) make the output more predictable and closer to the original. Higher values (0.8 to 1.0) give more variation but can drift from the source voice. 0.7 is a good starting point for most edits.

Max New Tokens: Default is 2048. This sets the ceiling for output length. For clips under 15 seconds, the default works. For longer clips, you may need to increase it, but keep audio under 30 seconds per run for best results.
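The settings above can be summarized as a small sanity check before a run. This is a minimal sketch, not the workflow's actual node interface: the `EditSettings` class, its field names, and `check_settings` are hypothetical names chosen for illustration. Only the defaults (2 iterations, temperature 0.7, 2048 max new tokens) and the edit types come from the descriptions above. Treating temperatures outside 0.3 to 1.0 as worth a warning is an assumption based on the ranges mentioned.

```python
from dataclasses import dataclass
from typing import List

# Edit types listed in the workflow description.
EDIT_TYPES = ("emotion", "style", "speed", "paralinguistic", "denoise")


@dataclass
class EditSettings:
    """Hypothetical container for the workflow controls (names are illustrative)."""
    edit_type: str = "emotion"
    target_value: str = "happy"
    iterations: int = 2          # documented default
    temperature: float = 0.7     # documented default
    max_new_tokens: int = 2048   # documented default


def check_settings(s: EditSettings) -> List[str]:
    """Raise on invalid settings; return soft warnings for risky ones."""
    if s.edit_type not in EDIT_TYPES:
        raise ValueError(f"unknown edit type: {s.edit_type!r}")
    if s.iterations < 1:
        raise ValueError("iterations must be at least 1")
    if s.max_new_tokens < 1:
        raise ValueError("max_new_tokens must be positive")

    warnings = []
    if s.iterations > 3:
        warnings.append("more than 3 iterations can degrade audio quality")
    # Assumption: values outside the ranges discussed above deserve a warning.
    if not 0.3 <= s.temperature <= 1.0:
        warnings.append("temperature is outside the 0.3 to 1.0 range discussed")
    return warnings


if __name__ == "__main__":
    print(check_settings(EditSettings()))              # defaults: no warnings
    print(check_settings(EditSettings(iterations=4)))  # one warning
```

Keeping hard errors separate from soft warnings mirrors the guidance: an unknown edit type cannot run at all, while a fourth iteration runs but may hurt quality.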

What is Step-Audio-EditX voice editing good for?

Step-Audio-EditX is built for changing how a voice sounds without changing what it says. It works best for short clips (under 30 seconds) where you need a specific emotion, style, or pacing shift. Iterative editing lets you dial in the effect across multiple passes.

Voice actors and content creators can take a flat read and add emotion after the fact. Record once, then run it through with "happy," "angry," or "whisper" to get multiple versions from the same take.

Podcast and video editors can clean up audio with the denoise option, or add natural-sounding paralinguistic details like breathing and laughter to make narration feel less robotic.

Localization teams can adjust pacing and tone for different markets without re-recording. The model supports Mandarin, English, Cantonese, and Sichuanese.

One thing to know: this is an editing workflow, not a voice cloning workflow. You're transforming existing audio, not generating new speech from text. For zero-shot TTS, you'd need the cloning variant of Step-Audio-EditX.

FAQ

What emotions can Step-Audio-EditX edit?

The model supports emotions like happy, sad, angry, and more. You pick the target emotion from a dropdown, set your iterations, and run. Two to three iterations give a noticeable shift without distorting the voice.

How long should my audio clip be for Step-Audio-EditX?

Keep clips under 30 seconds per run for the best results. The workflow handles shorter clips more reliably. If you have longer audio, split it into segments first and run each one separately.
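Splitting longer audio into segments can be planned with a few lines of Python. This is a minimal sketch that only computes the cut points; `split_into_segments` is a hypothetical helper, and the 25-second default is an assumption chosen to leave headroom under the 30-second guideline. It cuts at fixed times, so in practice you would nudge each boundary to a pause to avoid splitting a word.

```python
from typing import List, Tuple


def split_into_segments(duration_s: float,
                        max_segment_s: float = 25.0) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds that cover the whole clip."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if max_segment_s <= 0:
        raise ValueError("max_segment_s must be positive")

    segments = []
    start = 0.0
    while start < duration_s:
        # The last segment is shortened so it ends exactly at the clip's end.
        end = min(start + max_segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    print(split_into_segments(75.0))
    # [(0.0, 25.0), (25.0, 50.0), (50.0, 75.0)]
```

Each (start, end) pair can then be exported with your audio editor of choice and run through the workflow as its own clip.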

Can Step-Audio-EditX remove background noise?

Yes. Set the edit type to "denoise" and run it. One iteration handles light noise. For heavier noise, try two iterations. It works best on speech, not music.

How many iterations should I use in Step-Audio-EditX?

Start with 2 (the default). One iteration gives a subtle change. Two makes it clear. Three makes it strong. Going beyond three can start to degrade audio quality, so listen and compare as you go.

How do you run Step-Audio-EditX voice editing online?

You can run Step-Audio-EditX voice editing online through Floyo. No installation, no setup. Open the workflow in your browser, upload your audio, and hit run. Free to try.
