ComfyUI-speech-dataset-toolkit

Author kale4eat

https://github.com/kale4eat/ComfyUI-speech-dataset-toolkit

Last updated

2025-06-17


The ComfyUI Speech Dataset Toolkit is a collection of custom nodes for building speech datasets for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). It relies on torchaudio for its core audio manipulation.

Supports audio editing operations such as cutting, trimming, and resampling.
Includes visualization tools for inspecting audio waveforms and spectrograms.
Integrates AI models for tasks such as voice activity detection and speech recognition.

Context

The ComfyUI Speech Dataset Toolkit is an extension for ComfyUI for creating and managing speech datasets. Its focus is a set of audio processing nodes that support speech-related AI workflows.

Key Features & Benefits

The toolkit covers loading and saving audio files, editing (cutting, trimming, and resampling), and visualization of audio data. These are the steps needed to prepare and refine audio for machine learning, so that ASR and TTS systems receive clean, consistently formatted input.
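The three editing operations named above can be illustrated with a short, dependency-free sketch. This is not the toolkit's code: the toolkit builds on torchaudio, while the function names here (`cut`, `trim_silence`, `resample_linear`) and the silence threshold are hypothetical, and the resampler uses plain linear interpolation with no anti-aliasing filter, which is cruder than a production resampler.

```python
from typing import List


def cut(samples: List[float], sample_rate: int,
        start_s: float, end_s: float) -> List[float]:
    """Return the samples between start_s and end_s (in seconds)."""
    start = max(0, int(round(start_s * sample_rate)))
    end = min(len(samples), int(round(end_s * sample_rate)))
    return samples[start:end]


def trim_silence(samples: List[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is below the threshold
    return samples[loud[0]:loud[-1] + 1]


def resample_linear(samples: List[float], orig_rate: int,
                    new_rate: int) -> List[float]:
    """Naive linear-interpolation resampler (assumption: no anti-aliasing)."""
    if not samples or orig_rate == new_rate:
        return list(samples)
    n_out = int(len(samples) * new_rate / orig_rate)
    out = []
    for i in range(n_out):
        pos = i * orig_rate / new_rate      # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

A typical preparation pass would cut a region of interest, trim its silent edges, and then resample to the rate the target ASR or TTS model expects.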

Advanced Functionalities

The toolkit incorporates AI models such as Demucs for audio source separation and Silero VAD for voice activity detection. Source separation can, for example, split a voice from background music, and voice activity detection locates the spans of a recording that contain speech. Both help keep non-speech audio out of the dataset.
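To show what a voice activity detector produces, here is a minimal sketch that uses a simple frame-energy threshold. It is a stand-in for illustration only: Silero VAD is a neural model and does not work this way, and the function name `detect_speech`, the 20 ms frame size, and the threshold value are all assumptions.

```python
from typing import List, Tuple


def detect_speech(samples: List[float], sample_rate: int,
                  frame_ms: int = 20,
                  threshold: float = 0.02) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose frame RMS energy meets threshold."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    n_frames = (len(samples) + frame_len - 1) // frame_len
    segments: List[Tuple[float, float]] = []
    start = None  # sample index where the current speech span began
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        is_speech = rms >= threshold
        if is_speech and start is None:
            start = f * frame_len
        elif not is_speech and start is not None:
            segments.append((start / sample_rate, f * frame_len / sample_rate))
            start = None
    if start is not None:
        # The recording ended while still inside a speech span.
        segments.append((start / sample_rate, len(samples) / sample_rate))
    return segments
```

The returned time spans can then drive a cutting step, so that only the speech portions of a long recording are kept as dataset clips.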

Practical Benefits

With the toolkit in their workflow, users gain finer control over audio manipulation and can improve the quality of their speech datasets. Because loading, editing, and analysis all happen as ComfyUI nodes, preparing audio files for machine learning takes less time and effort.

Credits/Acknowledgments

The toolkit is developed by contributors from the open-source community and draws inspiration from other projects such as ComfyUI-audio and ComfyUI-AudioScheduler. Credit is due to the original authors and maintainers of those projects, which influenced the development of this toolkit.

