Z-Image Turbo: Fast Image Generation in Seconds

Alibaba's team released Z-Image Turbo, and the standout feature is speed. This 6-billion parameter model generates images in 8-30 seconds on consumer GPUs while maintaining quality that rivals much larger models.

Run Z-Image Turbo directly on Floyo - no installation, no local GPU required. Generate images in your browser in seconds.

How Fast Is Z-Image Turbo?

On Floyo, generation takes only a few seconds (typically 3 to 10 seconds)!

Real generation times from Reddit users running locally:

  • RTX 3060 (12GB): ~30 seconds for 1024x1024

  • RTX 3080 Ti: 17-22 seconds for 1280x1024

  • RX 7900 XT: 8 seconds for 1024x1024

  • RTX 4070: 3.4 seconds for 1280x800

The model achieves these speeds by requiring only 8 inference steps compared to 20-50 steps for most modern image models.
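The arithmetic behind that claim is straightforward: denoising time scales roughly linearly with step count. A quick sketch, using an illustrative (assumed, not benchmarked) per-step latency:

```python
# Rough arithmetic behind the speed claim: total denoising time scales with
# the number of inference steps. The per-step latency below is an illustrative
# assumption for a mid-range consumer GPU, not a measured benchmark.

def estimated_generation_time(num_steps: int, seconds_per_step: float) -> float:
    """Total denoising time, ignoring text encoding and VAE decode overhead."""
    return num_steps * seconds_per_step

turbo = estimated_generation_time(8, 1.0)      # 8-step distilled Turbo model
standard = estimated_generation_time(40, 1.0)  # a typical 20-50 step model

print(f"Turbo: {turbo:.0f}s, standard: {standard:.0f}s, "
      f"speedup: {standard / turbo:.1f}x")
```

At the same per-step cost, cutting 40 steps to 8 is a 5x wall-clock speedup, which is why distillation to few-step generation matters more than raw parameter count here.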

Z-Image Turbo Specifications

  • Parameters: 6 billion

  • VRAM Requirements: 16GB for local use (not needed on Floyo)

  • Inference Steps: 8 steps

  • Generation Time: Sub-second on H800 GPUs, 8-30 seconds on consumer hardware

  • License: Apache-2.0 (fully open source)

  • Architecture: Scalable Single-Stream DiT (S3-DiT)

  • Developer: Alibaba Group
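For local use, loading the model follows the usual Hugging Face diffusers pattern. The sketch below is hypothetical: the model id "Tongyi-MAI/Z-Image-Turbo" and the generic `DiffusionPipeline` loader are assumptions; check the official model card for the exact repository name and recommended loading code.

```python
# Hypothetical local-inference sketch (assumed model id and pipeline class;
# see the official Hugging Face model card for the canonical version).

GEN_PARAMS = {
    "num_inference_steps": 8,  # Turbo is distilled for 8-step generation
    "width": 1024,
    "height": 1024,
}

def generate(prompt: str):
    # Imports kept inside the function: these require a GPU environment.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "Tongyi-MAI/Z-Image-Turbo",  # assumed repo id
        torch_dtype=torch.bfloat16,
    )
    pipe.to("cuda")  # roughly 16 GB VRAM for local use
    return pipe(prompt=prompt, **GEN_PARAMS).images[0]

# Usage (on a CUDA machine):
#   image = generate("a street market at dusk, neon sign reading 'open'")
#   image.save("z_image_turbo.png")
```

On Floyo none of this setup is needed; the same 8-step configuration runs in the browser.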

What Z-Image Turbo Does Well

Photorealistic Quality: The model produces natural-looking images with realistic skin textures. One Reddit user noted: "i have to say im really liking the natural look out of the box. It seems more like proper photos when going for that without the need for those camera loras."

According to the official site, Z-Image delivers "Photography-level realism with fine control over details, lighting, and textures. Achieves excellent aesthetic quality in composition and overall mood."

Bilingual Text Rendering: Z-Image Turbo can generate readable English and Chinese text within images - a feature most image models struggle with. The official docs confirm it excels at "accurately rendering complex Chinese and English text while preserving facial realism and overall aesthetic composition."

World Knowledge: According to the documentation, Z-Image "possesses vast understanding of world knowledge and diverse cultural concepts" and "uses structured reasoning to inject logic and common sense."

Uncensored Output: The model doesn't refuse common generation requests, though it has limitations with certain anatomical features.

From Reddit: "Finally, after SDXL we have a model that can generate proper eyelashes and non-plastic skin at the same time." - u/Toclick

Known Limitations

Limited Output Variety: Different seeds can produce very similar results for the same prompt, particularly in facial features. One user observed: "Changing the prompt and seed often makes very little difference."

Anatomical Accuracy: While the model handles female anatomy well, it struggles with male anatomy.

Text Encoding Speed: Initial prompt encoding can take up to a minute when changing prompts. Workaround from Reddit: "Setting the text encode to cpu instead of default increased the speed for me."

Artistic Range: The Turbo version prioritizes photorealism over stylistic variety compared to heavily fine-tuned models.

What Reddit Users Say About Z-Image Turbo

"I love this model, I'm speechless. It's the one we've all been waiting for... It's fast (3.4 seconds for 1280*800), powerful (painters and drawers styles etc.), lightweight compared to flux.2 and not censored." - u/Kaduc21

"The output is amazing for a 6b distilled model. Training a bunch of Loras and merging them with the base model would improve it a lot." - u/Shockbum

"It reminds Stable Diffusion 1.5 at the release, but better. Same freedom, no constraints." - u/Kaduc21

"Speed to aesthetic quality ratio is excellent." - u/abnormal_human

"WOW! SDXL SUCCESSOR!" - u/Shockbum

"I really hope so. I still prefer SDXL over any other newer model. It's just easier to iterate on and make a variety of pictures instead of waiting a minute or so per image" — u/SoulTrack

Z-Image Model Variants

Z-Image-Turbo (available now): "A distilled version of Z-Image with strong capabilities in photorealistic image generation, accurate rendering of both Chinese and English text, and robust adherence to bilingual instructions. It achieves performance comparable to or exceeding leading competitors with only 8 steps." Run it now on Floyo.

Z-Image-Base (coming soon): "The non-distilled foundation model. By releasing this checkpoint, we aim to unlock the full potential for community-driven fine-tuning and custom development."

Z-Image-Edit (coming soon): "A continued-training variant of Z-Image specialized for image editing. It excels at following complex instructions to perform a wide range of tasks, from precise local modifications to global style transformations, while maintaining high edit consistency."


Technical Architecture

Z-Image uses a Scalable Single-Stream DiT (S3-DiT) architecture where text, visual semantic tokens, and image VAE tokens are concatenated into one unified input stream. This approach is more parameter-efficient than dual-stream architectures.

The official docs explain: "In this setup, text, visual semantic tokens, and image VAE tokens are concatenated at the sequence level to serve as a unified input stream, maximizing parameter efficiency compared to dual-stream approaches."
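The single-stream idea can be illustrated with a toy tensor sketch. The dimensions below are made up for illustration; the real S3-DiT uses its own token counts and hidden size.

```python
import numpy as np

# Toy illustration of the single-stream input: all modalities are projected
# to a shared hidden size, then concatenated along the sequence axis so one
# transformer backbone attends over everything jointly. All sizes here are
# illustrative assumptions, not the model's actual dimensions.

batch, hidden = 2, 64
text_tokens = np.zeros((batch, 32, hidden))      # encoded prompt tokens
semantic_tokens = np.zeros((batch, 16, hidden))  # visual semantic tokens
vae_tokens = np.zeros((batch, 256, hidden))      # latent image patch tokens

# One unified stream, versus separate text/image towers in dual-stream DiTs.
stream = np.concatenate([text_tokens, semantic_tokens, vae_tokens], axis=1)
print(stream.shape)  # (2, 304, 64)
```

Because the backbone's weights are shared across the whole sequence rather than split into per-modality towers, more of the parameter budget goes into a single, larger transformer, which is the parameter-efficiency argument quoted above.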

The speed comes from Decoupled-DMD (Distribution Matching Distillation) - a technique that distills the multi-step base model into a few-step student while preserving output quality.
