Z-Image Base for Text to Image
Create stunning images using the Z-Image Base model (non-distilled).
Z‑Image‑Base is the full, non‑distilled version of the Z‑Image text‑to‑image model: a 6B S3‑DiT foundation model that keeps the complete training signal for maximum fidelity and controllability.
Overview
Z‑Image‑Base uses a single‑stream diffusion transformer where text, semantic, and image tokens share one sequence, making it efficient while still large enough for high‑quality outputs. As the undistilled checkpoint, it is not compressed or RL‑tuned for speed; it runs more steps than Turbo but offers richer detail, more nuanced styles, and better behavior under heavy classifier‑free guidance.
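The single-stream idea can be illustrated with a toy sketch (this is not the real Z-Image code; the token counts and embedding width are made up for illustration): text, semantic, and image tokens are concatenated into one sequence, so a single transformer stack attends across all modalities at once instead of routing them through separate branches.

```python
import numpy as np

d_model = 64  # hypothetical embedding width, far smaller than the real model

# Hypothetical token counts for each modality.
text_tokens = np.random.randn(77, d_model)      # prompt embeddings
semantic_tokens = np.random.randn(32, d_model)  # high-level semantic tokens
image_tokens = np.random.randn(1024, d_model)   # e.g. a 32x32 grid of latent patches

# One shared sequence: a single attention pass covers text, semantics,
# and image patches together, which is what "single-stream" refers to.
stream = np.concatenate([text_tokens, semantic_tokens, image_tokens], axis=0)
print(stream.shape)  # (1133, 64)
```

Because every token lives in the same sequence, cross-modal interaction comes for free from ordinary self-attention, with no separate cross-attention blocks to maintain.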
Why the base (non‑distilled) model matters
Full training signal: No distillation means generations reflect the original training distribution more faithfully, which helps with subtle textures, complex scenes, and edge cases.
Better for CFG and prompt engineering: It supports full CFG ranges and tends to respond smoothly to high guidance values, making it suitable for precise prompt‑driven control.
Fine‑tuning foundation: Designed as a “foundation checkpoint” for community LoRAs and custom training (styles, characters, products, domains) without fighting distillation artifacts.
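The CFG behavior above refers to standard classifier-free guidance, where the model is evaluated with and without the text condition and the difference is scaled by a guidance weight. A minimal numeric sketch of that combination rule (the general formula, not anything Z-Image-specific):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_text, g):
    """Classifier-free guidance: eps = eps_uncond + g * (eps_text - eps_uncond).

    g = 1.0 reproduces the conditional prediction; larger g pushes the
    output further toward the text condition, which is where a
    non-distilled model's smoother response to high g matters.
    """
    return eps_uncond + g * (eps_text - eps_uncond)

eps_u = np.zeros(4)  # toy unconditional prediction
eps_t = np.ones(4)   # toy text-conditioned prediction
print(cfg_combine(eps_u, eps_t, 7.5))  # [7.5 7.5 7.5 7.5]
```

Distilled checkpoints are often tuned for a narrow guidance range, whereas a base checkpoint like this one is claimed to degrade gracefully as `g` grows.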
Text‑to‑image behavior
Photorealistic images with strong aesthetics at around 1024×1024 and flexible custom aspect ratios.
Very good bilingual (Chinese + English) text rendering and semantic understanding, including labels, signs, and UI text.
More “organic” outputs than many RL‑polished models; users report it feels less like a uniform “vending machine” and more exploratory, especially across diverse seeds.
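For the flexible aspect ratios mentioned above, a common practice with latent-diffusion models is to keep the total pixel count near the native budget (about 1024x1024 here) while snapping each side to a multiple of 64. This helper is a hedged sketch of that convention, not a function from the official Z-Image repository:

```python
import math

def dims_for_aspect(aspect, budget=1024 * 1024, multiple=64):
    """Pick (width, height) near a pixel budget for a given aspect ratio.

    Keeps width * height close to `budget` and rounds each side to the
    nearest multiple of `multiple`, a typical constraint for latent models.
    """
    height = math.sqrt(budget / aspect)
    width = aspect * height
    snap = lambda x: max(multiple, round(x / multiple) * multiple)
    return snap(width), snap(height)

print(dims_for_aspect(16 / 9))  # (1344, 768)
print(dims_for_aspect(1.0))    # (1024, 1024)
```

Staying near the trained pixel budget tends to preserve composition quality better than simply stretching one dimension.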
Where to use Base vs Turbo
Choose Base when: you care most about fidelity, stylistic nuance, or you plan to fine‑tune / LoRA on top of it.
Choose Turbo when: latency and cost dominate (interactive apps, huge batches) and slight loss of maximum quality is acceptable.