Z-Image Base for Text to Image
Create stunning images using the Z-Image Base model (non-distilled).
Z‑Image‑Base is the full, non‑distilled version of the Z‑Image text‑to‑image model: a 6B S3‑DiT foundation model that keeps the complete training signal for maximum fidelity and controllability.
Overview
Z‑Image‑Base uses a single‑stream diffusion transformer where text, semantic, and image tokens share one sequence, making it efficient while still large enough for high‑quality outputs. As the undistilled checkpoint, it is not compressed or RL‑tuned for speed; it runs more steps than Turbo but offers richer detail, more nuanced styles, and better behavior under heavy classifier‑free guidance.
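The single-stream idea can be illustrated with a toy sketch (this is not the real Z-Image code; the token counts and embedding width are made up for illustration): text, semantic, and image tokens are concatenated into one sequence, so a single transformer stack attends across all modalities at once instead of routing them through separate branches.

```python
import numpy as np

d_model = 64  # hypothetical embedding width, far smaller than the real model

# Hypothetical token counts for each modality.
text_tokens = np.random.randn(77, d_model)      # prompt embeddings
semantic_tokens = np.random.randn(32, d_model)  # high-level semantic tokens
image_tokens = np.random.randn(1024, d_model)   # e.g. a 32x32 grid of latent patches

# One shared sequence: a single attention pass covers text, semantics,
# and image patches together, which is what "single-stream" refers to.
stream = np.concatenate([text_tokens, semantic_tokens, image_tokens], axis=0)
print(stream.shape)  # (1133, 64)
```

Because every token lives in the same sequence, cross-modal interaction comes for free from ordinary self-attention, with no separate cross-attention blocks to maintain.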
Why the base (non‑distilled) model matters
Full training signal: No distillation means generations reflect the original training distribution more faithfully, which helps with subtle textures, complex scenes, and edge cases.
Better for CFG and prompt engineering: It supports full CFG ranges and tends to respond smoothly to high guidance values, making it suitable for precise prompt‑driven control.
Fine‑tuning foundation: Designed as a “foundation checkpoint” for community LoRAs and custom training (styles, characters, products, domains) without fighting distillation artifacts.
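The CFG behavior above refers to standard classifier-free guidance, where the model is evaluated with and without the text condition and the difference is scaled by a guidance weight. A minimal numeric sketch of that combination rule (the general formula, not anything Z-Image-specific):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_text, g):
    """Classifier-free guidance: eps = eps_uncond + g * (eps_text - eps_uncond).

    g = 1.0 reproduces the conditional prediction; larger g pushes the
    output further toward the text condition, which is where a
    non-distilled model's smoother response to high g matters.
    """
    return eps_uncond + g * (eps_text - eps_uncond)

eps_u = np.zeros(4)  # toy unconditional prediction
eps_t = np.ones(4)   # toy text-conditioned prediction
print(cfg_combine(eps_u, eps_t, 7.5))  # [7.5 7.5 7.5 7.5]
```

Distilled checkpoints are often tuned for a narrow guidance range, whereas a base checkpoint like this one is claimed to degrade gracefully as `g` grows.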
Text‑to‑image behavior
Photorealistic images with strong aesthetics at around 1024×1024 and flexible custom aspect ratios.
Very good bilingual (Chinese + English) text rendering and semantic understanding, including labels, signs, and UI text.
More “organic” outputs than many RL‑polished models; users report it feels less like a uniform “vending machine” and more exploratory, especially across diverse seeds.
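For the flexible aspect ratios mentioned above, a common practice with latent-diffusion models is to keep the total pixel count near the native budget (about 1024x1024 here) while snapping each side to a multiple of 64. This helper is a hedged sketch of that convention, not a function from the official Z-Image repository:

```python
import math

def dims_for_aspect(aspect, budget=1024 * 1024, multiple=64):
    """Pick (width, height) near a pixel budget for a given aspect ratio.

    Keeps width * height close to `budget` and rounds each side to the
    nearest multiple of `multiple`, a typical constraint for latent models.
    """
    height = math.sqrt(budget / aspect)
    width = aspect * height
    snap = lambda x: max(multiple, round(x / multiple) * multiple)
    return snap(width), snap(height)

print(dims_for_aspect(16 / 9))  # (1344, 768)
print(dims_for_aspect(1.0))    # (1024, 1024)
```

Staying near the trained pixel budget tends to preserve composition quality better than simply stretching one dimension.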
Where to use Base vs Turbo
Choose Base when: you care most about fidelity, stylistic nuance, or you plan to fine‑tune / LoRA on top of it.
Choose Turbo when: latency and cost dominate (interactive apps, huge batches) and slight loss of maximum quality is acceptable.