Z-Image Turbo is a 6B diffusion transformer that generates high‑quality images in a few steps and works well with ControlNet nodes in ComfyUI for pose, depth, or edge guidance. ControlNet (via DWPose, Canny, Depth, or Union) lets you lock in composition, pose, and structure from a reference image while still using text to define style and details. Qwen VLM (Qwen-VL / Qwen2.5-VL) is a vision‑language model that can analyze images, describe them, refine prompts, or validate whether the generated image matches your textual intent.
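As a concrete example, the Qwen VLM step might look like the following Python sketch, which follows the standard Qwen2.5-VL usage in Hugging Face transformers (with the qwen-vl-utils helper package); the model ID, image path, and target description are illustrative, not prescribed by this workflow:

```python
# Minimal sketch: ask Qwen2.5-VL to describe a generated image and flag how it
# differs from the intended description. Assumes a recent transformers build
# with Qwen2.5-VL support and the qwen-vl-utils package installed; the model
# ID, image path, and target text below are illustrative.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "output_00001.png"},
        {"type": "text", "text": "Describe this image, then list any ways it "
                                 "differs from: 'a knight in silver armor, "
                                 "arms crossed, studio lighting'."},
    ],
}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding so only the answer remains.
answer = processor.batch_decode(
    [o[len(i):] for i, o in zip(inputs.input_ids, out)],
    skip_special_tokens=True)[0]
print(answer)
```

The resulting critique can be pasted back into your ComfyUI prompt by hand, or fed into an automated loop like the one sketched further below.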
This combo is useful for:
- Creators who want consistent characters and poses across many images, using ControlNet for structure and Qwen VLM to keep prompts and outputs on-brief.
- Designers and marketers who need accurate branded visuals, where Qwen VLM checks logos, colors, and layout while Z-Image Turbo + ControlNet keep the composition fixed.
- ComfyUI power users building complex graphs that mix text-to-image generation, reference guidance, and VLM-driven prompt refinement for higher reliability.
- Anyone doing dataset creation or concept exploration who wants an automated loop: generate → analyze with Qwen VLM → adjust the prompt or ControlNet input → regenerate (sketched below).
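A minimal sketch of that loop's control flow, with hypothetical generate_batch, critique_image, and refine_prompt helpers standing in for real ComfyUI and Qwen VLM calls (they are stubbed here so the skeleton runs on its own):

```python
# Hypothetical closed-loop sketch: generate, critique with a VLM, refine.
# The three helpers are placeholders for real ComfyUI and Qwen VLM wrappers,
# not actual library calls.
TARGET = "a knight in silver armor, arms crossed, studio lighting"

def generate_batch(prompt: str) -> list[str]:
    # Placeholder: queue a Z-Image Turbo + ControlNet workflow in ComfyUI
    # and return paths to the rendered images.
    return ["output_00001.png"]

def critique_image(image_path: str, target: str) -> str:
    # Placeholder: ask Qwen VLM to compare the image against the target text.
    return "matches"

def refine_prompt(prompt: str, critique: str) -> str:
    # Placeholder: fold the VLM critique back into the prompt.
    return prompt + ", " + critique

prompt = TARGET
for _ in range(5):                        # cap iterations to avoid endless loops
    images = generate_batch(prompt)       # Z-Image Turbo + ControlNet step
    critique = critique_image(images[0], TARGET)  # Qwen VLM step
    if "matches" in critique.lower():     # naive acceptance test; tune as needed
        break
    prompt = refine_prompt(prompt, critique)
```

In practice the acceptance test would parse a structured answer from the VLM (for example, a yes/no field plus a list of discrepancies) rather than substring-matching its free text.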
A typical workflow is:
1. Feed a pose or layout image through a ControlNet preprocessor (DWPose, Depth, or Canny), write an initial prompt for Z-Image Turbo, and generate a first batch of guided images.
2. Send one or more outputs to Qwen VLM, ask it to describe the image or compare it against your intended description, and use its detailed text as an improved prompt or a prompt expansion.
3. Regenerate with Z-Image Turbo + ControlNet using the refined prompt, optionally repeating the loop until Qwen VLM's analysis says the image closely matches the target concept, pose, and details (see the API sketch after these steps).
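One way to drive step 3 programmatically is through ComfyUI's built-in HTTP API. The sketch below assumes a local server on the default port and a workflow exported with "Save (API Format)"; the workflow filename, the node ID "6" for the positive CLIPTextEncode node, and the prompt text are all illustrative and depend on your graph:

```python
# Sketch: queue one guided generation through ComfyUI's HTTP API.
# Assumes a local ComfyUI server on the default port and a workflow JSON
# exported via "Save (API Format)". The node ID "6" is illustrative; check
# your own graph for the positive CLIPTextEncode node's ID.
import json
import urllib.request

refined_prompt = "a knight in silver armor, arms crossed, dramatic rim lighting"

with open("zimage_controlnet_workflow.json") as f:
    workflow = json.load(f)

# Inject the Qwen-refined prompt into the positive text-encode node.
workflow["6"]["inputs"]["text"] = refined_prompt

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # includes a prompt_id for tracking the job
```

The response's prompt_id can then be polled via ComfyUI's /history endpoint to collect the finished images for the next Qwen VLM pass.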
This way, Z-Image Turbo provides speed and quality, ControlNet provides spatial accuracy, and Qwen VLM provides semantic accuracy, giving you a robust system for creating a wide variety of precise, repeatable images.