Create with Alibaba Happy Horse model now! Try here 👉

AI IMAGE GENERATION

Run ERNIE Image on Floyo

Baidu's 8B parameter text-to-image model with built-in prompt enhancer. Exceptional text rendering for posters, infographics, and UI mockups. Bilingual (Chinese + English). Apache 2.0 licensed.

Run Baidu's ERNIE Image through ComfyUI in your browser. No API key, no installs, no local GPU.

Parameters

8B (DiT)

Text Rendering

LongTextBench: 0.9733

Architecture

Single-stream DiT + Prompt Enhancer

License

Apache 2.0

Try ERNIE Image Now → Browse All Models

No installation. Runs in browser. Updated April 2026.

What do you get?

ERNIE Image is Baidu's 8B parameter open-source text-to-image model, released April 15, 2026 under the Apache 2.0 license. It is built on a single-stream Diffusion Transformer (DiT) with a lightweight Prompt Enhancer that expands short prompts into detailed visual descriptions before generation. It scores 0.9733 on LongTextBench (text rendering) and 0.8856 on GenEval (instruction following), and it handles posters, infographics, UI mockups, comics, multi-panel layouts, and bilingual Chinese/English text. Two variants are available: SFT (50 steps, maximum quality) and Turbo (8 steps, about 6x faster). The model runs on a single 24GB consumer GPU and is available as a ComfyUI node on Floyo.

ERNIE IMAGE WORKFLOWS ON FLOYO

ERNIE Image - Text to Image

What is ERNIE Image?

ERNIE Image is Baidu's open-source text-to-image model, released April 15, 2026. It is an 8 billion parameter single-stream Diffusion Transformer (DiT) paired with a lightweight Prompt Enhancer. With only 8B DiT parameters, it matches or exceeds larger open-source models across multiple benchmarks. The model is specifically designed for tasks that trip up most image generators: legible in-image text, structured layouts, posters, infographics, comics, and multi-panel compositions.

The Prompt Enhancer is what makes ERNIE Image forgiving for short prompts. Type "an abandoned Victorian mansion overtaken by vines, oil painting style" and the enhancer rewrites it into a detailed visual description with lighting, mood, and composition before the DiT sees it. This compensates for the smaller model size: a smart enhancer plus a focused 8B DiT produces output that competes with 20B+ models.

Text rendering is ERNIE Image's strongest capability. It scores 0.9733 on LongTextBench, the benchmark for dense, long-form, and layout-sensitive text in generated images. Posters with multi-line headlines, infographics with data labels, UI mockups with button text, and menus with item lists all render legibly. Both Chinese and English text render cleanly in the same generation pass.

Two variants ship with the release. The SFT variant runs at guidance scale 4.0 and 50 steps for maximum quality. The Turbo variant uses DMD (Distribution Matching Distillation) and reinforcement learning to cut inference from 50 steps to 8 at guidance scale 1.0, a roughly 6x speedup, while maintaining high-quality output.
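The guidance scale values are easier to read next to the usual classifier-free guidance formula. The sketch below assumes the standard combination (unconditional prediction plus scale times the conditional difference); the source only states the scale values, so treat the formula as an assumption about ERNIE Image rather than a documented detail. The function name is illustrative.

```python
from typing import List


def cfg_combine(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    Assumed formulation; the source does not document ERNIE Image's exact guidance math.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond_pred = [0.5, -1.0]
cond_pred = [1.0, 0.0]

# Scale 1.0 (the Turbo setting) returns the conditional prediction unchanged.
print(cfg_combine(uncond_pred, cond_pred, 1.0))

# Scale 4.0 (the SFT setting) pushes the result further along the conditional direction.
print(cfg_combine(uncond_pred, cond_pred, 4.0))
```

At scale 1.0 the unconditional term cancels out, which is one reason distilled few-step models are commonly run without guidance.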

On Floyo, ERNIE Image runs through native ComfyUI nodes on H100 NVL GPUs. The workflow includes the Prompt Enhancer toggle, configurable resolution, steps, CFG, and seed. Type a prompt and generate. No model downloads, no local GPU required.

What are ERNIE Image's technical specifications?

ERNIE Image uses an 8B parameter single-stream Diffusion Transformer with a lightweight Prompt Enhancer (Ministral 3B). Two variants: SFT (50 steps, CFG 4.0, max quality) and Turbo (8 steps, CFG 1.0, 6x faster). Default resolution is 1024x1024. Runs on a single 24GB consumer GPU. Bilingual Chinese/English prompts and in-image text. Apache 2.0 licensed with open weights on HuggingFace.

Spec	Details
Developer	Baidu (ERNIE-Image Team)
Architecture	Single-stream Diffusion Transformer (DiT) + lightweight Prompt Enhancer
DiT Parameters	8 billion
Prompt Enhancer	Ministral 3B text encoder (toggleable on/off)
VAE	Flux 2 VAE
SFT Variant	50 steps, guidance scale 4.0, maximum quality
Turbo Variant	8 steps, guidance scale 1.0, 6x faster (DMD + RL distilled)
Default Resolution	1024x1024 (also supports 832x1216 portrait, 1216x832 landscape)
Languages	English, Chinese, Japanese (prompts and in-image text)
LongTextBench	0.9733 (text rendering accuracy)
GenEval	0.8856 (instruction following)
OneIG-Bench (EN)	0.5750
OneIG-Bench (ZH)	0.5543
Min VRAM	24GB (single consumer GPU)
Deployment	Diffusers, SGLang, ComfyUI (Day-0 support)
License	Apache 2.0 (full commercial rights)
ComfyUI Access	Native support on Floyo (1 workflow)
Release Date	April 15, 2026
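The two variant rows in the table can be captured as presets. This is a minimal sketch: the step counts and guidance scales come from the table, while `VariantPreset`, `PRESETS`, `sampler_settings`, and the dictionary keys are hypothetical names chosen for illustration, not part of any ERNIE Image or ComfyUI API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantPreset:
    """Sampling settings for one variant (illustrative container)."""
    steps: int
    guidance_scale: float


# Values from the spec table; the keys "sft" and "turbo" are hypothetical labels.
PRESETS = {
    "sft": VariantPreset(steps=50, guidance_scale=4.0),
    "turbo": VariantPreset(steps=8, guidance_scale=1.0),
}


def sampler_settings(variant: str, seed: int = 0) -> dict:
    """Return a plain dict of sampler settings for the chosen variant."""
    if variant not in PRESETS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(PRESETS)}")
    preset = PRESETS[variant]
    return {"steps": preset.steps, "cfg": preset.guidance_scale, "seed": seed}


print(sampler_settings("turbo", seed=42))
```

Keeping the presets in one place avoids mixing Turbo step counts with SFT guidance values when switching between fast previews and final renders.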

What are ERNIE Image's key features?

ERNIE Image's feature set is built around one insight: a smaller model with a smart prompt enhancer can match a larger model that takes raw prompts directly. The 8B DiT focuses on rendering. The Ministral 3B enhancer handles prompt understanding. This separation of concerns is why ERNIE Image punches above its weight class.

Built-in Prompt Enhancer

A lightweight Ministral 3B LLM rewrites your short prompt into a detailed visual description before the DiT sees it. A one-line idea like "cyberpunk street market at night" becomes a full paragraph describing neon colors, rain reflections, vendor stalls, atmospheric haze, and camera angle. This compensates for the 8B model's limitations in complex prompt understanding. Toggle it off when you want your exact wording untouched.

Text Rendering (0.9733 LongTextBench)

ERNIE Image posts the strongest text rendering benchmark score among open-source models at this parameter count. Dense, long-form, and layout-sensitive text renders legibly. Posters with multi-line headlines, infographics with data labels and annotations, UI interfaces with button text, and menus with item lists all come out readable. Chinese and English text can appear in the same image.

8B Efficiency

Runs on a single consumer GPU with 24GB VRAM. The model footprint in bfloat16 is about 29.5GB, which is more than 24GB, so RTX 4090 and similar cards rely on CPU offloading to fit it. Despite being significantly smaller than competitors like Qwen-Image (20B) or HunyuanImage 3.0 (80B MoE), it matches or exceeds them on text rendering and instruction following benchmarks.
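The memory figures can be checked with simple arithmetic. The sketch below assumes 2 bytes per parameter for bfloat16 and 1 GB = 10^9 bytes; the 29.5GB footprint and 24GB VRAM figures are the ones quoted above. How the rest of the footprint splits across the other components is not stated in the source, so it is not broken down here.

```python
BYTES_PER_BF16_PARAM = 2  # bfloat16 stores each parameter in 16 bits


def weights_gb(params_billion: float, bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


dit_weights_gb = weights_gb(8)   # the 8B DiT alone
quoted_footprint_gb = 29.5       # full footprint quoted in the text
card_vram_gb = 24                # single consumer GPU

# Anything above the card's VRAM has to live elsewhere (for example CPU RAM).
must_offload_gb = max(0.0, quoted_footprint_gb - card_vram_gb)

print(f"DiT weights in bfloat16: {dit_weights_gb} GB")
print(f"Needs offloading on a {card_vram_gb} GB card: {must_offload_gb} GB")
```

The DiT weights alone fit within 24GB; it is the full footprint that exceeds the card and makes offloading necessary.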

Turbo Variant (6x Speed)

The Turbo variant uses DMD (Distribution Matching Distillation) and reinforcement learning to compress inference from 50 steps to 8. This yields a roughly 6x speedup while maintaining high quality. Use SFT for final production assets and Turbo for fast iteration and previews. Both share the same architecture and produce compatible output.

Structured Layouts and Comics

ERNIE Image handles multi-panel layouts, comic pages, and structured compositions that most image models cannot produce coherently. Panel boundaries, text placement, and visual hierarchy are maintained across the full image. This extends the model's use beyond single-scene generation into sequential visual storytelling.

Apache 2.0 License

Fully open source with commercial rights. Weights are on HuggingFace. Day-0 ComfyUI support was added in April 2026. Diffusers and SGLang deployment paths are both documented. Fine-tuning is supported through AI-Toolkit. GGUF weights are available through Unsloth.

How does ERNIE Image compare to other image models?

ERNIE Image leads on text rendering (LongTextBench 0.9733) among open-source models at its parameter count. LongCat leads on Chinese text specifically. Z-Image Turbo leads on inference speed. Nano Banana Pro leads on 4K native resolution and character consistency. GPT Image 2 leads on instruction fidelity with ~99% accuracy. ERNIE Image's edge: best text rendering per parameter, built-in prompt enhancer, and structured layout capabilities.

Model	Parameters	Text Rendering	Prompt Enhancer	License
ERNIE Image	8B	0.9733 LTB	Built-in (toggleable)	Apache 2.0
LongCat	6B	SOTA Chinese	No	Open source
Z-Image Turbo	6B	Good (EN + CN)	No	Apache 2.0
Nano Banana Pro	Gemini backbone	94%+	Thinking mode	Commercial API
FLUX.2 [dev]	32B	Moderate	No	Non-commercial

Source: Baidu ERNIE-Image GitHub, GenEval benchmark, OneIG-Bench, LongTextBench, HuggingFace model card, and third-party benchmark comparisons as of April 2026.

How does ERNIE Image work?

ERNIE Image uses a two-stage pipeline. First, the Prompt Enhancer (Ministral 3B) rewrites your short input into a structured visual description. Second, the 8B single-stream Diffusion Transformer generates the image from that enriched description through the Flux 2 VAE. The enhancer reads your resolution settings and shapes its description to match the chosen aspect ratio.
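The two stages can be sketched as plain function composition. Everything in this block is a stand-in: `toy_enhancer` and `toy_renderer` are hypothetical placeholders for the Prompt Enhancer and the DiT plus VAE decode, and they only manipulate strings. The point is the control flow, including the enhancer toggle and the fact that the enhancer sees the resolution.

```python
from typing import Callable, Optional

# Both stages take (text, width, height); the renderer returns a label instead of pixels.
Stage = Callable[[str, int, int], str]


def toy_enhancer(prompt: str, width: int, height: int) -> str:
    """Placeholder for the LLM rewrite: adds a framing hint based on aspect ratio."""
    if width > height:
        framing = "wide landscape framing"
    elif height > width:
        framing = "tall portrait framing"
    else:
        framing = "square framing"
    return f"{prompt}, {framing}, detailed lighting and composition"


def toy_renderer(description: str, width: int, height: int) -> str:
    """Placeholder for diffusion plus VAE decode."""
    return f"<{width}x{height} image of: {description}>"


def run_pipeline(prompt: str, width: int, height: int,
                 renderer: Stage, enhancer: Optional[Stage] = None) -> str:
    """Stage 1 (optional): rewrite the prompt. Stage 2: render from the description."""
    description = enhancer(prompt, width, height) if enhancer else prompt
    return renderer(description, width, height)


print(run_pipeline("a jazz festival poster", 832, 1216, toy_renderer, toy_enhancer))
print(run_pipeline("a jazz festival poster", 832, 1216, toy_renderer))  # enhancer off
```

With the enhancer off, the renderer receives the prompt exactly as typed, which mirrors the toggle described for the Floyo workflow.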

The single-stream DiT architecture processes text and image tokens in the same sequence, which is why text rendering works so well. The model treats in-image text as part of the visual composition, not a separate overlay. This unified approach means the generated text follows the same lighting, perspective, and style as the rest of the image.

The Turbo variant uses two acceleration techniques. DMD (Distribution Matching Distillation) trains a student model to approximate the full 50-step output in fewer steps. Reinforcement learning then fine-tunes the distilled model to maintain quality at 8 steps. The result is a roughly 6x speedup with minimal quality loss.
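The 6x figure lines up with the ratio of step counts. The sketch below works through that arithmetic and shows how a fixed per-image cost would pull the end-to-end speedup below the pure step ratio. The per-step and overhead timings are made-up illustration values, not measurements of ERNIE Image.

```python
SFT_STEPS = 50
TURBO_STEPS = 8

# Pure step-count ratio: 50 / 8.
step_ratio = SFT_STEPS / TURBO_STEPS


def end_to_end_speedup(seconds_per_step: float, fixed_overhead_s: float) -> float:
    """Speedup when each run also pays a fixed cost (hypothetical: prompt rewrite, decode)."""
    sft_time = SFT_STEPS * seconds_per_step + fixed_overhead_s
    turbo_time = TURBO_STEPS * seconds_per_step + fixed_overhead_s
    return sft_time / turbo_time


print(f"Step ratio: {step_ratio}")
# Illustrative timings only: 0.1 s per step and 0.5 s of fixed overhead.
print(f"Speedup with overhead: {end_to_end_speedup(0.1, 0.5):.2f}x")
```

With no fixed overhead the speedup equals the step ratio; any constant cost per image makes the observed gain smaller.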

On Floyo, ERNIE Image runs through native ComfyUI nodes on H100 NVL GPUs. The workflow loads the model, applies the Prompt Enhancer (if enabled), runs the diffusion steps, and decodes through the Flux 2 VAE. You control resolution, steps, CFG scale, seed, negative prompt, and the enhancer toggle. Output is a PNG image.

Fair warning: ERNIE Image is a generation-only model. It does not support image editing, inpainting, or image-to-image workflows. For editing, use Qwen Edit 2511 or LongCat Edit. The Prompt Enhancer rewrites your text, which means you give up exact wording control when it is enabled. If your prompt needs to be followed precisely, turn the enhancer off. Training data details have not been disclosed by Baidu.

Frequently Asked Questions

Common questions about running ERNIE Image on Floyo.

Is ERNIE Image free to use on Floyo?

You can start on Floyo's free plan. To keep using the service beyond the free tier, upgrade to a paid Floyo plan. ERNIE Image is open-source under Apache 2.0, so there is no additional API cost beyond your Floyo plan.

How do I run ERNIE Image without installing anything?

Open Floyo in your browser, find the "ERNIE Image - Text to Image" workflow (search "ERNIE" in the template library), and click Run. Type your prompt, set resolution, and generate. Floyo handles the GPU, ComfyUI environment, and model weights. No local install, no Python setup.

Who made ERNIE Image?

Baidu's ERNIE-Image Team. The model was open-sourced on April 15, 2026 under the Apache 2.0 license. Weights are on HuggingFace (baidu/ERNIE-Image). ComfyUI added Day-0 support in April 2026. AMD validated Day-0 GPU support on both Instinct MI355X and Radeon AI PRO R9700.

Do I need to write long prompts for ERNIE Image?

No. The built-in Prompt Enhancer expands short prompts into detailed visual descriptions before the image model sees them. A one-line idea is enough. Turn the enhancer off if you want full control over wording. The enhancer reads your resolution settings and adapts its description to match the aspect ratio.

Can ERNIE Image render text in images?

Yes. This is ERNIE Image's strongest capability. It scores 0.9733 on LongTextBench, which measures dense, long-form, and layout-sensitive text rendering. Posters, infographics, UI mockups, menus, and labels all come out legible. Both Chinese and English text render cleanly in the same image.

Can I combine ERNIE Image with other AI models in one workflow?

Yes. Floyo runs ComfyUI, which lets you chain multiple models. Generate with ERNIE Image, refine with Qwen Edit 2511, animate with Wan 2.7, add voiceover with Fish Audio S2. Or use ERNIE Image for fast concept brainstorming and switch to a different model for final production.

Can I use ERNIE Image output commercially?

Yes. ERNIE Image is released under the Apache 2.0 license, which grants full commercial usage rights. You can use generated images in products, marketing, client work, and any other commercial context without additional licensing.

What settings should I use for ERNIE Image?

Start with the defaults: 1024x1024 resolution, 20 steps, CFG 4, euler sampler with simple scheduler. For faster previews, drop steps to 12-16. For tighter prompt adherence, increase CFG to 5-6, keeping in mind that higher values can cause color artifacts. For portraits, try 832x1216. For landscapes, try 1216x832. Use a fixed seed to reproduce results.
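The starting settings above can be collected into a small helper. This is a sketch under stated assumptions: the default values and the three resolutions come from the text, while `pick_resolution`, `starting_settings`, the dictionary keys, and the choice of 14 preview steps (inside the suggested 12-16 range) are illustrative and not a ComfyUI or Floyo API.

```python
import math
from typing import Tuple

# Resolutions mentioned in the text, as (width, height).
SUPPORTED = {
    "square": (1024, 1024),
    "portrait": (832, 1216),
    "landscape": (1216, 832),
}


def pick_resolution(aspect_ratio: float) -> Tuple[int, int]:
    """Pick the listed resolution whose width/height ratio is closest (log scale)."""
    if aspect_ratio <= 0:
        raise ValueError("aspect_ratio must be positive")

    def distance(size: Tuple[int, int]) -> float:
        width, height = size
        return abs(math.log(aspect_ratio) - math.log(width / height))

    return min(SUPPORTED.values(), key=distance)


def starting_settings(aspect_ratio: float = 1.0, preview: bool = False, seed: int = 0) -> dict:
    """Defaults from the text; preview mode uses 14 steps (an arbitrary pick in 12-16)."""
    width, height = pick_resolution(aspect_ratio)
    return {
        "width": width,
        "height": height,
        "steps": 14 if preview else 20,
        "cfg": 4.0,
        "sampler": "euler",
        "scheduler": "simple",
        "seed": seed,  # fixed seed so results can be reproduced
    }


print(starting_settings(aspect_ratio=16 / 9, preview=True, seed=1234))
```

Comparing ratios on a log scale treats 2:1 and 1:2 as equally far from square, so portrait and landscape requests are handled symmetrically.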

Try ERNIE Image on Floyo

8B parameter text-to-image with built-in prompt enhancer, industry-leading text rendering, structured layouts, and bilingual Chinese/English support. Run it in your browser.
