AI VIDEO GENERATION + EDITING
Run Wan 2.7 Video on Floyo
Four models. Text-to-video, image-to-video with first/last frame control, multi-reference generation, and instruction-based video editing. Native audio. Up to 1080p.
Run Alibaba's Wan 2.7 Video through ComfyUI workflows in your browser. No API key, no installs, no local GPU.
Resolution
Up to 1080p
Duration
2 to 15 seconds
Video References
Up to 5
Audio
Native audio-visual sync
No installation. Runs in browser. Updated April 2026.


What You Get
Wan 2.7 is a unified image generation, image editing, and video generation model family from Alibaba Tongyi Lab. The image models use a thinking mode that reasons about composition before generating. They support up to 4K resolution, accurate text rendering, hex color control, and multi-image editing with up to 9 references. The video models add first/last frame control, 9-grid I2V, and instruction-based video editing. Available as ComfyUI nodes on Floyo.
WAN 2.7 WORKFLOWS ON FLOYO
What is Wan 2.7 Image?
Wan 2.7 Image is a unified generation-and-editing model from Alibaba Tongyi Lab, released April 1, 2026. It introduces thinking mode, where the model reasons about composition, spatial relationships, and prompt logic before generating. Four variants are available: Text-to-Image (up to 2048x2048), Text-to-Image Pro (up to 4096x4096), Image Edit (up to 2048x2048), and Image Edit Pro (up to 2K enhanced).
The biggest change from previous Wan image capabilities is how the model processes your prompt. Most image models run a single forward pass. Wan 2.7 Image adds a reasoning step first. The model analyzes what you asked for (spatial layout, object relationships, text content) and plans the composition before generating pixels. The trade-off is slightly longer generation time. The payoff is better prompt adherence, especially for complex scenes with multiple elements.
Three persistent problems in AI image generation get direct fixes here. First: all AI faces look the same. Wan 2.7 Image lets you specify bone structure, face shape, and eye style in your prompt to create distinct, believable characters. Second: color drift. A new palette extraction feature lets you input hex codes or drop in a reference image to lock colors to your brand guide. Third: text rendering. Previous models could handle a short headline at best. Wan 2.7 Image renders readable text up to 3,000 tokens, enough to fill an A4 page.
On Floyo, you access Wan 2.7 Image through ComfyUI nodes. Floyo runs the model on H100 NVL GPUs with 94GB VRAM, so you get full-speed generation without needing your own hardware or managing model downloads.
What are Wan 2.7 Image's technical specifications?
Wan 2.7 Image uses a unified generation-and-understanding architecture that maps text and visual semantics into a shared latent space. Four model variants cover text-to-image and image editing at standard and pro tiers. The Pro tier supports 4K output (4096x4096). All variants include thinking mode for improved prompt adherence and composition planning.
| Spec | Details |
|---|---|
| Developer | Alibaba Tongyi Lab |
| Model Variants | Text-to-Image, Text-to-Image Pro, Image Edit, Image Edit Pro |
| Max Resolution (Standard) | 2048x2048 (2K) |
| Max Resolution (Pro) | 4096x4096 (4K) |
| Thinking Mode | Built-in chain-of-thought reasoning (on by default for T2I) |
| Text Rendering | Up to 3,000 tokens of readable text per image |
| Reference Images | Up to 9 for editing, style transfer, and multi-reference fusion |
| Image Set Generation | Up to 12 coherent images per request |
| Prompt Length | Up to 5,000 characters |
| Color Control | Hex code input and palette extraction from reference images |
| ComfyUI Access | Wan image nodes in ComfyUI (search "Wan" in canvas) |
| Release Date | April 1, 2026 |
What is Wan 2.7's thinking mode?
Thinking mode is a reasoning step that runs before image generation. The model analyzes your prompt for spatial relationships, composition logic, and semantic intent, then plans the image layout before producing pixels. It is built into Wan 2.7 Text-to-Image and enabled by default. The result is better prompt adherence, especially for complex multi-element scenes.
This matters most for prompts that describe specific spatial arrangements ("three products arranged left to right in ascending size"), multi-element compositions ("a woman reading in a cafe with rain on the window and warm interior lighting"), and scenes requiring logical consistency ("a reflection in a mirror showing the back of the room"). Single-pass models often lose coherence on these kinds of prompts. Thinking mode reduces those failures.
The trade-off is generation time. Thinking mode adds a reasoning step, so each image takes slightly longer to produce. For simple prompts (a single subject on a plain background), the quality gain is minimal. For complex prompts, the improvement in composition and spatial accuracy is significant.
What are Wan 2.7 Image's key features?
Wan 2.7 Image combines generation and editing in a single model architecture. The feature set targets three long-standing problems in AI image generation: generic faces, unpredictable colors, and broken text rendering. Each feature below is confirmed from Alibaba's official documentation and third-party testing.
Thinking Mode
The model reasons about composition, spatial relationships, and prompt logic before generating. This produces more coherent images for complex prompts with multiple elements, specific layouts, or logical requirements like reflections and shadows. Enabled by default on Text-to-Image.
Text Rendering
Wan 2.7 Image renders readable text in generated images. Signs are legible. Product labels are accurate. Typography in posters and book covers looks designed rather than garbled. The model supports up to 3,000 tokens of text, enough to fill charts, formulas, and dense layouts. This has been the most persistent failure mode in AI image generation, and Wan 2.7 addresses it directly.
Face Personalization
You can specify bone structure, face shape (round, square, oblong), and eye style (narrow, deep-set, wide) directly in your prompt. The result is characters that look like specific, distinct individuals rather than a blended average. This is critical for storyboards, brand personas, and e-commerce models where character consistency matters.
Hex Color Control
Enter specific hex color codes and proportions in your prompt to lock the output to your brand palette. You can also drop in a reference image (a mood board, painting, or screenshot from your design system) and the model extracts the color distribution and applies it to the generated output. This removes a full round of post-production color correction for teams working under brand guidelines.
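To make the idea concrete, here is a minimal sketch of how a team might package a brand palette into a prompt clause. The `"hex at percent"` wording is an illustrative convention, not an official Wan 2.7 prompt syntax; adapt it to whatever phrasing the model responds to best.

```python
# Hypothetical sketch: turn a brand palette into a color-control prompt clause.
# The "#RRGGBB at N%" wording is illustrative, not an official Wan 2.7 syntax.

def hex_to_rgb(code: str) -> tuple[int, int, int]:
    """Convert '#RRGGBB' to an (r, g, b) tuple for sanity-checking inputs."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in range(0, 6, 2))

def palette_prompt(palette: dict[str, int]) -> str:
    """Build a color-control clause from {hex_code: percent} pairs."""
    total = sum(palette.values())
    if total != 100:
        raise ValueError(f"palette proportions must sum to 100, got {total}")
    parts = [f"{code} at {pct}%" for code, pct in palette.items()]
    return "color palette: " + ", ".join(parts)

brand = {"#FF6B35": 60, "#004E89": 30, "#F7F7F7": 10}
print(palette_prompt(brand))
# prints: color palette: #FF6B35 at 60%, #004E89 at 30%, #F7F7F7 at 10%
```

Keeping the palette in one place like this means every prompt in a campaign reuses the same locked color spec instead of re-typing hex codes by hand.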
Multi-Image Editing
Upload up to 9 reference images alongside a text prompt. The model can apply style transfer, swap elements between images, and fuse multiple references into a single output. Identity is preserved where it should be: change a background to a beach sunset, and the face, pose, and clothing remain unchanged while only the background transforms.
Image Set Generation
Generate up to 12 coherent images from a single prompt. The model maintains stylistic consistency across the full set. Use cases include the same character across different scenes (a cat through four seasons), product shots from different angles, storyboard sequences, and social media kits. Structured prompts that describe each image in the set produce the best results.
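Since structured prompts produce the best results for sets, a helper that numbers each scene keeps the layout consistent. This is a sketch under an assumed convention: the "Image N:" numbering is one reasonable format, not a documented Wan 2.7 requirement.

```python
# Sketch: build a structured image-set prompt from per-scene descriptions.
# The numbered "Image N:" layout is an assumed convention, not an official format.

def set_prompt(theme: str, scenes: list[str]) -> str:
    """One prompt describing every image in the set, under a shared theme."""
    if not 1 <= len(scenes) <= 12:  # Wan 2.7 generates up to 12 images per request
        raise ValueError("Wan 2.7 image sets support 1 to 12 images")
    lines = [f"A set of {len(scenes)} images, consistent style: {theme}."]
    lines += [f"Image {i}: {s}" for i, s in enumerate(scenes, start=1)]
    return "\n".join(lines)

seasons = ["a cat among spring blossoms", "a cat on a summer beach",
           "a cat in autumn leaves", "a cat in winter snow"]
print(set_prompt("soft watercolor illustration", seasons))
```

The same helper covers product-angle sets and storyboard sequences; only the `scenes` list changes.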
Click-to-Edit
Select specific areas of an image to add, move, or align elements with pixel-level accuracy. This interactive editing approach gives you precise control over individual parts of the composition without affecting the rest of the image. Available through the Image Edit variants.
What can you create with Wan 2.7 Image?
Wan 2.7 Image covers text-to-image generation, multi-reference image editing, style transfer, image set generation, and interactive click-to-edit. The combination of thinking mode, text rendering, and color control makes it suited for production workflows where brand consistency and prompt accuracy matter more than raw speed.
| Capability | What It Does | Use Case |
|---|---|---|
| Text-to-Image | Generate images from text prompts with thinking mode reasoning. Up to 2K standard, 4K with Pro. | Marketing assets, concept art, social media content |
| Image Editing | Edit images with natural language instructions. Upload up to 9 references for style transfer and element fusion. | Client revisions, product variant generation, background replacement |
| Text Rendering | Generate readable text inside images. Supports up to 3,000 tokens including charts and formulas. | Product labels, signage, poster design, infographics |
| Image Set Generation | Generate up to 12 coherent images from one prompt. Maintains character and style consistency across the set. | Storyboards, product catalogs, social media kits, presentation decks |
| Color Control | Lock output to specific hex codes or extract palettes from reference images for brand-accurate colors. | Brand campaigns, design system compliance, product photography |
| Face Personalization | Specify bone structure, face shape, and eye style in prompts to generate distinct, individual characters. | Character design, e-commerce model generation, avatar creation |
How does Wan 2.7 Image compare to other image models?
Wan 2.7 Image leads on instruction following, text rendering accuracy, and multi-reference editing. Midjourney V8 leads on artistic aesthetics. FLUX is faster for simple prompts with strong LoRA support. Seedream produces high visual quality but lacks thinking mode reasoning. Wan 2.7 is the only model in this group with built-in hex color control and image set generation up to 12 images.
| Model | Max Resolution | Text Rendering | Thinking Mode | API Access |
|---|---|---|---|---|
| Wan 2.7 | 4K (4096x4096) | Accurate (3,000 tokens) | Yes (built-in) | Yes |
| Midjourney V8 | Up to 2K | Improving, but inconsistent | No | No |
| FLUX | Up to 2K | Moderate | No | Yes |
| Seedream | Up to 2K | Limited | No | Yes |
Source: WaveSpeedAI comparison data, Alibaba official documentation, and third-party reviews as of April 2026. Midjourney does not offer API access. Aesthetic quality is subjective and not captured in this table.
What is Wan 2.7 Video?
Wan 2.7 Video is a video generation model from Alibaba Tongyi Lab that adds bidirectional frame control (first + last frame), 9-grid multi-image I2V, natural language video editing, combined subject and voice reference-to-video, and up to 5 simultaneous video references. It uses a Diffusion Transformer architecture with MoE routing and generates up to 1080p video with native audio.
The biggest change from Wan 2.6 is the number of control inputs the model accepts in a single generation call. Previous versions gave you a prompt and a starting image. Wan 2.7 adds last-frame anchoring, 9-grid image arrays, combined voice and appearance references, natural language editing, and up to 5 simultaneous video references.
Instruction-based video editing is the feature that makes 2.7 feel qualitatively different from a pure generation model. You pass an existing video alongside a natural language instruction ("change the background to a rain-soaked street," "swap the jacket to red") and receive an edited output rather than a full regeneration. Iteration cycles that required re-generating from scratch can now be handled as lightweight edits.
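As a rough illustration of what an edit call carries, here is a hypothetical request payload. Every field name here (`model`, `video_url`, `instruction`, `preserve_audio`) is an assumption for illustration; Wan 2.7's actual node inputs and API schema are not documented in this article and may differ.

```python
# Hypothetical sketch of an instruction-based edit request.
# All field names are illustrative assumptions, not Wan 2.7's real schema.
import json

def build_edit_request(video_url: str, instruction: str,
                       preserve_audio: bool = True) -> str:
    """Serialize an edit job: an existing clip plus a natural language change."""
    if not instruction.strip():
        raise ValueError("an edit instruction is required")
    payload = {
        "model": "wan-2.7-video-edit",   # illustrative model id
        "video_url": video_url,          # existing clip to edit, not regenerate
        "instruction": instruction,      # e.g. "swap the jacket to red"
        "preserve_audio": preserve_audio,
    }
    return json.dumps(payload, indent=2)

print(build_edit_request("https://example.com/clip.mp4",
                         "change the background to a rain-soaked street"))
```

The point of the shape, whatever the real schema looks like, is that the input is a finished video plus a short instruction, not a prompt that regenerates the clip from scratch.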
On Floyo, you access Wan 2.7 Video through ComfyUI nodes. Floyo runs the model on H100 NVL GPUs with 94GB VRAM, so you get full-speed generation without needing your own hardware or managing model downloads.
Fair warning: Wan 2.7 Image launched April 1, 2026. Wan 2.7 Video launched on select cloud platforms first. Open weights for local deployment have not been officially confirmed as of this writing. Based on the Wan family's consistent open-source pattern (2.1 and 2.2 were both released under Apache 2.0), open weights are expected within 4 to 8 weeks of cloud launch. Some features like instruction-based editing and video recreation are new and should be considered promising but still maturing.
What are Wan 2.7 Video's technical specifications?
Wan 2.7 Video uses a Diffusion Transformer with MoE (Mixture of Experts) routing and a T5 text encoder. It generates video at up to 1080p resolution for 2 to 15 seconds with native audio. New in 2.7: bidirectional frame control, 9-grid multi-image input, instruction-based editing, combined subject+voice R2V, and support for up to 5 simultaneous video references.
| Spec | Details |
|---|---|
| Architecture | Diffusion Transformer with T5 encoder and MoE routing |
| Resolution | Up to 1080p (4K reported in some configurations) |
| Duration | 2 to 15 seconds (up to 20-30 seconds reported) |
| Frame Rate | 24fps |
| Audio | Native audio-visual generation with synchronization |
| Frame Control | First + Last frame (bidirectional, new in 2.7) |
| Multi-Image Input | 9-grid layout (3x3 image array, new in 2.7) |
| Video References | Up to 5 simultaneous (up from 1-2 in 2.6) |
| Reference-to-Video | Combined subject appearance + voice in a single pass |
| Video Editing | Instruction-based editing via natural language (new in 2.7) |
| Aspect Ratios | 16:9 (landscape), 9:16 (portrait), 1:1 (square) |
| ComfyUI Access | Wan video nodes in ComfyUI |
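The spec table implies some quick arithmetic worth having on hand when planning a generation: frame counts at 24fps, and the pixel dimensions each supported aspect ratio yields at a 1080p target.

```python
# Arithmetic from the spec table: frame counts at 24fps, and 1080p-target
# dimensions for the three supported aspect ratios.

FPS = 24

def frame_count(seconds: float) -> int:
    """Total frames in a clip of the given duration at 24fps."""
    return round(seconds * FPS)

def dims_1080p(aspect: str) -> tuple[int, int]:
    """(width, height) at a 1080p target for the supported aspect ratios."""
    table = {"16:9": (1920, 1080), "9:16": (1080, 1920), "1:1": (1080, 1080)}
    return table[aspect]

print(frame_count(2), frame_count(15))  # prints: 48 360
print(dims_1080p("9:16"))               # prints: (1080, 1920)
```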
What can you create with Wan 2.7 Video?
Wan 2.7 Video covers text-to-video, image-to-video (single and 9-grid), first/last frame video, instruction-based video editing, reference-to-video with combined subject and voice, and video recreation. The editing and multi-reference features are new in 2.7 and change how you iterate on video content.
| Capability | What It Does | Use Case |
|---|---|---|
| First/Last Frame | Define both start and end frames. The model generates motion between them, reducing generation cycles by constraining the output space. | Storyboarding, looping content, narrative sequences |
| 9-Grid I2V | Provide a 3x3 array of 9 images. The model converts them into a continuous video with smooth transitions between panels. | Product showcases, character turnarounds, multi-scene storyboards |
| Video Editing | Edit existing video with natural language instructions. Change backgrounds, swap objects, adjust lighting without full regeneration. | Post-production adjustments, client revisions, scene variations |
| Reference-to-Video | Combine a subject appearance reference with a voice reference in one generation pass. No multi-pass pipeline needed. | Virtual presenters, character-led campaigns, educational content |
| Multi-Reference | Use up to 5 simultaneous video references (up from 1-2 in 2.6) for multi-character scenes and brand consistency. | Multi-person product demos, brand ambassador content |
| Video Recreation | Provide a reference video and describe changes. The model preserves the original motion structure while rebuilding the visual layer. | Adapting trending formats, converting footage to animation style |
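For first/last frame jobs, a cheap pre-flight check saves GPU minutes: the two anchor frames should share a resolution, and the duration must fit the 2 to 15 second range from the spec table. This is a hypothetical local helper, not part of any Wan 2.7 or Floyo API.

```python
# Hypothetical pre-flight check for a first/last frame job. The parameter
# names are illustrative; Wan 2.7's actual node inputs may differ.

def validate_flf_job(first_size: tuple[int, int],
                     last_size: tuple[int, int],
                     seconds: float) -> None:
    """Raise ValueError before submitting an obviously invalid job."""
    if first_size != last_size:
        raise ValueError(f"frame sizes differ: {first_size} vs {last_size}")
    if not 2 <= seconds <= 15:
        raise ValueError(f"duration {seconds}s outside the 2-15 s range")

validate_flf_job((1920, 1080), (1920, 1080), 5)  # passes silently
```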
What changed from Wan 2.6 to Wan 2.7?
Wan 2.7 adds an entire image generation and editing suite (with thinking mode, 4K output, and text rendering) alongside five major video upgrades over 2.6: bidirectional frame control, 9-grid multi-image I2V, instruction-based video editing, combined subject+voice reference-to-video, and support for up to 5 simultaneous video references.
| Feature | Wan 2.6 | Wan 2.7 |
|---|---|---|
| Image Generation | Basic T2I and editing | Thinking mode, 4K, text rendering, hex color, 12-image sets |
| Frame Anchoring | First frame only | First + Last frame |
| Multi-Image Input | Not supported | 9-grid (3x3 image array) |
| Video Editing | Not supported | Instruction-based editing via natural language |
| Reference Types | Voice OR subject (separate) | Voice + subject (combined, single pass) |
| Video References | 1 to 2 | Up to 5 |
Source: Alibaba official documentation, WaveSpeedAI feature comparisons, and Replicate model listings as of April 2026.
How does Wan 2.7 work?
Wan 2.7 Image uses a unified generation-and-understanding architecture that maps text and visual semantics into a shared latent space. Rather than treating image generation and image comprehension as separate tasks, the model couples them from the start. Wan 2.7 Video uses a Diffusion Transformer with MoE routing, where high-noise and low-noise experts specialize in different phases of the denoising process.
On the image side, the shared latent space means the model does not need to guess what your words mean. Text and image are tightly linked during training on multimodal instructions (text + image inputs together). This pushes the model beyond surface-level pixel fitting into layout planning and composition reasoning. The thinking mode adds an explicit reasoning step on top of this architecture.
On the video side, the bidirectional frame control works by accepting both a first and last frame as constraints. The model generates motion trajectories between the two endpoints, which reduces the number of generation attempts needed to get the right motion path. The 9-grid I2V mode takes a 3x3 arrangement of nine reference images and converts them into a continuous video. The grid reads left-to-right, top-to-bottom, so the panel sequence determines the scene order.
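The 9-grid's left-to-right, top-to-bottom reading order is just row-major indexing, which a one-liner makes explicit when you are arranging panels:

```python
# Row-major ordering of the 9-grid: panel index i maps to cell divmod(i, 3).

def grid_position(panel_index: int) -> tuple[int, int]:
    """(row, col) in the 3x3 grid for panel 0..8, read left-to-right, top-to-bottom."""
    if not 0 <= panel_index < 9:
        raise ValueError("9-grid panels are indexed 0 to 8")
    return divmod(panel_index, 3)

# Panel 0 is the top-left scene, panel 8 the bottom-right (final) scene.
print(grid_position(0))  # prints: (0, 0)
print(grid_position(5))  # prints: (1, 2)
print(grid_position(8))  # prints: (2, 2)
```

So if a scene plays out of order in the generated video, the first thing to check is whether the panels were pasted into the grid in row-major order.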
On Floyo, both image and video models run through ComfyUI nodes on H100 NVL GPUs. You can chain Wan 2.7 Image with Wan 2.7 Video in the same workflow. Generate a character image with thinking mode, then animate it with first/last frame control, then upscale the result. All in one pipeline, all in your browser.
Frequently Asked Questions
Common questions about running Wan 2.7 on Floyo.
How much does it cost to run Wan 2.7 on Floyo?
You can try it with Floyo's free tier, which gives you 20 minutes of GPU time per day. Previous Wan versions (2.1, 2.2) were open-source under Apache 2.0, so there was no additional API cost beyond GPU time. Pricing for 2.7 depends on how it is made available on Floyo (open-source model vs. API node). Floyo gives $1 in free API credits on signup for API-based models.
How do I run Wan 2.7 on Floyo?
Open Floyo in your browser, find a Wan 2.7 workflow (search "Wan 2.7" in the template library), and click Run. Floyo handles the GPU, the ComfyUI environment, and the model weights. No local install, no Python setup, no API key required.
Who made Wan 2.7?
Alibaba Tongyi Lab, the same team behind the entire Wan model series. Wan 2.7 Image was released April 1, 2026. Wan 2.7 Video launched on cloud platforms in late March 2026. Earlier versions in the series were open-sourced on GitHub and HuggingFace.
What is thinking mode?
A reasoning step where the model analyzes composition, spatial relationships, and prompt logic before generating an image. It produces more coherent compositions and better prompt adherence, especially for complex multi-element scenes. It is built into Text-to-Image and enabled by default. The trade-off is slightly longer generation time.
Can Wan 2.7 render readable text in images?
Yes. Wan 2.7 Image has significantly improved text rendering compared to previous generations and most competitors. Signs, labels, and typography are readable and accurate. The model supports up to 3,000 tokens of text per image, enough to fill charts, formulas, and dense page layouts.
How does Wan 2.7 compare to Midjourney V8?
Wan 2.7 leads on instruction following, text rendering accuracy, and multi-reference editing. Midjourney V8 leads on artistic aesthetics. Wan 2.7 supports 4K output (Pro tier), hex color control, and API access. Midjourney does not offer API access. If your workflow requires precise text in images or brand-accurate colors, Wan 2.7 has the edge. If you want the strongest artistic "vibe," Midjourney is still hard to beat.
Can I chain Wan 2.7 Image and Wan 2.7 Video in one workflow?
Yes. Floyo runs ComfyUI, which lets you chain multiple models in a single workflow. Generate a character with Wan 2.7 Image using thinking mode, animate it with Wan 2.7 Video using first/last frame control, then upscale the result. All in one pipeline, all in your browser.
Does Wan 2.7 Video generate audio?
Yes. Wan 2.7 Video includes native audio-visual generation. Audio and video are synchronized during generation. The enhanced reference-to-video mode can also combine a voice reference with a subject appearance reference in a single pass.
What are the maximum resolution and duration?
For images: up to 2048x2048 (standard) or 4096x4096 (Pro tier). For video: up to 1080p at 24fps, with 4K reported in some configurations. Video duration ranges from 2 to 15 seconds, with up to 20-30 seconds reported in extended modes.
Is Wan 2.7 open source?
As of this writing, Wan 2.7 launched on cloud platforms and via API first. Open weights have not been officially confirmed for the 2.7 series. Based on the Wan family's pattern (2.1 and 2.2 were both open-sourced under Apache 2.0), open weights are expected within 4 to 8 weeks of the cloud launch. On Floyo, you can run Wan 2.7 through ComfyUI without needing local weights.
Wan 2.7 is Now Live on Floyo
Thinking mode image generation, 4K output, text rendering, first/last frame video, and native audio. Run it in your browser.
Related Reading
Film and Animation Workflows on Floyo
Setting Up an AI Production Pipeline for Your Studio
Last updated: April 2026. Specs from Alibaba Tongyi Lab official documentation, Alibaba Cloud press release, WaveSpeedAI model listings, Replicate model documentation, and third-party reviews.
Wan 2.7 - Text to Video
Generate video from a text prompt using Alibaba's Wan 2.7 model. Set your resolution, aspect ratio, and duration, then hit Run. Audio input supported.
