Run Gemini Omni on Floyo

AI VIDEO GENERATION

Google DeepMind's "anything from anything" model. Unified architecture that generates video with synchronized audio from text, images, audio, and video inputs. Conversational multi-turn editing without regenerating from scratch. Fuses Gemini reasoning with Veo rendering and Genie world simulation.

Run Google's Gemini Omni through ComfyUI in your browser. No API key, no installs, no local GPU.

Architecture: Gemini + Veo + Genie (unified)

Duration: Up to 10 seconds (Flash)

Input: Text + image + audio + video

Editing: Conversational multi-turn

Coming Soon to Floyo → Browse All Models

No installation. Runs in browser. Updated May 2026.

What do you get?

Gemini Omni is Google DeepMind's unified multimodal generation model, announced at Google I/O 2026 on May 19. It is a reasoning model that generates video, not a video model with reasoning bolted on. The architecture fuses Gemini's language reasoning with Veo's video rendering and DeepMind's Genie world simulation into a single system. It accepts any combination of text, images, audio, and video as input and returns coherent video with synchronized audio. You can then keep editing in plain English, one instruction at a time, without regenerating from scratch. The first model, Gemini Omni Flash, launched across the Gemini app, Google Flow, and YouTube Shorts. Google has confirmed API access for developers, and Gemini Omni is coming soon to Floyo as a ComfyUI API node.

What is Gemini Omni?

Gemini Omni is Google DeepMind's "anything from anything" model family, announced by CTO Koray Kavukcuoglu at Google I/O 2026 on May 19. The first release, Gemini Omni Flash, generates video with synchronized audio from any combination of text, images, audio, and video inputs. It replaces Veo inside the Gemini app and marks the shift from standalone video generation toward unified multimodal creation.

The key distinction from every other video model: Gemini Omni is a reasoning model that creates video, not a video model. It fuses three Google DeepMind systems: Gemini's language reasoning engine (understanding intent, context, and physics), Veo's video rendering capabilities (cinematic generation), and Genie's world simulation (physics-aware environment modeling). The result understands what you are asking, why you are asking it, and how the physical world behaves.

Conversational editing is the feature that separates Omni from everything else on the market. Generate a video, then tell the model "make the sky more dramatic," "add a person walking from left to right," or "change the music to something more tense." The model edits the existing generation without starting over. Character consistency, physics, and scene lighting persist across multiple rounds of edits. No other video model supports this at launch.

Gemini Omni Flash generates clips up to 10 seconds. Google's product management director Nicole Brichtova confirmed this is a deployment choice, not a model limitation. The 10-second ceiling maps directly to YouTube Shorts economics and keeps per-generation cost low for mass distribution. Resolution is reported at 720p at 24fps, though Google did not publish an official resolution spec at launch.

On Floyo, Gemini Omni will run through ComfyUI API nodes once developer API access opens. You will be able to chain it with other models in the same workflow. The ComfyUI integration is coming soon.

What are Gemini Omni's technical specifications?

Gemini Omni uses a unified architecture that fuses Gemini reasoning, Veo rendering, and Genie world simulation. The Flash variant generates up to 10 seconds of video with synchronized audio from any input combination (text, image, audio, video). Conversational multi-turn editing preserves context across edits. Reported resolution is 720p at 24fps. Generating video of a real person requires anti-deepfake avatar onboarding, and all outputs carry SynthID watermarks.

Spec	Details
Developer	Google DeepMind
Architecture	Unified: Gemini reasoning + Veo rendering + Genie world simulation
First Model	Gemini Omni Flash (video output; image and audio output coming later)
Inputs	Text, images, audio, video, sketches (any combination)
Output (Flash)	Video with synchronized audio (up to 10 seconds)
Resolution	720p at 24fps (third-party reported; Google did not publish official spec)
Editing	Conversational multi-turn (edit without regenerating from scratch)
Text Rendering	In-video text rendering (demonstrated at I/O)
Physics	Genie-powered world simulation (physics-aware generation and editing)
Avatar Safety	Spoken-number handshake required for real-person depiction (anti-deepfake)
Watermark	SynthID (on all outputs)
Replaces	Veo inside the Gemini app
Available On	Gemini app, Google Flow, YouTube Shorts, YouTube Create App
API	Developer and enterprise API "coming in the coming weeks" (as of May 19)
ComfyUI Access	Coming soon to Floyo (pending API release)
Announcement	Google I/O 2026 (May 19, 2026)

What can you create with Gemini Omni?

Gemini Omni covers text-to-video, image-to-video, video-to-video editing, audio-conditioned video, multi-turn conversational editing, reference-based generation, text rendering within video, and physics-aware scene creation. The "anything from anything" design means any input combination works. The conversational editing means you refine output through dialogue, not re-prompting.

Capability	What It Does	Use Case
Any-to-Video	Feed any combination of text, images, audio, and video. Get coherent video with synchronized audio. The model interprets intent from the input mix.	Creative production, rapid prototyping, social content
Conversational Editing	Generate a video, then edit it through plain English instructions. "Make the lighting warmer." "Add rain." "Remove the person on the left." No re-generation.	Client revisions, iterative design, creative direction
Physics-Aware Generation	Genie world simulation ensures objects interact physically. Gravity, collisions, fluid dynamics, and lighting respond to the scene context.	Product demos, VFX previz, realistic scene building
In-Video Text	Render legible text within generated video. Signs, titles, lower-thirds, and on-screen copy appear clearly. Demonstrated at I/O.	Ad creatives, social graphics, branded content
Audio-Conditioned Video	Provide a voiceover or music track as input. The model generates video that matches the audio: lip-sync for speech, rhythm-matched cuts for music.	Music videos, podcast visuals, narration-driven content
Pipeline Integration	Chain with other models in ComfyUI on Floyo. Generate a character with Nano Banana, animate with Gemini Omni, add custom narration with Fish Audio S2, upscale with Topaz.	Multi-model production pipelines

What are Gemini Omni's key features?

Gemini Omni's feature set is defined by one architectural decision: merge reasoning, rendering, and world simulation into a single model. Every feature follows from this unification. Conversational editing works because the model understands context. Physics works because the model simulates the world. Audio sync works because the model processes all modalities together.

Unified Architecture (Gemini + Veo + Genie)

Gemini Omni is not three models stitched together. It is a single architecture that fuses Gemini's language reasoning (understanding what you mean), Veo's video rendering (generating the visual output), and Genie's world simulation (modeling physics and causality). This is why the model can reason about a scene before rendering it, and why edits preserve physical consistency.

Conversational Multi-Turn Editing

Generate a video, then refine it through dialogue. "Make the sky more dramatic." "Slow down the camera pan." "Add a lens flare from the top right." The model preserves character consistency, scene lighting, and physics across edits without regenerating from scratch. This differs from other video models, where editing means re-prompting and hoping the new generation is close to the previous one.

Any-Input Flexibility

Text, images, audio, video, and sketches all work as inputs in any combination. Provide a voiceover track and a reference image and the model generates a video with the character from the image speaking the audio. Provide a video clip and a text instruction and the model edits the clip. This flexibility eliminates the need to choose between separate T2V, I2V, and V2V models.

Physics-Aware World Simulation

The Genie component gives Gemini Omni an internal model of how the physical world works. Objects fall. Water flows. Light bounces. Camera motion follows real cinematography physics. This is not post-processing. The model reasons about physics during generation, which is why edited scenes maintain physical plausibility even after multiple rounds of changes.

Anti-Deepfake Avatar Controls

To generate video featuring your own face, you record yourself speaking a number sequence. This spoken-number handshake verifies identity and prevents unauthorized deepfake generation. The control is enforced at the model level and cannot be bypassed. This is the most aggressive safety measure on any consumer video generation platform.

YouTube Shorts Integration

Gemini Omni Flash launched simultaneously on YouTube Shorts and the YouTube Create App, free of charge. The 10-second clip ceiling maps directly to the Shorts format. This gives creators direct access to the model inside the platform where the content will be published, without export, upload, or format conversion steps.

How does Gemini Omni compare to other video models?

Gemini Omni is the first model that fuses reasoning, rendering, and world simulation in a single architecture. HappyHorse 1.0 leads on arena ranking and cinematic depth-of-field. Vidu Q3 leads on duration (16 seconds). Kling 3.0 Omni leads on 4K at 60fps. Seedance 2.0 leads on multi-modal reference input. Gemini Omni's edge: conversational editing, any-input flexibility, and physics-aware reasoning from the Genie world model.

Model	Multi-Turn Edit	Any-Input	Duration	Resolution
Gemini Omni Flash	Yes (conversational)	Text+image+audio+video	10 seconds	720p (reported)
HappyHorse 1.0	V2V editing	Text+image+video	15 seconds	1080p
Vidu Q3	No	Text+image	16 seconds	1080p
Kling 3.0 Omni	Unified (no-mask)	Text+image+video	15 seconds	4K @ 60fps

Source: Google I/O 2026 keynote (May 19, 2026), Google DeepMind official blog, TechCrunch reporting, and third-party early access reviews. Resolution spec is third-party reported; Google has not published official resolution documentation as of launch.

How does Gemini Omni work?

Gemini Omni fuses three Google DeepMind systems into a unified architecture. Gemini provides the language reasoning engine: understanding intent, context, and multi-turn dialogue. Veo provides the video rendering capabilities: cinematic generation and visual quality. Genie provides the world simulation: physics-aware modeling of objects, environments, and causality. All three operate as one system, not a pipeline of separate models.

When you prompt Gemini Omni, the Gemini reasoning layer first interprets your instruction. It plans the scene: what objects exist, where they are in 3D space, how they interact, what the camera should do. The Genie world model validates the plan against physical rules: gravity, momentum, lighting, material properties. The Veo rendering engine then generates the frames and audio based on the validated plan.
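The plan, validate, render flow described above can be pictured with a toy sketch. Google has not published Gemini Omni's internals, so everything here is an illustrative assumption: the names `ScenePlan`, `plan_scene`, `validate_physics`, and `render` are hypothetical, the "physics" is a single free-fall rule, and the "frames" are plain strings. It shows only the order of the three stages, not how the model implements them.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str
    height_m: float          # height above the ground, in meters
    supported: bool          # True if the object rests on something
    velocity_y: float = 0.0  # vertical velocity in m/s (negative = falling)


@dataclass
class ScenePlan:
    prompt: str
    objects: list[SceneObject] = field(default_factory=list)
    camera: str = "static"


def plan_scene(prompt: str) -> ScenePlan:
    """Hypothetical stand-in for the reasoning step: prompt -> scene plan."""
    plan = ScenePlan(prompt=prompt)
    if "apple" in prompt:
        # An apple in mid-air: not resting on anything, not yet moving.
        plan.objects.append(SceneObject("apple", height_m=2.0, supported=False))
    if "table" in prompt:
        plan.objects.append(SceneObject("table", height_m=0.0, supported=True))
    if "pan" in prompt:
        plan.camera = "pan"
    return plan


def validate_physics(plan: ScenePlan) -> ScenePlan:
    """Hypothetical stand-in for the world-model step: unsupported objects fall."""
    for obj in plan.objects:
        if not obj.supported and obj.height_m > 0 and obj.velocity_y >= 0:
            obj.velocity_y = -9.81 * 0.1  # toy value: speed after 0.1 s of free fall
    return plan


def render(plan: ScenePlan) -> list[str]:
    """Hypothetical stand-in for the rendering step: one text 'frame' per object."""
    return [
        f"{plan.camera}: {o.name} at {o.height_m:.1f} m, vy={o.velocity_y:.2f}"
        for o in plan.objects
    ]


frames = render(validate_physics(plan_scene("an apple above a table, slow pan")))
for frame in frames:
    print(frame)
```

In this sketch the validation step corrects the plan before anything is rendered, which mirrors the ordering the paragraph describes: the floating apple is given a downward velocity, while the supported table is left alone.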

Conversational editing works because the model maintains a persistent representation of the generated scene. When you say "make the lighting warmer," the model does not regenerate from scratch. It modifies the lighting parameters in the existing scene representation and re-renders only what changed. Character positions, physics, and unaffected elements persist. This is fundamentally different from re-prompting a generation model.
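One way to picture a persistent scene representation is as an immutable state object that each edit copies and modifies. The sketch below is a conceptual analogy under stated assumptions, not Gemini Omni's actual mechanism: `SceneState`, its fields, and `apply_edit` are hypothetical names. It shows how an edit can touch one parameter, report what changed, and leave everything else intact.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneState:
    # Hypothetical scene parameters; the real representation is not public.
    lighting: str = "neutral"
    weather: str = "clear"
    characters: tuple[str, ...] = ("hiker",)
    music: str = "ambient"


def apply_edit(state: SceneState, **changes) -> tuple[SceneState, set[str]]:
    """Return the edited state plus the names of fields that actually changed."""
    new_state = replace(state, **changes)
    changed = {k for k in changes if getattr(state, k) != getattr(new_state, k)}
    return new_state, changed


v1 = SceneState()
v2, changed_2 = apply_edit(v1, lighting="warm")                  # "make the lighting warmer"
v3, changed_3 = apply_edit(v2, weather="rain", lighting="warm")  # "add rain"

print(changed_2)       # only the lighting field changed
print(changed_3)       # lighting was already warm, so only weather changed
print(v3.characters)   # untouched elements persist across both edits
```

A renderer built on this idea would only need to redo work for the fields in the returned set, which is the contrast with re-prompting: a fresh prompt would rebuild the whole state and offer no guarantee that unedited fields stay the same.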

On Floyo, Gemini Omni will run through ComfyUI API nodes once the developer API is available (confirmed as "coming in the coming weeks" at I/O). Your inputs and instructions will be sent to Google's servers, and the generated video with audio will return to your ComfyUI canvas. You will be able to chain it with local processing nodes or other API models in the same workflow.
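For readers unfamiliar with how a ComfyUI node is shaped, here is a minimal sketch following ComfyUI's custom-node conventions (`INPUT_TYPES`, `RETURN_TYPES`, `FUNCTION`, `CATEGORY`, `NODE_CLASS_MAPPINGS`). The Gemini Omni node has not been released, so the class name `GeminiOmniFlashSketch`, its inputs, and the `_call_remote_api` stub are all assumptions made for illustration. The stub performs no network request and returns a placeholder string where a real node would return video.

```python
class GeminiOmniFlashSketch:
    """Hypothetical node shape only; not the node Floyo or Google will ship."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "prompt": ("STRING", {"multiline": True, "default": ""}),
            },
            "optional": {
                # A reference image from an upstream node, if one is connected.
                "image": ("IMAGE",),
            },
        }

    # A real node would likely return video; a string placeholder is used here.
    RETURN_TYPES = ("STRING",)
    FUNCTION = "generate"
    CATEGORY = "api node/video (sketch)"

    def _call_remote_api(self, prompt, image=None):
        # Assumption: stands in for the request to the remote service.
        # No real endpoint is contacted in this sketch.
        source = "text+image" if image is not None else "text"
        return f"placeholder-video({source}): {prompt}"

    def generate(self, prompt, image=None):
        # ComfyUI expects node functions to return a tuple matching RETURN_TYPES.
        return (self._call_remote_api(prompt, image),)


# ComfyUI discovers custom nodes through this mapping.
NODE_CLASS_MAPPINGS = {"GeminiOmniFlashSketch": GeminiOmniFlashSketch}

result = GeminiOmniFlashSketch().generate("a lighthouse at dusk")
print(result[0])
```

Because the node returns a typed output, it can be wired into downstream nodes on the same canvas, which is what makes chaining with local processing nodes or other API models possible.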

Note: Gemini Omni is very new. Omni Flash launched on May 19, 2026 with video output only; image and audio output are coming in later releases. Resolution is reported at 720p/24fps by third parties; Google has not published official specs. The 10-second duration cap is a deployment decision, not a model limitation. Developer API access is confirmed but not yet available. Gemini Omni Pro, the higher-quality variant, has no committed release date. Content filtering and avatar safety controls are active. All outputs include SynthID watermarks.

Frequently Asked Questions

Common questions about running Gemini Omni on Floyo.

Is Gemini Omni free to use on Floyo?

You can start with Floyo's free pricing plan. Floyo gives $0.25 in free API credits on signup. To continue using the service beyond the free tier, upgrade your Floyo pricing plan. Gemini Omni will run as an API node, so generation costs will come from your API Wallet (separate from your plan's GPU time).

How do I run Gemini Omni without installing anything?

Once available on Floyo, open the platform in your browser, find a Gemini Omni workflow (search "Gemini Omni" in the template library), and click Run. Provide your inputs and generate. Floyo handles the ComfyUI environment and API connection. No local install, no Python setup, no API key management.

Who made Gemini Omni?

Google DeepMind. Gemini Omni was announced by CTO Koray Kavukcuoglu at Google I/O 2026 on May 19. Gemini Omni Flash is the first model in the family. It launched simultaneously on the Gemini app, Google Flow, YouTube Shorts, and YouTube Create App. Developer API access is confirmed for "the coming weeks."

What is the difference between Gemini Omni and Veo?

Veo is a dedicated video generation model: prompt it, get a clip, re-prompt for changes. Gemini Omni fuses Gemini reasoning with Veo rendering and Genie world simulation into one system. Omni understands context, physics, and intent across multi-turn edits. It can edit existing generations without starting over. Veo cannot. Gemini Omni Flash replaces Veo inside the Gemini app.

Can I edit a video after generating it?

Yes. This is Gemini Omni's headline feature. Generate a video, then give editing instructions in plain English: "make the sky more dramatic," "add a person walking," "change the music mood." The model edits without regenerating from scratch. Character consistency, physics, and lighting persist across rounds. No other video model supported conversational multi-turn editing at launch.

Can I combine Gemini Omni with other AI models in one workflow?

Yes, once the API is available on Floyo. Generate characters with Nano Banana, animate with Gemini Omni, add narration with Fish Audio S2 or ElevenLabs, upscale with Topaz. All in one ComfyUI pipeline.

Can I generate video of myself with Gemini Omni?

Yes, with a safety check. You must record yourself speaking a number sequence (avatar onboarding). This spoken-number handshake verifies your identity and prevents unauthorized deepfake generation. The control is enforced at the model level. Once verified, you can create video featuring your own likeness.

When will Gemini Omni be available on Floyo?

Gemini Omni is coming soon to Floyo as a ComfyUI API node. Google confirmed developer API access is "coming in the coming weeks" as of the May 19 announcement. Floyo will integrate once the API is available. Check back for updates or sign up to be notified when the workflow goes live.

Gemini Omni is Coming to Floyo

Google DeepMind's unified reasoning-powered video generation with conversational editing, any-input flexibility, and physics-aware world simulation. Run it in your browser.
