Gemma 4 E4B: Ask Anything About a Video
Upload a video, ask a question or give an instruction, and get a written answer from Gemma 4 E4B, Google DeepMind's open-weights multimodal model. It reads the frames and the audio track together.
gemma 4
google deepmind
multimodal llm
video to text
video understanding
Nodes & Models
CLIPLoader
gemma4_e4b_it_fp8_scaled.safetensors
LoadVideo
GetVideoComponents
TextGenerate
PreviewAny
ABOUT THE WORKFLOW
Read a Video and Answer
Upload a video and type a question or instruction. The model looks at frames across the clip and listens to the audio track, then answers in plain language. One run covers both what is shown and what is said.
Model
Gemma 4 E4B by Google DeepMind. An open-weights multimodal model built from Gemini 3 research that reads text, image, video, and audio and writes a text response. Strong at description, analysis, transcription, and question answering, with a configurable thinking mode.
HOW IT WORKS
Step 1. Upload your video
The clip you want the model to watch. It samples frames across the video and reads the audio track on its own, so you do not need to add audio separately.
Works great with: clips · screen recordings · interviews · tutorials
Step 2. Write your prompt
Ask a question or give an instruction in plain language. Example: Summarize what happens in the clip and transcribe the dialogue.
Step 3. Hit run and read the answer
The model returns a text response in the workflow. Preview it, then copy it out.
Ready for: summaries · captions · notes
First time? Leave every setting as-is. The defaults are the right starting point for almost everyone.
RECOMMENDED SETTINGS
Quick-start guide. Find the goal that matches yours and copy the settings.
Standard answer (most people). Max length 2048 · temperature 0.7 · thinking off · random seed. The right starting point for almost everyone.
Want a longer summary or full transcript. Raise the max length so the answer is not cut short on a longer clip.
Want it to reason through a hard question. Turn thinking on so the model works step by step before answering. Slower, better for complex prompts.
Want to reproduce the same answer. Set a specific seed instead of random.
The answer misses parts of the video. Be specific in your prompt about which moments or details you want covered.
The output repeats or drifts. Lower the temperature, or tighten the prompt around a single task.
Prompt: Name the task and the format you want back. "List each scene with a timestamp and a one-line description" gives a cleaner result than "tell me about this video". You can ask for visuals and audio in one prompt, like a summary plus a transcript.
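For readers who like to see the quick-start guide as data, here is a minimal sketch in Python. It assumes a hypothetical GenerationSettings container: the field names are illustrative and are not the real input names of the TextGenerate node. The default values come from the guide above. The numbers used for "raise", "lower", and "set a seed" (4096, 0.3, 42) are assumed examples only, since the guide does not give specific values.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container; field names are illustrative, not the node's real inputs."""
    max_length: int = 2048        # default from the guide
    temperature: float = 0.7      # default from the guide
    thinking: bool = False        # default from the guide: thinking off
    seed: Optional[int] = None    # None stands for "random seed"


STANDARD = GenerationSettings()

# Assumed example values: the guide says "raise", "lower", and "set a seed"
# without giving numbers, so 4096, 0.3, and 42 are placeholders.
PRESETS = {
    "standard": STANDARD,
    "long_transcript": replace(STANDARD, max_length=4096),
    "hard_question": replace(STANDARD, thinking=True),
    "reproducible": replace(STANDARD, seed=42),
    "less_drift": replace(STANDARD, temperature=0.3),
}

if __name__ == "__main__":
    for goal, settings in PRESETS.items():
        print(goal, settings)
```

Each preset changes one setting relative to the standard answer, which mirrors the guide: start from the defaults and adjust only what your goal needs.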
LEARN
📹 Videos
ComfyUI 101 Free Course ft. Sebastian Kamph
Floyo 101 for Team Collaboration
USE CASES
🎬 Video Summaries
Turn a clip into a written summary of what happens, scene by scene.
🗣️ Dialogue Transcription
Pull a transcript of the spoken audio straight from the video, with no separate audio file.
🔍 Scene & Action Analysis
Ask what someone does, count events, or check whether a moment appears in the footage.
♿ Captions & Descriptions
Generate captions and described-video text for accessibility from a single run.
🌍 Multilingual Clips
Read and answer about footage across many languages, including translating on-screen or spoken text.
WHAT WORKS BEST / WHAT TO AVOID
✅ Works great
Clear, specific questions and instructions
Short to medium clips with readable detail
Clean audio for transcription
Footage with a single main subject or scene
⚠️ May produce softer results
Vague prompts like "describe this video"
Very long clips, where frames are sampled more sparsely
Dark, blurry, or low-resolution footage
Overlapping speakers or noisy audio
FAQ
What is Gemma 4 E4B?
Gemma 4 E4B is one of the small models in Google DeepMind's open-weights Gemma 4 family, built from Gemini 3 research. Gemma 4 models are multimodal, handling text and image input, with audio supported natively on the E2B, E4B, and 12B models, and generating text output. The E4B size is tuned to run efficiently rather than at frontier scale. Google AI
Can Gemma 4 understand video and audio together?
Yes. Gemma 4 processes text, image, video, and audio, with audio featured natively on the E2B, E4B, and 12B models. This workflow feeds it both the sampled frames and the audio track, so a single answer can cover what is on screen and what is said. Google AI
How does the model read a video?
The workflow pulls frames from across the clip and separates the audio, then passes both to the model alongside your prompt. For best results with multimodal inputs, place the image or audio content before the text in your prompt. Longer clips are sampled more sparsely, so shorter clips give a denser read. Ollama
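To see why longer clips get a sparser read, here is a minimal sketch in Python. It assumes the frames are spread evenly across the clip and that the number of frames is fixed. The workflow's actual sampling method and frame count are not stated here, so the function names and the 16-frame budget below are hypothetical.

```python
def sample_timestamps(duration_s: float, num_frames: int) -> list[float]:
    """Return evenly spaced timestamps (seconds), one at the centre of each equal segment.

    Assumption: uniform sampling with a fixed frame budget. This is an
    illustration, not the workflow's documented behaviour.
    """
    if duration_s <= 0 or num_frames <= 0:
        raise ValueError("duration_s and num_frames must be positive")
    step = duration_s / num_frames
    return [(i + 0.5) * step for i in range(num_frames)]


def gap_between_samples(duration_s: float, num_frames: int) -> float:
    """Seconds of footage between two neighbouring sampled frames."""
    if duration_s <= 0 or num_frames <= 0:
        raise ValueError("duration_s and num_frames must be positive")
    return duration_s / num_frames


if __name__ == "__main__":
    frame_budget = 16  # hypothetical budget, chosen only for the example
    for duration in (30, 600):
        gap = gap_between_samples(duration, frame_budget)
        print(f"{duration} s clip: one frame every {gap} s")
```

Under these assumptions, a 30-second clip gets a frame about every 2 seconds, while a 10-minute clip gets one roughly every 40 seconds, so brief moments in a long clip can fall between samples.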
What is the thinking mode in Gemma 4?
Thinking is a built-in, step-by-step reasoning mode that lets the model think before answering. All models in the family are designed as capable reasoners with configurable thinking modes. Turn it on for complex questions and leave it off for quick answers. Google AI
What languages does Gemma 4 support?
Gemma 4 maintains multilingual support in over 140 languages. That covers transcribing, answering, and translating spoken or on-screen text in a clip. LM Studio
Can I use Gemma 4 results commercially?
Gemma 4 is released as open-weights, and the model card lists it under the Apache 2.0 license, so commercial use is generally allowed. Review Google's current Gemma terms before shipping, and make sure you hold the rights to any video you upload.
How do I run Gemma 4 online?
You can run Gemma 4 online through Floyo. No installation, no setup, no API key to wire up. Open the workflow in your browser, upload a video, write your prompt, and hit run. Free to try.
WHY FLOYO?
Floyo is the only platform with team collaboration for ComfyUI in the browser. You run workflows with no install. You share run history, assets, and models across your team. You pay only when you generate. Floyo supports open-source and closed-source models.
An editor runs an analysis and likes the result. A teammate opens that exact run from shared history and keeps going. No file handoffs. No version confusion.
For studios and enterprise teams, Floyo adds private workspaces, pooled resources, and a team usage dashboard. Other ComfyUI cloud tools run for one person at a time. Floyo runs for the whole team, with transparent per-generation costs.
Ready to try it?
Upload a video, write your question, and run it. The settings are already set.
Questions? Watch the free course or check the FAQ above.