Gemma 4 E4B: Ask Anything About an Image
Upload an image, ask a question or give an instruction, and get a written answer from Gemma 4 E4B, Google DeepMind's open-weights multimodal model. Add audio to transcribe or analyze it alongside the image.
gemma 4
google deepmind
image to text
multimodal llm
opensource
vision language
0
16
Nodes & Models
CLIPLoader
gemma4_e4b_it_fp8_scaled.safetensors
LoadAudio
LoadImage
TextGenerate
PreviewAny
ABOUT THE WORKFLOW
Read an Image and Answer
Upload an image and type a question or instruction. The model reads what is in the image and answers in plain language. You can also add an audio clip to transcribe or describe alongside the image.
Model
Gemma 4 E4B by Google DeepMind. An open-weights multimodal model built from Gemini 3 research that takes text, image, and audio and writes a text response. Strong at description, analysis, transcription, and question answering, with a configurable thinking mode.
HOW IT WORKS
Step 1. Upload your image
The image you want the model to read, describe, or answer questions about.
Works great with: photos · screenshots · charts · documents
Step 2. Write your prompt
Ask a question or give an instruction in plain language. Example: List every object on the desk and where each one sits.
Step 3. Add audio (optional)
Connect an audio file to have the model transcribe or respond to spoken content along with the image. Turned off by default.
Step 4. Hit run and read the answer
The model returns a text response in the workflow. Preview it, then copy it out.
Ready for: alt text · captions · summaries
First time? Leave every setting as-is. The defaults are the right starting point for almost everyone.
RECOMMENDED SETTINGS
Quick-start guide. Find the goal that matches yours and copy the settings.
Standard answer (most people). Max length 2048 · temperature 0.7 · thinking off · random seed. The right starting point for almost everyone.
Want shorter, tighter answers. Lower the max length and drop the temperature so the model stays focused and brief.
Want it to reason through a hard question. Turn thinking on so the model works step by step before answering. Slower, better for complex prompts.
Want more varied wording. Raise the temperature for looser, more creative output.
Want to reproduce the same answer. Set a specific seed instead of random.
The answer drifts or repeats. Lower the temperature, or make your prompt more specific about what you want back.
Prompt: Be specific about the task and the format you want. Transcribe the sign in the photo and translate it to French gives a cleaner result than what does this say. When you add audio, say whether you want a transcript, a summary, or an answer about it.
LEARN
📹 Videos
ComfyUI 101 Free Course ft. Sebastian Kamph
Floyo 101 for Team Collaboration
✨ Quick links
USE CASES
🖼️ Image Description & Alt Text
Generate a clear written description of any image for accessibility, cataloguing, or captions.
🔍 Visual Q&A
Ask what is happening in a photo, count objects, or pull details out of a busy scene.
🎙️ Audio Transcription
Add an audio clip and get a transcript or a summary of what was said.
📊 Chart & Document Reading
Point it at a screenshot of a chart, receipt, or page and ask it to read out the numbers or text.
🌍 Multilingual Tasks
Read and answer about content across many languages, including translation of text in an image.
WHAT WORKS BEST / WHAT TO AVOID
✅ Works great
Clear, specific questions and instructions
Single well-lit images with readable detail
Charts, screenshots, and documents
Clean audio for transcription
⚠️ May produce softer results
Vague prompts like "describe this"
Blurry, dark, or very low-resolution images
Crowded scenes with many overlapping objects
Noisy or muffled audio clips
FAQ
What is Gemma 4 E4B?
Gemma 4 E4B is one of the small models in Google DeepMind's open-weights Gemma 4 family. Gemma 4 models are multimodal, handling text and image input, with audio supported natively on the E2B, E4B, and 12B models, and generating text output. The family is built from Gemini 3 research, and the E4B size is designed to run efficiently rather than at frontier scale. Google AI
Can Gemma 4 understand images and audio?
Yes. The E4B model reads images and audio as native inputs and replies in text. For best results with multimodal inputs, place the image or audio content before the text in your prompt. That makes it a fit for description, visual question answering, and transcription in a single workflow. Ollama
What is the thinking mode in Gemma 4?
Thinking is a built-in reasoning mode. It is a step-by-step reasoning mode that lets the model think before answering, and all models in the family are designed as capable reasoners with configurable thinking modes. Turn it on for complex prompts and leave it off for quick, direct answers. Google AI
How is E4B different from the larger Gemma 4 models?
E4B is one of the edge sizes alongside E2B, while the family also includes larger 12B, 26B, and 31B models. The small models feature a 128K context window, while the medium models support 256K. The smaller sizes are tuned for efficient local execution, so E4B trades some raw capability for speed and a lighter footprint. Google AI
What languages does Gemma 4 support?
Gemma 4 maintains multilingual support in over 140 languages. That covers reading, answering, and translating text that appears inside an image. LM Studio
Can I use Gemma 4 results commercially?
Gemma 4 is released as open-weights, and the model card lists it under the Apache 2.0 license, so commercial use is generally allowed. Review Google's current Gemma terms before shipping, and make sure you hold the rights to any image or audio you upload.
How to run Gemma 4 online?
You can run Gemma 4 online through Floyo. No installation, no setup, no API key to wire up. Open the workflow in your browser, upload an image, write your prompt, and hit run. Free to try.
WHY FLOYO?
Floyo is the only platform with team collaboration for ComfyUI in the browser. You run workflows with no install. You share run history, assets, and models across your team. You pay only when you generate. Floyo supports open-source and closed-source models.
A researcher runs an analysis and likes the result. A teammate opens that exact run from shared history and keeps going. No file handoffs. No version confusion.
For studios and enterprise teams, Floyo adds private workspaces, pooled resources, and a team usage dashboard. Other ComfyUI cloud tools run for one person at a time. Floyo runs for the whole team, with transparent per-generation costs.
Ready to try it?
Upload an image, write your question, and run it. The settings are already set.
Questions? Watch the free course or check the FAQ above.
Read more








