Qwen3-VL Image and Video Captioning
Upload an image or video and get a detailed text description from Qwen3-VL. Choose your model size, pick a preset prompt, or write your own. Runs in your browser.
Captioning
LLM
Prompt Generator
Qwen3VL
VLM
Nodes & Models
VHS_LoadVideo
AILab_LoadImage
MarkdownNote
WorkflowGraphics
AILab_QwenVL_Advanced
AILab_QwenVL
PreviewAny
Description:
Get text descriptions of images and videos using Qwen3-VL, Alibaba's vision-language model.
Upload an image, pick a preset prompt like "Detailed Description" or write your own question, and hit run. Qwen3-VL looks at your image and writes back a text response. Works with video too through the Advanced node. Supports model sizes from 2B to 32B, with optional quantization to save VRAM.
How do you caption images with Qwen3-VL?
Upload your image, choose a model size, and select a preset prompt or write a custom one. Qwen3-VL analyzes what's in the image and returns a text description. You can control output length, switch between model sizes for speed vs. accuracy, and use quantization to run larger models on less VRAM.
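Under the hood this maps onto a standard vision-language generate loop. Here's a minimal sketch of the equivalent call in Python with Hugging Face transformers; the repo id and file name are illustrative assumptions, and the node's internal code may differ:

```python
# Minimal captioning sketch using Hugging Face transformers.
# ASSUMPTIONS: the "Qwen/Qwen3-VL-2B-Instruct" repo id and "photo.jpg"
# are illustrative; the workflow node's internals may differ.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-2B-Instruct"  # swap in 4B/8B/32B for more detail
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("photo.jpg")},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
caption = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```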
Model Name: Picks which Qwen3-VL model to load. The 2B model is fast and light. The 4B and 8B models give more detailed, accurate descriptions. Want speed? Stick with 2B. Need better accuracy on complex images? Move up to 4B or 8B. The 32B model gives the best results but needs the most VRAM.
Quantization: Controls how much the model gets compressed to fit in memory. "None (FP16)" runs the model at its native 16-bit precision with no compression. "8-bit" cuts memory use roughly in half with minimal quality loss. "4-bit" cuts it further, but descriptions may lose some nuance. Start with 8-bit if you're unsure (see the sketch after this list).
Preset Prompt: A dropdown of ready-made prompts for common tasks. "Detailed Description" gives you a thorough breakdown of what's in the image. Other presets target specific tasks like OCR or object identification. Pick the one closest to what you need, or ignore it and write your own.
Custom Prompt: Overrides the preset. Write any question or instruction: "What objects are on the table?", "Describe the mood of this photo", or "Read all the text in this image." The model follows your instruction and responds in text.
Max Tokens: How long the response can be. Default is 512. Want a quick one-liner? Set it to 64. Need a paragraph-level breakdown? Push it to 1024 or higher. Max is 2048.
Temperature (Advanced node): Controls how creative or deterministic the output is. Default is 0.6. Want consistent, predictable captions? Lower it toward 0.1. Want more varied, expressive descriptions? Push it toward 0.9. Only applies when num_beams is set to 1.
Top P (Advanced node): Nucleus sampling threshold. Default is 0.9. Lower values make the model pick from fewer word choices, giving tighter output. Higher values let it explore more options. Works alongside temperature when num_beams is 1.
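If you're curious how these settings map onto code, here's a hedged sketch continuing the example above: BitsAndBytesConfig covers the 8-bit/4-bit options, the custom-over-preset rule is one line, and the Advanced knobs pass straight into generate(). The repo id is an assumption and the node's actual internals may differ.

```python
# Sketch of the settings above in transformers terms; continues the first
# example (reuses `inputs`). The node's real implementation may differ.
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# Quantization: 8-bit roughly halves memory; 4-bit shrinks it further
quant = BitsAndBytesConfig(load_in_8bit=True)  # or load_in_4bit=True
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",  # assumed repo id
    quantization_config=quant,
    device_map="auto",
)

# Custom Prompt overrides Preset Prompt when non-empty
preset_prompt = "Describe this image in detail."
custom_prompt = ""
prompt = custom_prompt.strip() or preset_prompt

# Advanced generation knobs; sampling only applies when num_beams == 1
output_ids = model.generate(
    **inputs,               # prepared as in the first sketch, using `prompt`
    max_new_tokens=512,     # Max Tokens
    do_sample=True,
    temperature=0.6,        # lower -> more deterministic
    top_p=0.9,              # nucleus sampling threshold
    num_beams=1,
    repetition_penalty=1.05,
)
```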
What is Qwen3-VL captioning good for?
Qwen3-VL captioning works for generating alt text, creating image descriptions for datasets, reading text from photos (OCR), answering questions about what's in an image, and analyzing video content frame by frame. It handles documents, charts, UI screenshots, and natural photos.
Use this when you need text from visuals. Product photography teams can auto-generate descriptions for catalog images. Dataset builders can caption thousands of training images. Accessibility teams can generate alt text at scale.
Qwen3-VL also reads text in images across 32 languages, so it handles receipts, signs, documents, and handwritten notes. If you have a chart or UI screenshot, it can describe the layout and content.
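Prompt-wise, OCR is just a different instruction through the same pipeline. A small variation on the first sketch (the file name is illustrative):

```python
# Reuse the first sketch, changing only the message content
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("receipt.jpg")},  # illustrative file
        {"type": "text", "text": "Read all the text in this image, line by line."},
    ],
}]
```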
The video path (through the Advanced node) samples frames from a video and describes what's happening across them. Good for logging video content or building searchable metadata.
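For intuition, uniform frame sampling looks roughly like the OpenCV sketch below. This is an illustrative implementation of the idea, not the node's actual code:

```python
# Rough sketch of uniform frame sampling, the same idea the Advanced node's
# video path uses (default 16 frames). cv2 is an assumption, not the node's code.
import cv2
from PIL import Image

def sample_frames(path: str, num_frames: int = 16) -> list[Image.Image]:
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick evenly spaced frame indices across the whole clip
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR; convert to RGB before handing to the processor
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames
```

The sampled frames would then go to the model as a video-type message entry, following the same chat-template pattern as the image sketch above.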
For single quick questions about one image, the Standard node with the 2B model is the fastest path. For batch work or detailed analysis, step up to a larger model and use the Advanced node for finer control over generation settings.
FAQ
What Qwen3-VL model size should I use for image captioning? The 2B model runs fastest and works well for short captions and quick descriptions. The 4B and 8B models produce more detailed, accurate output for complex scenes or documents. The 32B model gives the best results but needs more GPU memory. Start with 2B and move up if your descriptions need more depth.
Can Qwen3-VL read text in images (OCR)? Yes. Qwen3-VL supports OCR across 32 languages. It reads printed text, handwritten notes, receipts, signs, and documents. Use the preset prompt for OCR or write a custom prompt like "Read all text in this image." It handles low light, blur, and angled text.
What is the difference between the Standard and Advanced Qwen3-VL nodes? The Standard node covers the basics: model selection, prompt, and max tokens. The Advanced node adds temperature, top_p, num_beams, and repetition_penalty for fine-tuning how the model generates text. Most users only need the Standard node. Switch to Advanced when you want tighter control over output style or consistency.
Can Qwen3-VL describe videos? Yes. The Advanced node accepts video input and samples frames (configurable, default 16 frames) to analyze what's happening over time. Connect a video loader to the Advanced node's video input and set the frame count based on your video length.
How do I run Qwen3-VL captioning online? You can run Qwen3-VL captioning online through Floyo. No installation, no setup. Open the workflow in your browser, upload your inputs, and hit run. Free to try.