qwen

Qwen: Qwen3 VL 32B Instruct

Qwen3 VL 32B Instruct is a multimodal model from Qwen that accepts both text and image inputs, making it usable for tasks that involve visual content alongside written prompts. It supports a 262,144-token context window and tool use, though it does not offer native reasoning mode, and structured output support is unconfirmed. Maximum response length is capped at 32,768 tokens. On price, it sits at $0.104 per million input tokens and $0.416 per million output tokens, which is modest for a vision-capable model. However, its blended benchmark score of 20.4 covers only 3 benchmarks, so performance on tasks outside those sampled categories is largely unproven. Teams processing image-heavy workflows on a budget may want to shortlist it for that reason, but buyers who need confident benchmark coverage across coding, reasoning, or agentic tasks should treat the current scores as a limited signal and test it directly against their own workloads.

Quality Score
99/100
price + capability + benchmarks
Input Price
$0.10
per 1M tokens
Output Price
$0.42
per 1M tokens
Context Window
262,144
tokens
Model ID
qwen/qwen3-vl-32b-instruct
Vendor
qwen
Tokenizer
Qwen
Input Modalities
text, image
Output Modalities
text
Max Output
32,768 tokens
Tool Calling
✓ supported
Structured Output
✓ supported
Reasoning Mode
not supported
Vision
✓ accepts images
Audio
no
Moderated
no

Similar models