Qwen: Qwen3 VL 32B Instruct
Qwen3 VL 32B Instruct is a multimodal model from Qwen that accepts text and image inputs and produces text, so it suits tasks that combine visual content with written prompts. It supports a 262,144-token context window and tool calling, and responses are capped at 32,768 tokens. It has no native reasoning mode; structured output is listed as supported in the specifications below. Pricing is $0.104 per million input tokens and $0.416 per million output tokens, so output tokens cost four times as much as input tokens, which is modest for a vision-capable model. Its blended benchmark score of 20.4 covers only 3 benchmarks, so performance on tasks outside those sampled categories is largely unproven. Teams running image-heavy workflows on a budget may want to shortlist it on price, but buyers who need benchmark coverage across coding, reasoning, or agentic tasks should treat the current score as a limited signal and test the model directly against their own workloads.
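To make the pricing concrete, the per-token rates quoted above can be turned into a rough per-request estimate. The following is a minimal sketch, not an official billing formula: the function name `estimate_cost_usd` is hypothetical, and it assumes cost is simply linear in input and output tokens at the listed rates, with no surcharges, caching discounts, or separate image pricing.

```python
# Listed rates from the description above, in USD per million tokens.
INPUT_USD_PER_MTOK = 0.104
OUTPUT_USD_PER_MTOK = 0.416


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD.

    Assumption: billing is linear in token counts at the listed rates.
    How image inputs are converted to tokens is provider-specific and
    is not modeled here.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 10,000-token prompt with a 1,000-token reply.
    print(f"{estimate_cost_usd(10_000, 1_000):.6f} USD")
```

Because output tokens are priced at four times the input rate, long generated responses affect the bill more than long prompts of the same length.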
- Model ID: qwen/qwen3-vl-32b-instruct
- Vendor: qwen
- Tokenizer: Qwen
- Input Modalities: text, image
- Output Modalities: text
- Max Output: 32,768 tokens
- Tool Calling: ✓ supported
- Structured Output: ✓ supported
- Reasoning Mode: not supported
- Vision: ✓ accepts images
- Audio: no
- Moderated: no
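
The two size limits listed for this model (a 262,144-token context window and a 32,768-token output cap) can be checked before sending a request. The sketch below is illustrative only: the helper `clamp_max_output` is hypothetical, and it assumes that input and output tokens share the same context window, which is common but not confirmed here for this model. Counting tokens for images is provider-specific and is left to the caller.

```python
# Limits taken from the listing above.
CONTEXT_WINDOW_TOKENS = 262_144
MAX_OUTPUT_TOKENS = 32_768


def clamp_max_output(input_tokens: int, requested_output: int) -> int:
    """Return the largest output budget that fits the listed limits.

    Assumption: prompt tokens and generated tokens both count against
    the context window. The caller supplies input_tokens, including
    any tokens attributed to images.
    """
    if input_tokens < 0 or requested_output < 1:
        raise ValueError("input_tokens must be >= 0 and requested_output >= 1")
    remaining = CONTEXT_WINDOW_TOKENS - input_tokens
    if remaining < 1:
        raise ValueError("prompt leaves no room for output in the context window")
    # The budget is bounded by the request, the output cap, and the space left.
    return min(requested_output, MAX_OUTPUT_TOKENS, remaining)


if __name__ == "__main__":
    # A short prompt asking for more than the cap is limited by the cap.
    print(clamp_max_output(1_000, 50_000))
    # A near-full context is limited by the remaining space.
    print(clamp_max_output(262_000, 4_096))
```

Clamping on the client side avoids a rejected request when a prompt plus the requested output would exceed the listed limits.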