Qwen: Qwen3 VL 30B A3B Instruct
Qwen3 VL 30B A3B Instruct is a multimodal model from Qwen that accepts text and image inputs, produces text output, and supports tool calling and structured output. Its context window reaches 262,144 tokens, enough for long documents or extended conversations, and it can return up to 32,768 tokens per response. Reasoning mode is not supported. At $0.13 per million input tokens and $0.52 per million output tokens, the model sits in a budget-friendly tier for a multimodal system with a large context window. However, no independent benchmark coverage is currently available to verify its performance against peers, so capability claims remain unverified by third-party testing. Teams with high image-plus-text volume that want low per-token costs may find it worth piloting; anyone who needs validated accuracy benchmarks before committing should wait for broader evaluation data.
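To make the pricing concrete, the sketch below estimates the cost of a single request from the per-token prices and limits quoted above. The function name `estimate_cost` is hypothetical, and the check that input and output share the 262,144-token window is an assumption, not something the listing states. The listing also does not say how images are converted to tokens, so the caller supplies token counts directly.

```python
from decimal import Decimal

# Figures taken from the description above (USD per one million tokens).
INPUT_PRICE_PER_M = Decimal("0.13")
OUTPUT_PRICE_PER_M = Decimal("0.52")
CONTEXT_WINDOW = 262_144
MAX_OUTPUT_TOKENS = 32_768


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("output exceeds the 32,768-token response limit")
    # Assumption: input and output tokens share the same context window.
    if input_tokens + output_tokens > CONTEXT_WINDOW:
        raise ValueError("request exceeds the 262,144-token context window")
    million = Decimal(1_000_000)
    return (INPUT_PRICE_PER_M * input_tokens
            + OUTPUT_PRICE_PER_M * output_tokens) / million


if __name__ == "__main__":
    # 200,000 input tokens and 2,000 output tokens.
    print(estimate_cost(200_000, 2_000))
```

Under these prices a request with 200,000 input tokens and 2,000 output tokens comes to about $0.027, and output tokens cost four times as much as input tokens, so long responses dominate the bill faster than long prompts.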
- Model ID: qwen/qwen3-vl-30b-a3b-instruct
- Vendor: qwen
- Tokenizer: Qwen3
- Input Modalities: text, image
- Output Modalities: text
- Max Output: 32,768 tokens
- Tool Calling: ✓ supported
- Structured Output: ✓ supported
- Reasoning Mode: not supported
- Vision: ✓ accepts images
- Audio: no
- Moderated: no