qwen

Qwen: Qwen3 VL 30B A3B Instruct

Qwen3 VL 30B A3B Instruct is a multimodal model from Qwen that accepts both text and image inputs and supports tool use. Its context window reaches 262,144 tokens, which accommodates long documents or extended conversations, and it can return up to 32,768 tokens per response. Reasoning mode and structured output are not confirmed as supported features based on available data. At $0.13 per million input tokens and $0.52 per million output tokens, the model sits at a budget-friendly tier for a multimodal system with a large context window. However, there is currently no independent benchmark coverage to verify its performance against peers, so capability claims remain unproven in third-party testing. Teams with high image-plus-text volume who want low per-token costs may find it worth piloting, but anyone needing validated accuracy benchmarks before committing should wait for broader evaluation data to emerge.

Quality Score
99/100
price + capability + benchmarks
Input Price
$0.13
per 1M tokens
Output Price
$0.52
per 1M tokens
Context Window
262,144
tokens
Model ID
qwen/qwen3-vl-30b-a3b-instruct
Vendor
qwen
Tokenizer
Qwen3
Input Modalities
text, image
Output Modalities
text
Max Output
32,768 tokens
Tool Calling
✓ supported
Structured Output
✓ supported
Reasoning Mode
not supported
Vision
✓ accepts images
Audio
no
Moderated
no

Similar models