qwen

Qwen: Qwen2.5 VL 72B Instruct

Qwen2.5 VL 72B Instruct is a multimodal model from Qwen that accepts both text and image inputs, making it suitable for tasks that require visual understanding alongside language processing. It offers a 131K-token context window and supports up to 128K output tokens, which accommodates long-form generation. The model does not support tool use, reasoning modes, or structured output, so workflows that depend on those features will need to look elsewhere. At $0.80 per million input tokens and $1.00 per million output tokens, it sits in a mid-range price band for a large vision-language model. The practical catch is that there is currently no independent benchmark coverage to validate its real-world performance, so buyers are working without third-party evidence. Teams that need a capable vision and text model at a moderate price and are comfortable running their own evaluations may want to shortlist it; those who require proven benchmark scores before committing should wait for more coverage.

Quality Score
80/100
price + capability + benchmarks
Input Price
$0.80
per 1M tokens
Output Price
$1.00
per 1M tokens
Context Window
131,072
tokens
Model ID
qwen/qwen2.5-vl-72b-instruct
Vendor
qwen
Tokenizer
Qwen
Input Modalities
text, image
Output Modalities
text
Max Output
128,000 tokens
Tool Calling
not supported
Structured Output
✓ supported
Reasoning Mode
not supported
Vision
✓ accepts images
Audio
no
Moderated
no

Similar models