Qwen: Qwen2.5 VL 72B Instruct
Qwen2.5 VL 72B Instruct is a multimodal model from Qwen that accepts both text and image inputs, making it suitable for tasks that require visual understanding alongside language processing. It offers a 131K-token context window and supports up to 128K output tokens, which accommodates long-form generation. The model does not support tool use, reasoning modes, or structured output, so workflows that depend on those features will need to look elsewhere. At $0.80 per million input tokens and $1.00 per million output tokens, it sits in a mid-range price band for a large vision-language model. The practical catch is that there is currently no independent benchmark coverage to validate its real-world performance, so buyers are working without third-party evidence. Teams that need a capable vision and text model at a moderate price and are comfortable running their own evaluations may want to shortlist it; those who require proven benchmark scores before committing should wait for more coverage.
- Model ID
- qwen/qwen2.5-vl-72b-instruct
- Vendor
- qwen
- Tokenizer
- Qwen
- Input Modalities
- text, image
- Output Modalities
- text
- Max Output
- 128,000 tokens
- Tool Calling
- not supported
- Structured Output
- ✓ supported
- Reasoning Mode
- not supported
- Vision
- ✓ accepts images
- Audio
- no
- Moderated
- no