Meta: Llama Guard 4 12B
Llama Guard 4 12B is a multimodal model from Meta that accepts both image and text inputs, with a context window of 163,840 tokens and a maximum output of 16,384 tokens. It does not support tool use, reasoning modes, or structured output, which positions it as a focused inference model rather than a general-purpose agent. Its design suggests a content safety or classification orientation, though the scope of tasks it handles well cannot be confirmed from available data alone. At $0.18 per million tokens for both input and output, the pricing is competitive for a multimodal model, but there is currently no independent benchmark coverage to validate performance claims. Buyers who need a low-cost option for image-and-text processing pipelines may want to shortlist it, but without benchmark scores, any quality assumptions remain unproven. Teams with strict performance requirements should treat this model as unvalidated until third-party evaluations are available.
- Model ID
- meta-llama/llama-guard-4-12b
- Vendor
- meta-llama
- Tokenizer
- Other
- Input Modalities
- image, text
- Output Modalities
- text
- Max Output
- 16,384 tokens
- Tool Calling
- not supported
- Structured Output
- ✓ supported
- Reasoning Mode
- not supported
- Vision
- ✓ accepts images
- Audio
- no
- Moderated
- no