Vision Language Models
Analyze images and documents using powerful vision-language models (VLMs). These multimodal models can understand image content, extract text, answer questions about visuals, and perform complex reasoning tasks, all through the same chat completions API.
Overview
Vision language models combine image understanding with natural language processing to enable:
- Image Understanding: Describe and analyze image content
- Document Analysis: Extract text from documents, receipts, and forms (OCR)
- Visual Q&A: Answer questions about images
- Image-based Reasoning: Perform complex analysis and comparisons
Endpoint
VLMs use the same chat completions endpoint as text models.
Basic Example
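A minimal Python sketch of a basic vision request. The `API_URL` and `API_KEY` values are placeholders you must replace (the real endpoint is not shown on this page), and the message schema follows the OpenAI-style chat completions format, which is an assumption here; the model ID comes from the table below.

```python
import base64
import json
import urllib.request

API_URL = "https://api.example.com/v1/chat/completions"  # placeholder: use your provider's endpoint
API_KEY = "YOUR_API_KEY"  # placeholder

def build_request(image_bytes: bytes, question: str,
                  model: str = "mistralai/Pixtral-12B-2409",
                  mime: str = "image/png") -> dict:
    """Build a chat completions payload with one image and one question."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                # The image travels as a base64 data URL inside the content array
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ],
        }],
    }

def ask(payload: dict) -> dict:
    """POST the payload to the chat completions endpoint (not executed here)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `ask(build_request(open("receipt.png", "rb").read(), "What is the total?"))` would return the usual chat completion response object.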
Image Input Format
Encoding Images
Images must be base64-encoded before sending to the API.
Message Format
When sending images, the `content` field becomes an array of content objects:
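A sketch of an encoding helper together with the resulting content array. The `type`/`image_url` field names follow the OpenAI-style chat format, which is an assumption, as this page's original sample code is missing.

```python
import base64

def encode_image(path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# With an image encoded, `content` is a list mixing text and image parts
# rather than a plain string:
content = [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url",
     "image_url": {"url": "data:image/jpeg;base64,<BASE64_STRING>"}},
]
```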
Limitations
- Supported formats: JPG, PNG
- Maximum resolution: 2048x2048 pixels
- Images per request: 1
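The limits above can be checked locally before a request is sent. A stdlib-only sketch that reads the PNG or JPEG header to recover dimensions; interpreting the 2048x2048 limit as a per-side maximum is an assumption.

```python
import struct

MAX_DIM = 2048            # documented maximum resolution (assumed per side)

def image_info(data: bytes):
    """Return (format, width, height) for PNG or JPEG bytes."""
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        # PNG: width and height are big-endian uint32s inside the IHDR chunk
        w, h = struct.unpack(">II", data[16:24])
        return "png", w, h
    if data[:2] == b"\xff\xd8":  # JPEG start-of-image marker
        i = 2
        while i + 9 < len(data):
            if data[i] != 0xFF:
                i += 1
                continue
            marker = data[i + 1]
            # SOFn markers carry the frame dimensions (skip DHT/JPG/DAC)
            if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
                h, w = struct.unpack(">HH", data[i + 5:i + 9])
                return "jpeg", w, h
            seg_len = struct.unpack(">H", data[i + 2:i + 4])[0]
            i += 2 + seg_len
        raise ValueError("no SOF marker found in JPEG")
    raise ValueError("unsupported format (only JPG and PNG are accepted)")

def validate(data: bytes) -> None:
    """Raise ValueError if the image violates the documented limits."""
    fmt, w, h = image_info(data)
    if w > MAX_DIM or h > MAX_DIM:
        raise ValueError(f"{w}x{h} exceeds the {MAX_DIM}x{MAX_DIM} limit")
```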
Multi-turn Conversations
You can ask follow-up questions about an image by maintaining conversation history:
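A sketch of the history a follow-up request would carry. The assistant's first answer is appended verbatim so it stays in context; whether the image must be resent on later turns is not stated on this page, so omitting it here is an assumption. The assistant reply shown is illustrative, not real output.

```python
history = [
    # Turn 1: the original question, with the image attached
    {"role": "user", "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,<BASE64_STRING>"}},
    ]},
    # The model's answer from turn 1, kept so the model retains context
    {"role": "assistant", "content": "The image shows a grocery store receipt."},
    # Turn 2: a plain-text follow-up about the same image
    {"role": "user", "content": "What is the total amount on it?"},
]
payload = {"model": "Qwen/Qwen2.5-VL-7B-Instruct", "messages": history}
```

Each subsequent turn appends the latest assistant reply and the new user question to `history` before re-posting the payload.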
Available Models
| Model | Model ID | Best For | Price |
|---|---|---|---|
| NVIDIA Nemotron Nano 12B v2 VL | nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 | Document intelligence | $0.20/M tokens |
| Pixtral 12B | mistralai/Pixtral-12B-2409 | Budget-friendly, general use | $0.10/M tokens |
| Qwen2.5-VL-7B-Instruct | Qwen/Qwen2.5-VL-7B-Instruct | Balanced cost/performance | $0.20/M tokens |
| Qwen2.5-VL-72B-Instruct | Qwen/Qwen2.5-VL-72B-Instruct | Best quality, complex analysis | $0.60/M tokens |

