Vision Language Models

Analyze images and documents using powerful vision-language models (VLMs). These multimodal models can understand image content, extract text, answer questions about visuals, and perform complex reasoning tasks—all through the same chat completions API.

Overview

Vision language models combine image understanding with natural language processing to enable:
  • Image Understanding: Describe and analyze image content
  • Document Analysis: Extract text from documents, receipts, and forms (OCR)
  • Visual Q&A: Answer questions about images
  • Image-based Reasoning: Perform complex analysis and comparisons

Endpoint

VLMs use the same chat completions endpoint as text models:
POST https://api.hyperbolic.xyz/v1/chat/completions

Basic Example

import base64
import requests
from PIL import Image
from io import BytesIO

def encode_image(image_path):
    """Encode an image file to base64 string."""
    with Image.open(image_path) as img:
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

# Encode your image
base64_image = encode_image("path/to/your/image.jpg")

url = "https://api.hyperbolic.xyz/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY"
}
data = {
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                }
            ]
        }
    ],
    "max_tokens": 512,
    "temperature": 0.1
}

response = requests.post(url, headers=headers, json=data)
print(response.json()["choices"][0]["message"]["content"])
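
In production code it's worth checking the HTTP status before indexing into the response body, since a 4xx/5xx error would otherwise surface as a confusing KeyError. A minimal sketch reusing the url, headers, and data from the example above (the 60-second timeout is an arbitrary choice):
response = requests.post(url, headers=headers, json=data, timeout=60)
response.raise_for_status()  # raise immediately on 4xx/5xx responses
result = response.json()
print(result["choices"][0]["message"]["content"])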

Image Input Format

Encoding Images

Images must be base64-encoded before sending to the API. Here’s a helper function:
import base64
from PIL import Image
from io import BytesIO

def encode_image(image_path):
    """Encode an image file to base64 string."""
    with Image.open(image_path) as img:
        # Resize if larger than max resolution
        max_size = (2048, 2048)
        img.thumbnail(max_size, Image.Resampling.LANCZOS)
        
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

Message Format

When sending images, the content field becomes an array of content objects:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail."},
        {
          "type": "image_url",
          "image_url": {"url": "data:image/png;base64,{base64_string}"}
        }
      ]
    }
  ]
}
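
If you build these payloads repeatedly, a small helper keeps the structure consistent. This is a convenience sketch, not part of the API; the function name and media_type parameter are illustrative:
def build_image_message(text, base64_image, media_type="png"):
    """Build a user message containing text plus one base64-encoded image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/{media_type};base64,{base64_image}"}
            }
        ]
    }

# Usage:
messages = [build_image_message("Describe this image in detail.", base64_image)]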

Limitations

  • Supported formats: JPG, PNG
  • Maximum resolution: 2048x2048 pixels
  • Images per request: 1
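
A quick pre-flight check can reject unsupported files before you spend tokens. A minimal sketch of these limits (the function name is illustrative; note that Pillow reports JPG files as "JPEG"):
from PIL import Image

def check_image(image_path, max_dim=2048):
    """Validate an image against the documented limits before encoding."""
    with Image.open(image_path) as img:
        if img.format not in ("JPEG", "PNG"):
            raise ValueError(f"Unsupported format: {img.format} (use JPG or PNG)")
        if img.width > max_dim or img.height > max_dim:
            raise ValueError(f"{img.width}x{img.height} exceeds the {max_dim}x{max_dim} limit")

Note that the encode_image helper above sidesteps the resolution limit by downsizing with thumbnail; this check is the stricter alternative of rejecting oversized files outright.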

Multi-turn Conversations

You can ask follow-up questions about an image by maintaining conversation history:
import base64
import requests
from PIL import Image
from io import BytesIO

def encode_image(image_path):
    """Encode an image file to a base64 string (same helper as in the basic example)."""
    with Image.open(image_path) as img:
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

base64_image = encode_image("receipt.jpg")

url = "https://api.hyperbolic.xyz/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY"
}

# First turn: send the image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What items are on this receipt?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{base64_image}"}
            }
        ]
    }
]

response = requests.post(url, headers=headers, json={
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": messages,
    "max_tokens": 512
})

assistant_response = response.json()["choices"][0]["message"]["content"]
print("First response:", assistant_response)

# Second turn: follow-up question (the image stays in the message history, so you don't re-attach it)
messages.append({"role": "assistant", "content": assistant_response})
messages.append({"role": "user", "content": "What is the total amount?"})

response = requests.post(url, headers=headers, json={
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": messages,
    "max_tokens": 256
})

print("Follow-up response:", response.json()["choices"][0]["message"]["content"])

Available Models

Model | Model ID | Best For | Price
NVIDIA Nemotron Nano 12B v2 VL | nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 | Document intelligence | $0.20/M tokens
Pixtral 12B | mistralai/Pixtral-12B-2409 | Budget-friendly, general use | $0.10/M tokens
Qwen2.5-VL-7B-Instruct | Qwen/Qwen2.5-VL-7B-Instruct | Balanced cost/performance | $0.20/M tokens
Qwen2.5-VL-72B-Instruct | Qwen/Qwen2.5-VL-72B-Instruct | Best quality, complex analysis | $0.60/M tokens

Model Recommendations

Choosing the right model:
  • Best quality: Qwen2.5-VL-72B-Instruct for complex analysis and detailed understanding
  • Best value: Pixtral 12B at $0.10/M tokens for general image tasks
  • Document analysis: NVIDIA Nemotron Nano for OCR, forms, and document intelligence
  • Balanced: Qwen2.5-VL-7B-Instruct for good performance at moderate cost
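
If you route requests programmatically, these recommendations reduce to a simple lookup. A sketch using the model IDs from the table above (the task keys are illustrative):
MODEL_BY_TASK = {
    "complex_analysis": "Qwen/Qwen2.5-VL-72B-Instruct",
    "general": "mistralai/Pixtral-12B-2409",
    "documents": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    "balanced": "Qwen/Qwen2.5-VL-7B-Instruct",
}

data["model"] = MODEL_BY_TASK["documents"]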

Use Cases

Document Analysis

Extract text and structured data from documents, receipts, and forms:
data = {
    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text from this receipt and format it as a list with item names and prices."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
        ]
    }],
    "max_tokens": 1024
}

Image Captioning

Generate detailed descriptions of images:
data = {
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail, including colors, objects, and any text visible."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
        ]
    }],
    "max_tokens": 512
}

Visual Q&A

Ask specific questions about image content:
data = {
    "model": "mistralai/Pixtral-12B-2409",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "How many people are in this photo? What are they doing?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
        ]
    }],
    "max_tokens": 256
}
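
Each of these payloads is sent the same way as the basic example:
response = requests.post(url, headers=headers, json=data)
print(response.json()["choices"][0]["message"]["content"])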

Next Steps