
Performance and Limits

Understand rate limits, service tiers, and infrastructure capabilities for Hyperbolic’s Serverless Inference API.

Rate Limits

Standard Limits

| Tier | Requests/Minute | Requirements |
|------------|-----------------|----------------|
| Basic | 60 | Free account |
| Pro | 600 | $5+ deposit |
| Enterprise | Unlimited | Contact sales |
All tiers have a per-IP limit of 600 requests/minute for DDoS protection.

Model-Specific Limits

Some resource-intensive models have special rate limits:
| Model | Basic | Pro |
|--------------------------|-------------|---------|
| Llama 3.1 405B | 5/min | 120/min |
| Llama 3.1 405B-Instruct | 5/min | 120/min |
| FLUX.1-dev | 1 per 5 min | 50/min |
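To stay under these per-minute caps on the client side, you can gate outgoing requests with a sliding-window limiter before they reach the API. A minimal sketch (the class and parameter names are illustrative, not part of any Hyperbolic SDK):

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls per `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.timestamps: deque = deque()

    def acquire(self) -> None:
        """Block until a request slot is free, then record the request."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Wait until the oldest request leaves the window, then retry.
            time.sleep(self.window_seconds - (now - self.timestamps[0]))
            return self.acquire()
        self.timestamps.append(time.monotonic())


# Example: Basic-tier limit of 60 requests/minute
limiter = SlidingWindowLimiter(max_requests=60, window_seconds=60.0)
```

Call `limiter.acquire()` immediately before each API request; it returns at once while you are under the limit and blocks only when the window is full.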

Upgrading to Pro

Get 10x higher rate limits by upgrading to Pro:
1. Log into Dashboard: go to app.hyperbolic.ai and sign in.
2. Add Funds: deposit $5 or more to your account.
3. Automatic Upgrade: your account is automatically upgraded to the Pro tier.

Service Tiers

| Feature | Basic | Pro | Enterprise |
|---------------------|-------------------|-------------|----------------|
| Rate Limit | 60/min | 600/min | Unlimited |
| Cost | Free | $5+ deposit | Custom |
| Support | Community Discord | Email | 24/7 dedicated |
| Priority Queue | - | Yes | Yes |
| Dedicated Instances | - | - | Yes |
| Custom SLAs | - | - | Yes |
| Fine-tuning | - | - | Yes |
Basic tier includes $1 promotional credit when you verify your phone number.
Need higher limits or dedicated infrastructure? Contact sales.

Pricing Summary

Hyperbolic uses pay-as-you-go pricing with no monthly quotas or commitments.

Text Generation

| Model Category | Price |
|---------------------------|------------------------------|
| Small models (3B-8B) | From $0.10 per 1M tokens |
| Medium models (32B-72B) | $0.20 - $0.40 per 1M tokens |
| Large models (120B-480B) | $0.30 - $4.00 per 1M tokens |
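Because pricing is per million tokens, estimating a job's cost is a single multiplication. A small sketch (the function name is illustrative; pass in the per-million rate for your specific model from the pricing tables):

```python
def token_cost(total_tokens: int, price_per_million: float) -> float:
    """Estimate USD cost for pay-as-you-go token pricing.

    `price_per_million` is the model's rate per 1M tokens, e.g. 0.40
    for a medium model billed at $0.40 per 1M tokens.
    """
    return total_tokens * price_per_million / 1_000_000


# 2M tokens through a $0.40/1M-token medium model:
# token_cost(2_000_000, 0.40) -> 0.80
```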

Image Generation

Base rate: $0.01 per image (1024x1024, 25 steps)
Formula: $0.01 × (width/1024) × (height/1024) × (steps/25)
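The formula above translates directly into code. A minimal sketch for estimating image cost before you submit a request:

```python
def image_cost(width: int, height: int, steps: int) -> float:
    """Apply the documented image-pricing formula:
    $0.01 x (width/1024) x (height/1024) x (steps/25).
    """
    return 0.01 * (width / 1024) * (height / 1024) * (steps / 25)


# The defaults reproduce the base rate:
# image_cost(1024, 1024, 25) -> 0.01
# A 512x512 image at 50 steps costs a quarter of the area at double the steps:
# image_cost(512, 512, 50) -> 0.005
```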

Audio Generation

Rate: $5.00 per 1M characters
See Text APIs, Image APIs, and Audio APIs for complete pricing by model.

Infrastructure

Security

| Feature | Description |
|---------------------|---------------------------------------------|
| Zero Data Retention | Your prompts and responses are never stored |
| Encryption | TLS 1.3 for all API connections |
| Compliance | SOC2 compliance (Enterprise tier) |

Error Handling

Rate Limit Errors

When you exceed rate limits, you’ll receive a 429 Too Many Requests response:
```json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Please retry after X seconds."
  }
}
```

Best Practices

  • Implement exponential backoff for automatic retries
  • Monitor usage via the dashboard to stay within limits
  • Cache responses when appropriate to reduce API calls
  • Use streaming for long responses to improve perceived latency

Retry Example

```python
import time
from openai import RateLimitError

def call_with_retry(func, max_retries=3):
    """Call a function with exponential backoff on rate limit errors."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            wait_time = 2 ** attempt  # 1, 2, 4 seconds
            time.sleep(wait_time)

# Usage (`client` is an OpenAI-compatible client configured for Hyperbolic)
response = call_with_retry(
    lambda: client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}]
    )
)
```

Monitoring Usage

Track your API usage in the Hyperbolic Dashboard:
  • Requests per minute/hour/day
  • Token consumption by model
  • Cost breakdown and billing history
  • Real-time usage graphs

Next Steps