Master the full lifecycle of GPU instance management on Hyperbolic - from creation to termination, including monitoring, scaling, and troubleshooting.
This guide covers managing instances through the Hyperbolic Web UI. For programmatic access and API support, please contact our enterprise sales team.
Creating Instances
Web UI Method
Navigate to On-Demand GPU platform
Select GPU Configuration
Choose a GPU type (H100 80GB or H200 141GB)
Select quantity
Pick a region for optimal latency
Choose InfiniBand if needed for multi-GPU workloads
Configure Instance
Storage: Allocate the disk capacity your workload requires
Label: Name your instance for easy identification
SSH Keys (Optional): Add SSH keys for secure access to your instance
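If you don't already have a key pair, you can generate one locally and add the public key in the Web UI; a minimal sketch (the filename hyperbolic_key is just an example):
# Generate a new key pair locally
ssh-keygen -t ed25519 -f ~/.ssh/hyperbolic_key -C "hyperbolic"
# Print the public key so you can add it in the Web UI
cat ~/.ssh/hyperbolic_key.pub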
Launch
Review pricing and click “Start Building”. The instance is typically ready in a few minutes, though provisioning may take up to 25 minutes depending on configuration and region.
Connecting to Instances
SSH Connection
Basic SSH connection:
# Standard connection
ssh ubuntu@<instance-ip>
# With specific key
ssh -i ~/.ssh/hyperbolic_key ubuntu@<instance-ip>
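To avoid retyping the IP address and key path, you can add a host alias to your local SSH config; a sketch, assuming the example alias and key path shown:
# Append a host alias to your local SSH config (the alias name is an example)
cat >> ~/.ssh/config <<'EOF'
Host hyperbolic-gpu
    HostName <instance-ip>
    User ubuntu
    IdentityFile ~/.ssh/hyperbolic_key
EOF
# Connect using the alias
ssh hyperbolic-gpu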
File Transfer
# Upload file
scp model.pth ubuntu@<instance-ip>:/home/ubuntu/
# Download file
scp ubuntu@<instance-ip>:/home/ubuntu/results.csv ./
# Upload directory
scp -r dataset/ ubuntu@<instance-ip>:/home/ubuntu/
# Sync directory (upload)
rsync -avz --progress dataset/ ubuntu@<instance-ip>:/home/ubuntu/dataset/
# Sync with deletion of removed files
rsync -avz --delete --progress local/ ubuntu@<instance-ip>:/home/ubuntu/remote/
# Resume interrupted transfer
rsync -avz --partial --progress large_file.tar ubuntu@<instance-ip>:/data/
# Interactive SFTP session
sftp ubuntu@<instance-ip>
# SFTP commands
sftp> put model.pth
sftp> get -r results/
sftp> ls
sftp> pwd
sftp> exit
Instance Lifecycle Management
Terminating Instances
Termination is permanent. All data on the instance will be lost. Always back up important data before terminating; a quick backup sketch follows the steps below.
To terminate an instance:
Go to “My Instances” in the Web UI
Click “Terminate” in the instance actions
Confirm the deletion
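Before confirming, it is worth pulling anything you still need off the instance; a minimal sketch using tar and scp (the paths are examples):
# Archive the directories you care about on the instance
ssh ubuntu@<instance-ip> "tar czf /home/ubuntu/backup.tar.gz models/ results/"
# Copy the archive to your local machine before terminating
scp ubuntu@<instance-ip>:/home/ubuntu/backup.tar.gz ./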
Monitoring and Metrics
Real-Time GPU Monitoring
SSH into your instance and use these commands:
# Basic GPU status
nvidia-smi
# Continuous monitoring (refreshes every second)
watch -n 1 nvidia-smi
# Detailed GPU metrics
nvidia-smi -q
# Show running processes
nvidia-smi pmon
# GPU utilization over time
nvidia-smi dmon
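If you want a record of GPU activity over time rather than a live view, nvidia-smi can also log selected fields to CSV; a sketch (the field list and 5-second interval are just examples):
# Log timestamp, utilization, and memory use every 5 seconds to a CSV file
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
  --format=csv -l 5 >> gpu_log.csv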
System Monitoring
# System resources
htop
# Disk usage
df -h
# Memory usage
free -h
# Network statistics
sudo nethogs # Requires root; install with: sudo apt install nethogs
# Process monitoring
ps aux | grep python
Setting Up Custom Monitoring
Prometheus + Grafana
Weights & Biases
# Install DCGM exporter for GPU metrics
docker run -d --gpus all --rm -p 9400:9400 nvidia/dcgm-exporter:latest
# Install Prometheus (bind-mount the config from the current directory)
docker run -d -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus
# Install Grafana
docker run -d -p 3000:3000 grafana/grafana
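Prometheus needs a scrape configuration that points at the DCGM exporter; before starting the Prometheus container above, you could create a minimal prometheus.yml like this sketch (the job name, interval, and target are assumptions to adjust for your setup):
# Write a minimal scrape config for the DCGM exporter
cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: dcgm
    scrape_interval: 15s
    static_configs:
      - targets: ['<instance-ip>:9400']
EOF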
import wandb
import torch

# Initialize W&B
wandb.init(project="hyperbolic-training")

# Log GPU metrics automatically
wandb.watch(model)

# Custom GPU logging
for epoch in range(num_epochs):
    gpu_utilization = torch.cuda.utilization()
    gpu_memory = torch.cuda.memory_allocated() / 1024 ** 3
    wandb.log({
        "gpu_utilization": gpu_utilization,
        "gpu_memory_gb": gpu_memory,
        "epoch": epoch
    })
Managing Multiple Instances
Viewing All Instances
Go to app.hyperbolic.ai/instances to view all your instances with:
Instance status
GPU type and configuration
Region
Running time and costs
Quick action buttons
Batch Operations
Parallel SSH Commands:
# Run command on multiple instances
for ip in 192.168.1.10 192.168.1.11 192.168.1.12; do
  ssh ubuntu@$ip "nvidia-smi" &
done
wait
# Using GNU parallel
parallel -j 4 ssh ubuntu@{} "python train.py" ::: \
instance1.hyperbolic.ai \
instance2.hyperbolic.ai \
instance3.hyperbolic.ai
Using Ansible:
# inventory.yml
all:
  hosts:
    h100-1:
      ansible_host: 192.168.1.10
    h100-2:
      ansible_host: 192.168.1.11
    h100-3:
      ansible_host: 192.168.1.12
  vars:
    ansible_user: ubuntu
    ansible_ssh_private_key_file: ~/.ssh/hyperbolic_key
# Run command on all instances
ansible all -i inventory.yml -m shell -a "nvidia-smi"
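Ad-hoc commands also cover distributing files before a run; a sketch using Ansible's built-in copy module (the file paths are examples):
# Copy a training script to every instance
ansible all -i inventory.yml -m copy -a "src=train.py dest=/home/ubuntu/train.py"
# Check free disk space everywhere
ansible all -i inventory.yml -m shell -a "df -h /"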
Instance States and Troubleshooting
Instance States
| State | Description | Billing |
| --- | --- | --- |
| Pending | Instance is being provisioned | No |
| Running | Instance is active and accessible | Yes |
| Starting | Instance is booting up | No |
| Terminating | Instance is being deleted | No |
| Failed | Instance failed to start | No |
Common Issues and Solutions
Instance stuck in 'Pending' state
Solution:
Wait up to 25 minutes for provisioning
Check region availability in the dashboard
Contact support if pending > 30 minutes
Check the instance status in your dashboard for updates.
Cannot connect to the instance over SSH
Solution:
Verify the instance is in “Running” state
Check that your SSH key is authorized
Confirm the IP address is correct
Test network connectivity
# Debug SSH connection
ssh -vvv ubuntu@<instance-ip>
GPU not available in instance
Solution:
Verify GPU driver is loaded
Check Docker runtime configuration
Restart instance if needed
# Check GPU availability
nvidia-smi
# Check driver
nvidia-smi -q | grep "Driver Version"
# For Docker containers
docker run --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
Instance automatically terminated
Possible causes:
Insufficient account balance
Violation of terms of service
Hardware failure (rare)
Solution:
Check billing status in your dashboard
Review instance logs
Contact support for clarification
Still having issues? Contact [email protected] or use the Intercom chat widget for immediate assistance.
Best Practices
Use Instance Labels Name your instances clearly to track different projects and experiments.
Implement Auto-shutdown Set up scripts to automatically terminate idle instances to avoid unnecessary charges.
Regular Backups Schedule regular backups of important data to external storage (S3, GCS, etc.); see the scheduling sketch after this list.
Monitor Costs Set up billing alerts and regularly review instance usage to optimize costs.
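For the backup practice above, one lightweight approach is a cron entry on the instance that periodically syncs checkpoints to external storage; a sketch, assuming the AWS CLI setup shown later in this guide and example paths:
# Sync models to S3 at the top of every hour (bucket and paths are examples)
(crontab -l 2>/dev/null; echo "0 * * * * aws s3 sync /home/ubuntu/models/ s3://my-bucket/models/") | crontab -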
Auto-shutdown Script Example
#!/usr/bin/env python3
import subprocess
import time
import sys

def get_gpu_utilization():
    """Return the highest GPU utilization percentage across all GPUs."""
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader,nounits'],
        capture_output=True,
        text=True
    )
    # nvidia-smi prints one value per GPU; treat the instance as busy if any GPU is busy
    return max(int(value) for value in result.stdout.split())

def auto_shutdown(idle_minutes=30, threshold=5):
    """Shut down the instance if the GPUs stay idle for the specified number of minutes."""
    idle_count = 0
    check_interval = 60  # Check every minute

    while True:
        util = get_gpu_utilization()
        if util < threshold:
            idle_count += 1
            print(f"GPU idle ({util}%). Idle count: {idle_count}/{idle_minutes}")
            if idle_count >= idle_minutes:
                print("Shutting down due to inactivity...")
                subprocess.run(['sudo', 'shutdown', '-h', 'now'])
                sys.exit(0)
        else:
            idle_count = 0
            print(f"GPU active ({util}%). Reset idle counter.")
        time.sleep(check_interval)

if __name__ == "__main__":
    auto_shutdown(idle_minutes=30, threshold=5)
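To keep the script running after you disconnect, you could launch it with nohup (the script path and log file are examples):
# Run the auto-shutdown script in the background so it survives SSH disconnects
nohup python3 /home/ubuntu/auto_shutdown.py > /home/ubuntu/auto_shutdown.log 2>&1 &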
Advanced Configuration
Custom Startup Scripts
Create a startup script that runs when the instance boots:
#!/bin/bash
# /home/ubuntu/startup.sh
# Mount additional storage
sudo mount /dev/nvme1n1 /data
# Start Jupyter
jupyter notebook --no-browser --port=8888 &
# Start TensorBoard
tensorboard --logdir=/home/ubuntu/logs --port=6006 &
# Start monitoring
python /home/ubuntu/monitor_gpu.py &
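The script above does not run by itself at boot; one lightweight way to hook it in is a crontab @reboot entry, sketched here:
# Make the startup script executable and run it on every boot via cron
chmod +x /home/ubuntu/startup.sh
(crontab -l 2>/dev/null; echo "@reboot /home/ubuntu/startup.sh") | crontab -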
Data Persistence Strategies
# Install AWS CLI
sudo apt-get update
sudo apt-get install awscli -y
# Configure AWS credentials
aws configure
# Sync data to S3
aws s3 sync /home/ubuntu/models/ s3://my-bucket/models/
# Download from S3
aws s3 sync s3://my-bucket/dataset/ /home/ubuntu/dataset/
# Install Git LFS
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
# Initialize Git LFS
git lfs install
# Track large files
git lfs track "*.pth"
git lfs track "*.h5"
# Push to repository
git add .
git commit -m "Save model checkpoint"
git push
Next Steps