Master the full lifecycle of GPU instance management on Hyperbolic - from creation to termination, including monitoring, scaling, and troubleshooting.
This guide covers managing instances through the Hyperbolic Web UI. For programmatic access and API support, please contact our enterprise sales team.
Creating Instances
Web UI Method
Navigate to On-Demand GPU platform
Select GPU Configuration
Choose a GPU type (H100 80GB or H200 141GB)
Select quantity
Pick a region for optimal latency
Choose InfiniBand if needed for multi-GPU workloads
Configure Instance
Storage: Allocate the disk capacity your workload requires
Label: Name your instance for easy identification
SSH Keys (Optional): Add SSH keys for secure access to your instance
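If you don't already have a key pair, you can generate one locally and add the public key in the Web UI; a minimal sketch (the filename hyperbolic_key is just an example):
# Generate a new key pair locally
ssh-keygen -t ed25519 -f ~/.ssh/hyperbolic_key -C "hyperbolic"
# Print the public key so you can add it in the Web UI
cat ~/.ssh/hyperbolic_key.pub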
Launch
Review pricing and click “Start Building”. The instance is typically ready in a few minutes, though provisioning may take up to 25 minutes depending on configuration and region.
Connecting to Instances
SSH Connection
Basic SSH connection:
# Standard connection
ssh ubuntu@<instance-ip>
# With specific key
ssh -i ~/.ssh/hyperbolic_key ubuntu@<instance-ip>
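To avoid retyping the IP address and key path, you can add a host alias to your local SSH config; a sketch, assuming the example alias and key path shown:
# Append a host alias to your local SSH config (the alias name is an example)
cat >> ~/.ssh/config <<'EOF'
Host hyperbolic-gpu
    HostName <instance-ip>
    User ubuntu
    IdentityFile ~/.ssh/hyperbolic_key
EOF
# Connect using the alias
ssh hyperbolic-gpu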
File Transfer
# Upload file
scp model.pth ubuntu@<instance-ip>:/home/ubuntu/
# Download file
scp ubuntu@<instance-ip>:/home/ubuntu/results.csv ./
# Upload directory
scp -r dataset/ ubuntu@<instance-ip>:/home/ubuntu/
# Sync directory (upload)
rsync -avz --progress dataset/ ubuntu@<instance-ip>:/home/ubuntu/dataset/
# Sync with deletion of removed files
rsync -avz --delete --progress local/ ubuntu@<instance-ip>:/home/ubuntu/remote/
# Resume interrupted transfer
rsync -avz --partial --progress large_file.tar ubuntu@<instance-ip>:/data/
# Interactive SFTP session
sftp ubuntu@<instance-ip>
# SFTP commands
sftp> put model.pth
sftp> get -r results/
sftp> ls
sftp> pwd
sftp> exit
Instance Lifecycle Management
Terminating Instances
Termination is permanent. All data on the instance will be lost. Always back up important data before terminating; a quick backup sketch follows the steps below.
To terminate an instance:
Go to “My Instances” in the Web UI
Click “Terminate” in the instance actions
Confirm the deletion
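Before confirming, it is worth pulling anything you still need off the instance; a minimal sketch using tar and scp (the paths are examples):
# Archive the directories you care about on the instance
ssh ubuntu@<instance-ip> "tar czf /home/ubuntu/backup.tar.gz models/ results/"
# Copy the archive to your local machine before terminating
scp ubuntu@<instance-ip>:/home/ubuntu/backup.tar.gz ./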
Monitoring and Metrics
Real-Time GPU Monitoring
SSH into your instance and use these commands:
# Basic GPU status
nvidia-smi
# Continuous monitoring (refreshes every second)
watch -n 1 nvidia-smi
# Detailed GPU metrics
nvidia-smi -q
# Show running processes
nvidia-smi pmon
# GPU utilization over time
nvidia-smi dmon
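If you want a record of GPU activity over time rather than a live view, nvidia-smi can also log selected fields to CSV; a sketch (the field list and 5-second interval are just examples):
# Log timestamp, utilization, and memory use every 5 seconds to a CSV file
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
  --format=csv -l 5 >> gpu_log.csv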
System Monitoring
# System resources
htop
# Disk usage
df -h
# Memory usage
free -h
# Network statistics
sudo nethogs # Requires root; install with: sudo apt install nethogs
# Process monitoring
ps aux | grep python
Setting Up Custom Monitoring
Prometheus + Grafana
Weights & Biases
# Install DCGM exporter for GPU metrics
docker run -d --gpus all --rm -p 9400:9400 nvidia/dcgm-exporter:latest
# Install Prometheus (bind-mount the config from the current directory)
docker run -d -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus
# Install Grafana
docker run -d -p 3000:3000 grafana/grafana
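Prometheus needs a scrape configuration that points at the DCGM exporter; before starting the Prometheus container above, you could create a minimal prometheus.yml like this sketch (the job name, interval, and target are assumptions to adjust for your setup):
# Write a minimal scrape config for the DCGM exporter
cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: dcgm
    scrape_interval: 15s
    static_configs:
      - targets: ['<instance-ip>:9400']
EOF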
import wandb
import torch

# Initialize W&B
wandb.init(project="hyperbolic-training")

# Log GPU metrics automatically
wandb.watch(model)

# Custom GPU logging
for epoch in range(num_epochs):
    gpu_utilization = torch.cuda.utilization()
    gpu_memory = torch.cuda.memory_allocated() / 1024 ** 3
    wandb.log({
        "gpu_utilization": gpu_utilization,
        "gpu_memory_gb": gpu_memory,
        "epoch": epoch
    })
Managing Multiple Instances
Viewing All Instances
Go to app.hyperbolic.ai/instances to view all your instances with:
Instance status
GPU type and configuration
Region
Running time and costs
Quick action buttons
Batch Operations
Parallel SSH Commands:
# Run command on multiple instances
for ip in 192.168.1.10 192.168.1.11 192.168.1.12; do
  ssh ubuntu@$ip "nvidia-smi" &
done
wait
# Using GNU parallel
parallel -j 4 ssh ubuntu@{} "python train.py" ::: \
instance1.hyperbolic.ai \
instance2.hyperbolic.ai \
instance3.hyperbolic.ai
Using Ansible:
# inventory.yml
all:
  hosts:
    h100-1:
      ansible_host: 192.168.1.10
    h100-2:
      ansible_host: 192.168.1.11
    h100-3:
      ansible_host: 192.168.1.12
  vars:
    ansible_user: ubuntu
    ansible_ssh_private_key_file: ~/.ssh/hyperbolic_key
# Run command on all instances
ansible all -i inventory.yml -m shell -a "nvidia-smi"
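Ad-hoc commands also cover distributing files before a run; a sketch using Ansible's built-in copy module (the file paths are examples):
# Copy a training script to every instance
ansible all -i inventory.yml -m copy -a "src=train.py dest=/home/ubuntu/train.py"
# Check free disk space everywhere
ansible all -i inventory.yml -m shell -a "df -h /"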
Instance States and Troubleshooting
Instance States
| State | Description | Billing |
| --- | --- | --- |
| Pending | Instance is being provisioned | No |
| Running | Instance is active and accessible | Yes |
| Starting | Instance is booting up | No |
| Terminating | Instance is being deleted | No |
| Failed | Instance failed to start | No |
Common Issues and Solutions
Instance stuck in 'Pending' state
Solution:
Wait up to 25 minutes for provisioning
Check region availability in the dashboard
Contact support if pending > 30 minutes
Check the instance status in your dashboard for updates.
Cannot connect to the instance over SSH
Solution:
Verify the instance is in “Running” state
Check that your SSH key is authorized
Confirm the IP address is correct
Test network connectivity
# Debug SSH connection
ssh -vvv ubuntu@<instance-ip>
GPU not available in instance
Solution:
Verify GPU driver is loaded
Check Docker runtime configuration
Restart instance if needed
# Check GPU availability
nvidia-smi
# Check driver
nvidia-smi -q | grep "Driver Version"
# For Docker containers
docker run --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
Instance automatically terminated
Possible causes:
Insufficient account balance
Violation of terms of service
Hardware failure (rare)
Solution:
Check billing status in your dashboard
Review instance logs
Contact support for clarification
Still having issues? Contact [email protected] or use the Intercom chat widget for immediate assistance.
Best Practices
Use Instance Labels Name your instances clearly to track different projects and experiments.
Implement Auto-shutdown Set up scripts to automatically terminate idle instances to avoid unnecessary charges.
Regular Backups Schedule regular backups of important data to external storage (S3, GCS, etc.); see the scheduling sketch after this list.
Monitor Costs Set up billing alerts and regularly review instance usage to optimize costs.
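For the backup practice above, one lightweight approach is a cron entry on the instance that periodically syncs checkpoints to external storage; a sketch, assuming the AWS CLI setup shown later in this guide and example paths:
# Sync models to S3 at the top of every hour (bucket and paths are examples)
(crontab -l 2>/dev/null; echo "0 * * * * aws s3 sync /home/ubuntu/models/ s3://my-bucket/models/") | crontab -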
Auto-shutdown Script Example
#!/usr/bin/env python3
import subprocess
import time
import sys

def get_gpu_utilization():
    """Return the highest GPU utilization percentage across all GPUs."""
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader,nounits'],
        capture_output=True,
        text=True
    )
    # nvidia-smi prints one value per GPU; treat the instance as busy if any GPU is busy
    return max(int(value) for value in result.stdout.split())

def auto_shutdown(idle_minutes=30, threshold=5):
    """Shut down the instance if the GPUs stay idle for the specified number of minutes."""
    idle_count = 0
    check_interval = 60  # Check every minute

    while True:
        util = get_gpu_utilization()
        if util < threshold:
            idle_count += 1
            print(f"GPU idle ({util}%). Idle count: {idle_count}/{idle_minutes}")
            if idle_count >= idle_minutes:
                print("Shutting down due to inactivity...")
                subprocess.run(['sudo', 'shutdown', '-h', 'now'])
                sys.exit(0)
        else:
            idle_count = 0
            print(f"GPU active ({util}%). Reset idle counter.")
        time.sleep(check_interval)

if __name__ == "__main__":
    auto_shutdown(idle_minutes=30, threshold=5)
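To keep the script running after you disconnect, you could launch it with nohup (the script path and log file are examples):
# Run the auto-shutdown script in the background so it survives SSH disconnects
nohup python3 /home/ubuntu/auto_shutdown.py > /home/ubuntu/auto_shutdown.log 2>&1 &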
Advanced Configuration
Custom Startup Scripts
Create a startup script that runs when the instance boots:
#!/bin/bash
# /home/ubuntu/startup.sh
# Mount additional storage
sudo mount /dev/nvme1n1 /data
# Start Jupyter
jupyter notebook --no-browser --port=8888 &
# Start TensorBoard
tensorboard --logdir=/home/ubuntu/logs --port=6006 &
# Start monitoring
python /home/ubuntu/monitor_gpu.py &
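The script above does not run by itself at boot; one lightweight way to hook it in is a crontab @reboot entry, sketched here:
# Make the startup script executable and run it on every boot via cron
chmod +x /home/ubuntu/startup.sh
(crontab -l 2>/dev/null; echo "@reboot /home/ubuntu/startup.sh") | crontab -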
Data Persistence Strategies
# Install AWS CLI
sudo apt-get update
sudo apt-get install awscli -y
# Configure AWS credentials
aws configure
# Sync data to S3
aws s3 sync /home/ubuntu/models/ s3://my-bucket/models/
# Download from S3
aws s3 sync s3://my-bucket/dataset/ /home/ubuntu/dataset/
# Install Git LFS
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
# Initialize Git LFS
git lfs install
# Track large files
git lfs track "*.pth"
git lfs track "*.h5"
# Push to repository
git add .
git commit -m "Save model checkpoint"
git push
Next Steps