Get Cluster Details

Overview

The Get Cluster Details endpoint returns detailed information about a specific GPU cluster, including real-time status, performance metrics, network configuration, cost information, and connection details. Use it to monitor and manage individual clusters.

Endpoint

GET https://api.tensorone.ai/v1/clusters/{cluster_id}

Path Parameters

Parameter	Type	Required	Description
`cluster_id`	string	Yes	Unique cluster identifier

Query Parameters

Parameter	Type	Required	Description
`include_metrics`	boolean	No	Include real-time performance metrics (default: true)
`include_logs`	boolean	No	Include recent log entries (default: false)
`include_cost_breakdown`	boolean	No	Include detailed cost breakdown (default: true)
`metrics_window`	string	No	Metrics time window: `1h`, `6h`, `24h`, `7d` (default: `1h`)
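
The query parameters above can be combined into a request URL as in the following sketch. The helper name `build_cluster_url` is hypothetical and not part of any official client. Booleans are written as lowercase `true`/`false` to match the curl examples on this page; whether the API accepts other spellings is not stated here.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.tensorone.ai/v1/clusters"
VALID_WINDOWS = {"1h", "6h", "24h", "7d"}


def build_cluster_url(cluster_id, include_metrics=None, include_logs=None,
                      include_cost_breakdown=None, metrics_window=None):
    """Build the GET URL for one cluster; omitted options are left to the API defaults."""
    if metrics_window is not None and metrics_window not in VALID_WINDOWS:
        raise ValueError(f"metrics_window must be one of {sorted(VALID_WINDOWS)}")

    params = {}
    flags = {
        "include_metrics": include_metrics,
        "include_logs": include_logs,
        "include_cost_breakdown": include_cost_breakdown,
    }
    for name, value in flags.items():
        if value is not None:
            # Serialize booleans as lowercase strings ("true"/"false").
            params[name] = "true" if value else "false"
    if metrics_window is not None:
        params["metrics_window"] = metrics_window

    url = f"{BASE_URL}/{quote(cluster_id, safe='')}"
    return f"{url}?{urlencode(params)}" if params else url


print(build_cluster_url("cluster_abc123", include_logs=True, metrics_window="24h"))
```

Leaving an option as `None` keeps it out of the URL, so the documented defaults apply.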

Request Examples

# Get basic cluster information
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"

# Get cluster with detailed metrics and cost breakdown
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123?include_metrics=true&include_cost_breakdown=true&metrics_window=24h" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"

# Get cluster with recent logs
curl -X GET "https://api.tensorone.ai/v1/clusters/cluster_abc123?include_logs=true" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"

Response Schema

{
  "success": true,
  "data": {
    "id": "cluster_abc123",
    "name": "ml-training-cluster",
    "description": "High-performance cluster for LLM training",
    "status": "running",
    "status_details": {
      "message": "Cluster is running normally",
      "last_status_change": "2024-01-15T14:35:00Z",
      "health_checks": {
        "gpu_health": "healthy",
        "storage_health": "healthy",
        "network_health": "healthy",
        "docker_health": "healthy"
      }
    },
    "configuration": {
      "gpu_type": "A100",
      "gpu_count": 4,
      "cpu_cores": 32,
      "memory_gb": 256,
      "storage_gb": 1000,
      "storage_type": "nvme",
      "region": "us-west-2",
      "availability_zone": "us-west-2a"
    },
    "project_info": {
      "project_id": "proj_456",
      "project_name": "ML Research Team",
      "owner_id": "user_789",
      "owner_email": "researcher@company.com"
    },
    "template_info": {
      "template_id": "tmpl_pytorch_latest",
      "template_name": "PyTorch 2.1 with CUDA 11.8",
      "template_version": "v2.1.0",
      "docker_image": "tensorone/pytorch:2.1-cuda11.8"
    },
    "network": {
      "private_ip": "10.0.1.15",
      "public_ip": "203.0.113.42",
      "proxy_url": "https://cluster-abc123.tensorone.ai",
      "ssh_connection": {
        "host": "ssh-abc123.tensorone.ai",
        "port": 22,
        "username": "root",
        "status": "connected",
        "connection_string": "ssh root@ssh-abc123.tensorone.ai",
        "last_connection": "2024-01-15T15:42:00Z"
      },
      "port_mappings": [
        {
          "internal_port": 8080,
          "external_port": 32001,
          "protocol": "tcp",
          "description": "Web Application",
          "url": "https://cluster-abc123.tensorone.ai:32001",
          "status": "active"
        },
        {
          "internal_port": 6006,
          "external_port": 32002,
          "protocol": "tcp",
          "description": "TensorBoard",
          "url": "https://cluster-abc123.tensorone.ai:32002",
          "status": "active"
        }
      ],
      "security_groups": ["sg_ml_training", "sg_ssh_access"],
      "firewall_rules": [
        {
          "direction": "inbound",
          "protocol": "tcp",
          "port_range": "22",
          "source": "0.0.0.0/0",
          "description": "SSH Access"
        }
      ]
    },
    "metrics": {
      "current": {
        "timestamp": "2024-01-15T15:45:00Z",
        "gpu_utilization": 87.3,
        "gpu_memory_utilization": 94.2,
        "cpu_utilization": 52.1,
        "memory_utilization": 68.4,
        "storage_utilization": 45.2,
        "network_rx_mbps": 125.3,
        "network_tx_mbps": 89.7,
        "temperature_celsius": 72.5,
        "power_usage_watts": 1250
      },
      "historical": {
        "window": "24h",
        "gpu_utilization": {
          "avg": 82.1,
          "min": 15.3,
          "max": 98.7,
          "trend": "increasing"
        },
        "memory_utilization": {
          "avg": 65.4,
          "min": 12.1,
          "max": 94.2,
          "trend": "stable"
        },
        "cost_efficiency": {
          "utilization_score": 85.2,
          "cost_per_compute_hour": 8.95
        }
      },
      "alerts": [
        {
          "type": "high_gpu_utilization",
          "severity": "info",
          "message": "GPU utilization consistently above 85%",
          "triggered_at": "2024-01-15T15:30:00Z"
        }
      ]
    },
    "cost": {
      "current_hourly_rate": 8.50,
      "currency": "USD",
      "session_cost": 68.25,
      "total_lifetime_cost": 284.75,
      "cost_breakdown": {
        "gpu_cost": 6.80,
        "cpu_cost": 0.85,
        "memory_cost": 0.45,
        "storage_cost": 0.25,
        "network_cost": 0.15
      },
      "billing_period": {
        "start": "2024-01-15T14:35:00Z",
        "current": "2024-01-15T15:45:00Z",
        "duration_hours": 1.17
      },
      "cost_projections": {
        "daily_estimate": 204.00,
        "weekly_estimate": 1428.00,
        "monthly_estimate": 6120.00
      }
    },
    "storage": {
      "volumes": [
        {
          "name": "root",
          "size_gb": 100,
          "used_gb": 45,
          "mount_path": "/",
          "type": "nvme",
          "encrypted": true
        },
        {
          "name": "data",
          "size_gb": 900,
          "used_gb": 230,
          "mount_path": "/data",
          "type": "nvme",
          "encrypted": true
        }
      ],
      "snapshots": [
        {
          "id": "snap_123",
          "name": "pre_training_snapshot",
          "size_gb": 45,
          "created_at": "2024-01-15T14:00:00Z"
        }
      ]
    },
    "environment": {
      "variables": {
        "CUDA_VISIBLE_DEVICES": "0,1,2,3",
        "NCCL_SOCKET_IFNAME": "eth0",
        "PYTHONPATH": "/workspace"
      },
      "secrets": ["WANDB_API_KEY", "HUGGINGFACE_TOKEN"],
      "runtime_info": {
        "python_version": "3.9.18",
        "cuda_version": "11.8",
        "driver_version": "520.61.05",
        "docker_version": "24.0.7"
      }
    },
    "auto_terminate": {
      "enabled": true,
      "idle_minutes": 60,
      "max_runtime_hours": 24,
      "cost_limit_usd": 500.0,
      "estimated_termination": "2024-01-16T14:35:00Z",
      "current_idle_minutes": 5
    },
    "uptime_seconds": 4200,
    "created_at": "2024-01-15T14:35:00Z",
    "updated_at": "2024-01-15T15:45:00Z",
    "last_accessed": "2024-01-15T15:42:00Z",
    "tags": {
      "team": "ml-research",
      "project": "llm-training",
      "environment": "production"
    }
  },
  "meta": {
    "request_id": "req_get_456",
    "response_time_ms": 89,
    "cache_hit": false,
    "data_freshness_seconds": 15
  }
}
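
As a rough illustration of reading this response, the sketch below pulls a few fields into a compact summary. The function name `summarize_cluster` is hypothetical. It also assumes that optional sections such as `metrics` may be missing when the corresponding `include_*` parameter is `false`, which this page does not confirm.

```python
def summarize_cluster(payload):
    """Return a compact summary from a Get Cluster Details response body."""
    data = payload["data"]
    # Assumption: "metrics" may be absent when include_metrics=false.
    current = data.get("metrics", {}).get("current", {})
    volumes = data.get("storage", {}).get("volumes", [])
    config = data["configuration"]
    return {
        "id": data["id"],
        "status": data["status"],
        "gpu": f'{config["gpu_count"]}x {config["gpu_type"]}',
        "gpu_utilization": current.get("gpu_utilization"),
        # Percentage of each volume in use, keyed by volume name.
        "volume_used_percent": {
            v["name"]: round(100 * v["used_gb"] / v["size_gb"], 1) for v in volumes
        },
    }


# Trimmed copy of the sample response above, without the metrics section.
sample = {
    "success": True,
    "data": {
        "id": "cluster_abc123",
        "status": "running",
        "configuration": {"gpu_type": "A100", "gpu_count": 4},
        "storage": {"volumes": [
            {"name": "root", "size_gb": 100, "used_gb": 45},
            {"name": "data", "size_gb": 900, "used_gb": 230},
        ]},
    },
}
print(summarize_cluster(sample))
```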

Response Fields

Status Information

status: Current cluster state (starting, running, stopping, stopped, error)
status_details: Detailed status information and health checks
status_details.health_checks: Individual component health status (GPU, storage, network, Docker)
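
A minimal sketch of combining `status` with the `health_checks` object nested under `status_details`. The function names are hypothetical. The sample response only shows the value `healthy`, so the sketch treats any other value as a problem; `degraded` in the usage example is an invented value used only for illustration.

```python
def unhealthy_components(cluster):
    """List health-check names whose value is anything other than "healthy"."""
    checks = cluster.get("status_details", {}).get("health_checks", {})
    return sorted(name for name, state in checks.items() if state != "healthy")


def is_usable(cluster):
    """True when the cluster is running and every health check passes."""
    return cluster.get("status") == "running" and not unhealthy_components(cluster)


cluster = {
    "status": "running",
    "status_details": {
        "health_checks": {
            "gpu_health": "healthy",
            "storage_health": "degraded",  # hypothetical value, not documented
            "network_health": "healthy",
            "docker_health": "healthy",
        }
    },
}
print(unhealthy_components(cluster), is_usable(cluster))
```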

Configuration Details

configuration: Hardware and software configuration
template_info: Information about the template used
environment: Environment variables and runtime information

Network and Connectivity

network: Complete networking configuration
network.ssh_connection: SSH access details and status
network.port_mappings: Exposed ports and their URLs
network.proxy_url: Main cluster access URL
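
The response already includes a ready-made `connection_string`. If the command needs to be assembled from the individual `ssh_connection` fields instead, for example to pass an argument list to a subprocess, a sketch might look like the following. The helper name `ssh_command` is hypothetical, and how `connection_string` represents a non-default port is not documented here.

```python
import shlex


def ssh_command(ssh_connection):
    """Build an ssh argument list from the ssh_connection object."""
    argv = ["ssh"]
    port = ssh_connection.get("port", 22)
    if port != 22:
        # Only pass -p when the port differs from the SSH default.
        argv += ["-p", str(port)]
    argv.append(f'{ssh_connection["username"]}@{ssh_connection["host"]}')
    return argv


conn = {"host": "ssh-abc123.tensorone.ai", "port": 22, "username": "root"}
print(shlex.join(ssh_command(conn)))
```

Returning a list rather than a string avoids shell quoting issues when the command is passed to `subprocess.run`.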

Performance Metrics

metrics.current: Real-time performance data
metrics.historical: Aggregated performance (avg, min, max, trend) over the `metrics_window` period
metrics.alerts: Active performance alerts

Cost Information

cost.current_hourly_rate: Current billing rate
cost.cost_breakdown: Detailed cost components
cost.cost_projections: Future cost estimates
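
In the sample response, the `cost_breakdown` components add up to `current_hourly_rate`, and the projections equal the hourly rate multiplied by 24 hours, 7 days, and 30 days. The sketch below reproduces that arithmetic. It is inferred from the sample values only; the API may compute projections differently, and `project_costs` is a hypothetical helper.

```python
def project_costs(hourly_rate):
    """Project costs assuming the cluster runs continuously at the current rate."""
    daily = hourly_rate * 24
    return {
        "daily_estimate": round(daily, 2),
        "weekly_estimate": round(daily * 7, 2),
        "monthly_estimate": round(daily * 30, 2),  # assumes a 30-day month
    }


breakdown = {
    "gpu_cost": 6.80,
    "cpu_cost": 0.85,
    "memory_cost": 0.45,
    "storage_cost": 0.25,
    "network_cost": 0.15,
}
hourly_rate = round(sum(breakdown.values()), 2)
print(hourly_rate, project_costs(hourly_rate))
```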

Use Cases

Cluster Health Monitoring

Monitor cluster health and performance for proactive management.

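import os

import requests

# Assumption: the key lives in an environment variable; the name is illustrative.
API_KEY = os.environ["TENSORONE_API_KEY"]


def check_cluster_health(cluster_id):
    response = requests.get(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        # Lowercase string matches the curl examples; requests would send True as "True"
        params={"include_metrics": "true"},
        timeout=30,
    )
    
    body = response.json()
    if not body.get("success"):
        raise RuntimeError(f"Failed to get cluster info: {body['error']['message']}")
    cluster = body["data"]
    
    health_status = {
        "cluster_id": cluster_id,
        "status": cluster["status"],
        "healthy": True,
        "issues": []
    }
    
    # Check GPU utilization (advisory only: does not mark the cluster unhealthy)
    gpu_util = cluster["metrics"]["current"]["gpu_utilization"]
    if gpu_util < 10:
        health_status["issues"].append("Low GPU utilization - potential waste")
    elif gpu_util > 95:
        health_status["issues"].append("Very high GPU utilization - potential bottleneck")
    
    # Check temperature (the only check here that marks the cluster unhealthy)
    temp = cluster["metrics"]["current"]["temperature_celsius"]
    if temp > 85:
        health_status["issues"].append(f"High temperature: {temp}°C")
        health_status["healthy"] = False
    
    # Check hourly cost against an example threshold (advisory only)
    hourly_cost = cluster["cost"]["current_hourly_rate"]
    if hourly_cost > 100:
        health_status["issues"].append(f"High hourly cost: ${hourly_cost}")
    
    return health_status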

Connection Information Retrieval

Get connection details for accessing cluster services.

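// Assumption: the API key is supplied via an environment variable (Node.js); the name is illustrative.
const API_KEY = process.env.TENSORONE_API_KEY;

async function getClusterConnectionInfo(clusterId) {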
  const response = await fetch(`https://api.tensorone.ai/v1/clusters/${clusterId}`, {
    headers: {
      'Authorization': 'Bearer ' + API_KEY,
      'Content-Type': 'application/json'
    }
  });
  
  const cluster = await response.json();
  
  if (!cluster.success) {
    throw new Error(`Failed to get cluster info: ${cluster.error.message}`);
  }
  
  const data = cluster.data;
  
  return {
    cluster_id: clusterId,
    name: data.name,
    status: data.status,
    ssh: {
      host: data.network.ssh_connection.host,
      port: data.network.ssh_connection.port,
      username: data.network.ssh_connection.username,
      command: data.network.ssh_connection.connection_string
    },
    web_services: data.network.port_mappings.map(mapping => ({
      name: mapping.description,
      url: mapping.url,
      port: mapping.external_port,
      status: mapping.status
    })),
    proxy_url: data.network.proxy_url
  };
}

Cost Analysis and Optimization

Analyze cluster costs and identify optimization opportunities.

import os

import requests

# Assumption: the key lives in an environment variable; the name is illustrative.
API_KEY = os.environ["TENSORONE_API_KEY"]


def analyze_cluster_costs(cluster_id):
    response = requests.get(
        f"https://api.tensorone.ai/v1/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={
            # Lowercase strings match the curl examples; requests would send True as "True"
            "include_cost_breakdown": "true",
            "include_metrics": "true",
            "metrics_window": "7d"
        },
        timeout=30,
    )
    
    body = response.json()
    if not body.get("success"):
        raise RuntimeError(f"Failed to get cluster info: {body['error']['message']}")
    cluster = body["data"]
    
    analysis = {
        "cluster_id": cluster_id,
        "current_hourly_cost": cluster["cost"]["current_hourly_rate"],
        "utilization_efficiency": {},
        "cost_optimization_suggestions": []
    }
    
    # Calculate cost efficiency from the average GPU utilization over the 7-day window
    gpu_util = cluster["metrics"]["historical"]["gpu_utilization"]["avg"]
    # GPU share of the hourly rate, as reported in cost_breakdown
    gpu_hourly_cost = cluster["cost"]["cost_breakdown"]["gpu_cost"]
    
    analysis["utilization_efficiency"] = {
        "gpu_utilization_avg": gpu_util,
        # Undefined (None) when the GPUs were idle for the whole window
        "cost_per_useful_gpu_hour": gpu_hourly_cost / (gpu_util / 100) if gpu_util > 0 else None,
        "efficiency_score": min(gpu_util / 80 * 100, 100)  # 80% is target
    }
    
    # Generate optimization suggestions (savings percentages are illustrative)
    if gpu_util < 30:
        analysis["cost_optimization_suggestions"].append({
            "type": "downgrade_gpu",
            "message": f"Low GPU utilization ({gpu_util}%). Consider smaller GPU type.",
            "potential_savings_percent": 40
        })
    
    if cluster["uptime_seconds"] > 86400 and gpu_util < 20:  # 24 hours
        analysis["cost_optimization_suggestions"].append({
            "type": "auto_terminate",
            "message": "Long runtime with low utilization. Enable auto-termination.",
            "potential_savings_percent": 60
        })
    
    return analysis

Performance Monitoring Dashboard

Create real-time performance monitoring for multiple clusters.

class ClusterMonitor {
  constructor(clusterIds, apiKey) {
    this.clusterIds = clusterIds;
    this.apiKey = apiKey;
    this.metrics = new Map();
  }
  
  async updateMetrics() {
    const promises = this.clusterIds.map(async (clusterId) => {
      try {
        const response = await fetch(`https://api.tensorone.ai/v1/clusters/${clusterId}?include_metrics=true`, {
          headers: {
            'Authorization': 'Bearer ' + this.apiKey,
            'Content-Type': 'application/json'
          }
        });
        
        const data = await response.json();
        
        if (data.success) {
          this.metrics.set(clusterId, {
            name: data.data.name,
            status: data.data.status,
            gpu_utilization: data.data.metrics.current.gpu_utilization,
            memory_utilization: data.data.metrics.current.memory_utilization,
            temperature: data.data.metrics.current.temperature_celsius,
            cost_per_hour: data.data.cost.current_hourly_rate,
            last_updated: new Date()
          });
        }
      } catch (error) {
        console.error(`Failed to update metrics for ${clusterId}:`, error);
      }
    });
    
    await Promise.all(promises);
    return this.metrics;
  }
  
  getAggregatedMetrics() {
    const clusters = Array.from(this.metrics.values());
    const count = clusters.length;
    
    return {
      total_clusters: count,
      running_clusters: clusters.filter(c => c.status === 'running').length,
      // Guard against 0 / 0 (NaN) before the first successful update
      average_gpu_utilization: count === 0
        ? 0
        : clusters.reduce((sum, c) => sum + c.gpu_utilization, 0) / count,
      total_hourly_cost: clusters.reduce((sum, c) => sum + c.cost_per_hour, 0),
      high_temperature_count: clusters.filter(c => c.temperature > 80).length
    };
  }
}

Error Handling

{
  "success": false,
  "error": {
    "code": "CLUSTER_NOT_FOUND",
    "message": "Cluster with ID 'cluster_invalid' not found",
    "details": {
      "cluster_id": "cluster_invalid",
      "suggestion": "Verify the cluster ID and ensure you have access to this cluster"
    }
  }
}
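
One way to handle this envelope in client code is to convert it into an exception, as sketched below. `ClusterApiError` and `unwrap` are hypothetical names. The sketch relies only on the `success`, `error.code`, `error.message`, and `error.details` fields shown above and makes no assumption about the HTTP status code that accompanies the error.

```python
class ClusterApiError(Exception):
    """Raised when the API returns an envelope with success=false."""

    def __init__(self, code, message, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.details = details or {}


def unwrap(payload):
    """Return payload["data"], or raise ClusterApiError for an error envelope."""
    if payload.get("success"):
        return payload["data"]
    error = payload.get("error", {})
    raise ClusterApiError(
        error.get("code", "UNKNOWN"),
        error.get("message", "no message provided"),
        error.get("details"),
    )


error_body = {
    "success": False,
    "error": {
        "code": "CLUSTER_NOT_FOUND",
        "message": "Cluster with ID 'cluster_invalid' not found",
        "details": {"cluster_id": "cluster_invalid"},
    },
}
try:
    unwrap(error_body)
except ClusterApiError as exc:
    print(exc.code, exc.details.get("cluster_id"))
```

Callers can then branch on `exc.code`, for example treating `CLUSTER_NOT_FOUND` differently from other failures.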

Security Considerations

Access Control: Ensure proper permissions for cluster access
Sensitive Data: Cluster details may contain sensitive configuration information
API Key Security: Use secure storage for API keys with appropriate scopes
Network Security: Monitor exposed ports and access patterns
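
As one example of monitoring exposed ports, the sketch below flags inbound `firewall_rules` whose `source` covers every address, such as the `0.0.0.0/0` SSH rule in the sample response. The function name `open_to_world` is hypothetical, and the sketch assumes `source` is a CIDR string; entries in any other format are skipped.

```python
import ipaddress


def open_to_world(firewall_rules):
    """Return inbound rules whose source network matches every address."""
    flagged = []
    for rule in firewall_rules:
        if rule.get("direction") != "inbound":
            continue
        try:
            network = ipaddress.ip_network(rule.get("source", ""), strict=False)
        except ValueError:
            # Assumption: non-CIDR sources may exist; they are not evaluated here.
            continue
        if network.prefixlen == 0:
            flagged.append(rule)
    return flagged


rules = [
    {"direction": "inbound", "protocol": "tcp", "port_range": "22",
     "source": "0.0.0.0/0", "description": "SSH Access"},
    {"direction": "inbound", "protocol": "tcp", "port_range": "8080",
     "source": "10.0.0.0/8", "description": "Internal web"},  # hypothetical rule
]
for rule in open_to_world(rules):
    print(f'{rule["description"]} (port {rule["port_range"]}) is open to any source')
```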

Best Practices

Regular Monitoring: Check cluster health and metrics regularly
Cost Awareness: Monitor costs and set up alerts for unexpected charges
Performance Optimization: Use metrics to optimize cluster configurations
Security Compliance: Regularly review access logs and security settings
Resource Planning: Use historical data for future capacity planning
