Manage and configure intelligent alerts for proactive monitoring of TensorOne infrastructure and services.

List the currently active critical and high-severity alerts from the last 24 hours:

curl -G "https://api.tensorone.ai/v2/monitoring/alerts" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "status=active" \
  -d "severity[]=critical" \
  -d "severity[]=high" \
  -d "timeRange=24h" \
  -d "limit=20"

The response contains the matching alerts plus a summary and pagination metadata:
{
  "alerts": [
    {
      "alertId": "alert-critical-001",
      "title": "GPU Cluster High Memory Utilization",
      "description": "Memory utilization has exceeded 95% for more than 10 minutes on GPU cluster",
      "severity": "critical",
      "category": "performance",
      "status": "active",
      "source": {
        "resourceType": "cluster",
        "resourceId": "cluster-gpu-a100-001",
        "resourceName": "GPU Cluster A100-001",
        "region": "us-east-1"
      },
      "trigger": {
        "metric": "memory_utilization",
        "condition": "greater_than",
        "threshold": 90,
        "currentValue": 97.3,
        "duration": "12m 34s"
      },
      "timestamps": {
        "triggered": "2024-01-16T17:45:00Z",
        "lastUpdated": "2024-01-16T17:57:34Z",
        "acknowledged": null,
        "resolved": null
      },
      "impact": {
        "affectedUsers": 42,
        "serviceImpact": "significant",
        "estimatedCost": 125.50,
        "slaImpact": true
      },
      "recommendations": [
        "Scale up cluster to add more memory capacity",
        "Identify and terminate memory-intensive processes",
        "Enable automatic scaling if not already configured"
      ],
      "tags": ["production", "gpu", "memory"],
      "assignee": "ops-team"
    },
    {
      "alertId": "alert-high-002",
      "title": "API Response Time Degradation",
      "description": "Average API response time has increased by 150% over the last 30 minutes",
      "severity": "high",
      "category": "performance",
      "status": "active",
      "source": {
        "resourceType": "api",
        "resourceId": "api-gateway-main",
        "resourceName": "Main API Gateway",
        "region": "global"
      },
      "trigger": {
        "metric": "average_response_time",
        "condition": "greater_than",
        "threshold": 500,
        "currentValue": 847,
        "duration": "32m 18s"
      },
      "timestamps": {
        "triggered": "2024-01-16T17:30:00Z",
        "lastUpdated": "2024-01-16T18:02:18Z",
        "acknowledged": "2024-01-16T17:35:00Z",
        "resolved": null
      },
      "impact": {
        "affectedUsers": 156,
        "serviceImpact": "moderate",
        "estimatedCost": 75.25,
        "slaImpact": false
      },
      "recommendations": [
        "Check for database connection issues",
        "Review recent deployments for performance regressions",
        "Consider enabling API caching for frequent requests"
      ],
      "tags": ["api", "performance", "response-time"],
      "assignee": "backend-team"
    }
  ],
  "summary": {
    "total": 8,
    "bySeverity": {
      "critical": 1,
      "high": 2,
      "medium": 3,
      "low": 2,
      "info": 0
    },
    "byCategory": {
      "performance": 5,
      "availability": 1,
      "capacity": 1,
      "security": 1
    },
    "byStatus": {
      "active": 6,
      "acknowledged": 2,
      "resolved": 0
    },
    "trends": {
      "last24h": 8,
      "previousDay": 12,
      "weeklyAverage": 15.3
    }
  },
  "pagination": {
    "limit": 20,
    "offset": 0,
    "total": 8,
    "hasMore": false
  }
}
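The same query from Python, mirroring the curl call above (a sketch for readers following the later examples; repeated severity[] parameters are passed as a list of tuples so requests encodes each one separately):

import requests

params = [
    ("status", "active"),
    ("severity[]", "critical"),
    ("severity[]", "high"),
    ("timeRange", "24h"),
    ("limit", "20"),
]
response = requests.get(
    "https://api.tensorone.ai/v2/monitoring/alerts",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params=params,  # encoded as severity[]=critical&severity[]=high&...
)
for alert in response.json()["alerts"]:
    print(alert["severity"], alert["title"])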
Query parameters:

status - filter by alert state:
- active - Currently active alerts
- resolved - Recently resolved alerts
- all - All alerts regardless of status
- acknowledged - Alerts that have been acknowledged
- suppressed - Temporarily suppressed alerts

severity[] - filter by severity level (repeatable):
- critical - System down or severe impact
- high - High impact on performance or availability
- medium - Noticeable impact, requires attention
- low - Minor issues or early warnings
- info - Informational alerts

category - filter by alert category:
- performance - Performance degradation alerts
- availability - Service availability issues
- capacity - Resource capacity warnings
- security - Security-related alerts
- cost - Cost threshold alerts
- maintenance - Maintenance and update alerts

resourceType - filter by resource type:
- clusters - GPU cluster alerts
- endpoints - Serverless endpoint alerts
- training - Training job alerts
- ai-services - AI service alerts
- infrastructure - Platform infrastructure alerts

timeRange - filter by time window:
- 1h - Last hour
- 6h - Last 6 hours
- 24h - Last 24 hours
- 7d - Last 7 days
- 30d - Last 30 days

Within an alert object, the status field is one of: active, resolved, acknowledged, suppressed.
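For example, to pull a week of capacity alerts for GPU clusters regardless of status (parameter values taken from the lists above):

curl -G "https://api.tensorone.ai/v2/monitoring/alerts" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "status=all" \
  -d "category=capacity" \
  -d "resourceType=clusters" \
  -d "timeRange=7d"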
Custom alert rules are created through the rules endpoint. A rule pairs trigger conditions with notification actions:

# Shared imports for this and the following examples
import requests
from datetime import datetime

def create_alert_rule(name, description, conditions, actions):
    """Create a custom alert rule"""
    rule_data = {
        "name": name,
        "description": description,
        "enabled": True,
        "conditions": conditions,
        "actions": actions,
        "severity": "medium",
        "category": "custom"
    }
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/alerts/rules",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=rule_data
    )
    return response.json()

# Create GPU utilization alert
gpu_alert_rule = create_alert_rule(
    name="High GPU Utilization",
    description="Alert when GPU utilization exceeds 95% for more than 5 minutes",
    conditions=[
        {
            "metric": "gpu_utilization",
            "condition": "greater_than",
            "threshold": 95,
            "duration": "5m",
            "resourceType": "cluster"
        }
    ],
    actions=[
        {
            "type": "email",
            "recipients": ["ops@company.com"],
            "template": "gpu_high_utilization"
        },
        {
            "type": "webhook",
            "url": "https://hooks.slack.com/your/webhook/url",
            "payload": {"channel": "#alerts"}
        }
    ]
)
print(f"Created alert rule: {gpu_alert_rule['ruleId']}")
Alerts are acknowledged and resolved through the per-alert action endpoint:

def acknowledge_alert(alert_id, assignee=None, notes=None):
    """Acknowledge an alert"""
    data = {
        "action": "acknowledge",
        "assignee": assignee,
        "notes": notes,
        "timestamp": datetime.utcnow().isoformat() + "Z"
    }
    response = requests.post(
        f"https://api.tensorone.ai/v2/monitoring/alerts/{alert_id}/action",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=data
    )
    return response.json()

def resolve_alert(alert_id, resolution_notes, resolution_time=None):
    """Resolve an alert"""
    data = {
        "action": "resolve",
        "resolutionNotes": resolution_notes,
        "resolutionTime": resolution_time or datetime.utcnow().isoformat() + "Z"
    }
    response = requests.post(
        f"https://api.tensorone.ai/v2/monitoring/alerts/{alert_id}/action",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=data
    )
    return response.json()

# Example usage
acknowledge_result = acknowledge_alert(
    "alert-critical-001",
    assignee="john.doe@company.com",
    notes="Investigating memory usage patterns. Scaling up cluster resources."
)
resolve_result = resolve_alert(
    "alert-critical-001",
    resolution_notes="Added additional memory to cluster. Utilization now at 78%."
)
print(f"Alert acknowledged: {acknowledge_result['success']}")
print(f"Alert resolved: {resolve_result['success']}")
The bulk endpoint applies one action to many alerts at once, which is useful during incident response:

def bulk_alert_operations(alert_ids, action, **kwargs):
    """Perform bulk operations on multiple alerts"""
    data = {
        "alertIds": alert_ids,
        "action": action,
        **kwargs
    }
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/alerts/bulk",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=data
    )
    return response.json()

# Bulk acknowledge alerts
alert_ids = ["alert-001", "alert-002", "alert-003"]
bulk_result = bulk_alert_operations(
    alert_ids,
    action="acknowledge",
    assignee="incident-team@company.com",
    notes="Bulk acknowledged for incident response"
)
print("Bulk operation results:")
for result in bulk_result['results']:
    print(f"  {result['alertId']}: {result['status']}")
Active alerts can be grouped by resource (or another criterion) for a condensed view:

def get_grouped_alerts(grouping_criteria="resource"):
    """Get alerts grouped by specified criteria"""
    response = requests.get(
        "https://api.tensorone.ai/v2/monitoring/alerts/grouped",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params={
            "groupBy": grouping_criteria,
            "status": "active",
            "timeRange": "24h"
        }
    )
    return response.json()

# Get alerts grouped by resource
grouped_alerts = get_grouped_alerts("resource")

print("📊 Grouped Alert Summary:")
print("=" * 30)
for group in grouped_alerts['groups']:
    alert_count = len(group['alerts'])
    severity_counts = {}
    for alert in group['alerts']:
        severity = alert['severity']
        severity_counts[severity] = severity_counts.get(severity, 0) + 1

    print(f"\n🔧 {group['groupKey']} ({alert_count} alerts)")
    for severity, count in severity_counts.items():
        severity_icon = {
            'critical': '🔴',
            'high': '🟠',
            'medium': '🟡',
            'low': '🟢'
        }.get(severity, '⚪')
        print(f"  {severity_icon} {severity}: {count}")

    # Show the most critical alert in the group
    critical_alerts = [a for a in group['alerts'] if a['severity'] == 'critical']
    if critical_alerts:
        alert = critical_alerts[0]
        print(f"  📍 Most critical: {alert['title']}")
Each alert exposes a correlations endpoint that surfaces related alerts and candidate root causes:

def get_alert_correlations(alert_id):
    """Get correlated alerts and potential root causes"""
    response = requests.get(
        f"https://api.tensorone.ai/v2/monitoring/alerts/{alert_id}/correlations",
        headers={"Authorization": "Bearer YOUR_API_KEY"}
    )
    return response.json()

def analyze_root_cause(alert_id):
    """Analyze potential root causes for an alert"""
    correlations = get_alert_correlations(alert_id)

    print(f"🔍 Root Cause Analysis for Alert: {alert_id}")
    print("=" * 50)

    if 'rootCauses' in correlations:
        print("\n🎯 Potential Root Causes:")
        for cause in correlations['rootCauses'][:3]:  # Top 3
            confidence_bar = "█" * int(cause['confidence'] * 10)
            print(f"  {confidence_bar} {cause['confidence']:.0%} - {cause['description']}")
            if cause.get('evidence'):
                print(f"    Evidence: {cause['evidence']}")

    if 'correlatedAlerts' in correlations:
        print(f"\n🔗 Correlated Alerts ({len(correlations['correlatedAlerts'])}):")
        for corr_alert in correlations['correlatedAlerts'][:5]:
            print(f"  • {corr_alert['title']} (correlation: {corr_alert['correlationScore']:.2f})")

    if 'timeline' in correlations:
        print("\n⏰ Event Timeline:")
        for event in correlations['timeline']:
            print(f"  {event['timestamp']} - {event['description']}")

# Example usage
analyze_root_cause("alert-critical-001")
The predictive endpoint forecasts likely issues from trends and anomalies before they trigger alerts:

def get_predictive_alerts():
    """Get predictive alerts based on trends and anomalies"""
    response = requests.get(
        "https://api.tensorone.ai/v2/monitoring/alerts/predictive",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params={
            "predictionWindow": "2h",
            "confidenceThreshold": 0.7
        }
    )
    return response.json()

def display_predictive_alerts():
    """Display predictive alerts dashboard"""
    predictions = get_predictive_alerts()

    print("🔮 Predictive Alert Dashboard")
    print("=" * 35)

    if predictions['predictions']:
        print(f"\n⚠️ {len(predictions['predictions'])} Potential Issues Detected:")
        for prediction in predictions['predictions']:
            confidence_str = f"{prediction['confidence']:.0%}"
            eta = prediction['estimatedTimeToIssue']
            risk_icon = "🔴" if prediction['confidence'] > 0.9 else "🟡"

            print(f"\n{risk_icon} {prediction['title']} ({confidence_str} confidence)")
            print(f"  Resource: {prediction['resource']['name']}")
            print(f"  Estimated time to issue: {eta}")
            print(f"  Predicted impact: {prediction['predictedSeverity']}")
            if prediction['recommendations']:
                print(f"  💡 Preventive action: {prediction['recommendations'][0]}")
    else:
        print("\n✅ No potential issues detected in the next 2 hours")

    # Trend analysis
    if 'trends' in predictions:
        print("\n📈 Trend Analysis:")
        for trend in predictions['trends']:
            direction = "📈" if trend['direction'] == 'increasing' else "📉"
            print(f"  {direction} {trend['metric']}: {trend['description']}")

display_predictive_alerts()
Webhook integrations push alert lifecycle events to external systems such as Slack:

def setup_webhook_integration(webhook_url, events=None):
    """Set up webhook integration for alerts"""
    integration_data = {
        "type": "webhook",
        "name": "Alert Webhook Integration",
        "config": {
            "url": webhook_url,
            "method": "POST",
            "headers": {
                "Content-Type": "application/json",
                "Authorization": "Bearer YOUR_WEBHOOK_TOKEN"
            }
        },
        "events": events or [
            "alert.triggered",
            "alert.resolved",
            "alert.acknowledged"
        ],
        "filters": {
            "severity": ["critical", "high"],
            "category": ["performance", "availability"]
        }
    }
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/integrations",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=integration_data
    )
    return response.json()

# Set up Slack webhook integration
slack_integration = setup_webhook_integration(
    "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
    events=["alert.triggered", "alert.resolved"]
)
print(f"Webhook integration created: {slack_integration['integrationId']}")
Automations attach actions directly to alert conditions, for example scaling a cluster when CPU utilization stays high:

def create_automated_response(trigger_conditions, actions):
    """Create automated response for specific alert conditions"""
    automation_data = {
        "name": "Auto Scale on High CPU",
        "description": "Automatically scale clusters when CPU utilization is high",
        "enabled": True,
        "triggerConditions": trigger_conditions,
        "actions": actions,
        "cooldownPeriod": "10m"  # Prevent rapid successive triggers
    }
    response = requests.post(
        "https://api.tensorone.ai/v2/monitoring/automations",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json=automation_data
    )
    return response.json()

# Create auto-scaling automation
auto_scale_response = create_automated_response(
    trigger_conditions=[
        {
            "alertCategory": "performance",
            "metric": "cpu_utilization",
            "threshold": 85,
            "duration": "5m",
            "resourceType": "cluster"
        }
    ],
    actions=[
        {
            "type": "scale_cluster",
            "parameters": {
                "scaleDirection": "up",
                "scaleAmount": 1
            }
        },
        {
            "type": "notify",
            "parameters": {
                "channel": "slack",
                "message": "Auto-scaled cluster due to high CPU utilization"
            }
        }
    ]
)
print(f"Automation created: {auto_scale_response['automationId']}")