Training jobs are the core of TensorOne’s managed training platform. They handle the complete lifecycle of model training, from resource allocation to checkpoint management.
Create Training Job
Create a new training job with specified model architecture, dataset, and training configuration.
Required Parameters
- name: Human-readable name for the training job (1-100 characters)
- modelType: Type of model to train (llm, vision, multimodal, custom)
- datasetId: ID of the dataset to use for training
- config: Training configuration object
Optional Parameters
- description: Description of the training job
- tags: Array of tags for organization
- gpuType: Preferred GPU type (a100, h100, v100, rtx4090)
- maxWorkers: Maximum number of workers (1-32, default: 1)
- priority: Job priority (high, normal, low, default: normal)
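A minimal request supplies only the four required parameters; omitted optional fields fall back to the defaults listed above (for example, one worker and normal priority). The config shown here is illustrative, since its required contents depend on the modelType:
curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "minimal-training-job",
    "modelType": "llm",
    "datasetId": "ds_1234567890abcdef",
    "config": {
      "baseModel": "meta-llama/Llama-2-7b-hf",
      "strategy": "lora"
    }
  }'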
Example Usage
Fine-tune a Language Model
curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "llama-7b-finetune",
"modelType": "llm",
"datasetId": "ds_1234567890abcdef",
"config": {
"baseModel": "meta-llama/Llama-2-7b-hf",
"strategy": "lora",
"parameters": {
"rank": 16,
"alpha": 32,
"dropout": 0.1,
"target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
},
"training": {
"epochs": 3,
"batch_size": 4,
"learning_rate": 2e-4,
"weight_decay": 0.01,
"warmup_steps": 100,
"gradient_accumulation_steps": 8
},
"optimization": {
"mixed_precision": "fp16",
"gradient_checkpointing": true,
"max_grad_norm": 1.0
}
},
"gpuType": "a100",
"maxWorkers": 2,
"priority": "high"
}'
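Two choices in this configuration are worth noting: alpha is set to twice the LoRA rank (32 vs. 16), a common starting convention, and each optimizer step sees an effective batch of batch_size × gradient_accumulation_steps = 4 × 8 = 32 samples per worker (see the helper sketch under Best Practices).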
Train a Custom Vision Model
curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "custom-object-detection",
"modelType": "vision",
"datasetId": "ds_vision_objects_001",
"config": {
"architecture": "yolov8",
"input_size": [640, 640],
"num_classes": 80,
"training": {
"epochs": 100,
"batch_size": 16,
"learning_rate": 0.01,
"momentum": 0.937,
"weight_decay": 0.0005
},
"augmentation": {
"mixup": 0.1,
"copy_paste": 0.5,
"hsv_h": 0.015,
"hsv_s": 0.7,
"hsv_v": 0.4
}
},
"gpuType": "v100",
"maxWorkers": 4
}'
Response
Returns the created training job object:
{
"id": "job_1234567890abcdef",
"name": "llama-7b-finetune",
"status": "pending",
"modelType": "llm",
"datasetId": "ds_1234567890abcdef",
"config": {
"baseModel": "meta-llama/Llama-2-7b-hf",
"strategy": "lora",
"parameters": {
"rank": 16,
"alpha": 32,
"dropout": 0.1,
"target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
}
},
"resourceAllocation": {
"gpuType": "NVIDIA A100",
"workers": 2,
"memoryPerWorker": "40GB"
},
"estimatedCost": {
"hourly": 8.50,
"total": 25.50
},
"createdAt": "2024-01-15T10:30:00Z",
"updatedAt": "2024-01-15T10:30:00Z"
}
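Note the relationship between the estimatedCost fields: at 8.50 per hour, a total of 25.50 implies a projected run of roughly three hours. Both figures are estimates; actual consumption is reported under resourceUsage once the job is running.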
List Training Jobs
Retrieve a list of training jobs for your account.
curl -X GET "https://api.tensorone.ai/v2/training/jobs" \
-H "Authorization: Bearer YOUR_API_KEY"
Query Parameters
- status: Filter by job status (pending, running, completed, failed, cancelled)
- modelType: Filter by model type (llm, vision, multimodal, custom)
- limit: Number of jobs to return (1-100, default: 50)
- offset: Number of jobs to skip for pagination (default: 0)
- sort: Field to sort by (created_at, updated_at, name)
- order: Sort direction (asc, desc, default: desc)
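For example, to fetch the ten most recently created LLM jobs that are currently running:
curl -X GET "https://api.tensorone.ai/v2/training/jobs?status=running&modelType=llm&limit=10&sort=created_at&order=desc" \
  -H "Authorization: Bearer YOUR_API_KEY"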
Response
{
"jobs": [
{
"id": "job_1234567890abcdef",
"name": "llama-7b-finetune",
"status": "running",
"modelType": "llm",
"progress": {
"currentEpoch": 2,
"totalEpochs": 3,
"currentStep": 1247,
"totalSteps": 1875,
"percentage": 66.5
},
"metrics": {
"loss": 0.342,
"learningRate": 1.8e-4,
"throughput": "1250 tokens/sec"
},
"createdAt": "2024-01-15T10:30:00Z",
"startedAt": "2024-01-15T10:35:00Z",
"estimatedCompletion": "2024-01-15T12:45:00Z"
}
],
"pagination": {
"total": 25,
"limit": 50,
"offset": 0,
"hasMore": false
}
}
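To walk the full job list, advance offset until hasMore is false. A minimal sketch in Python, using the requests library as a stand-in for whatever HTTP client you prefer:
import requests

BASE_URL = "https://api.tensorone.ai/v2/training/jobs"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Collect every job, one page at a time
jobs = []
offset = 0
while True:
    resp = requests.get(BASE_URL, headers=HEADERS, params={"limit": 50, "offset": offset})
    resp.raise_for_status()
    page = resp.json()
    jobs.extend(page["jobs"])
    if not page["pagination"]["hasMore"]:
        break
    offset += page["pagination"]["limit"]

print(f"Fetched {len(jobs)} jobs")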
Get Training Job Details
Retrieve detailed information about a specific training job.
curl -X GET "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef" \
-H "Authorization: Bearer YOUR_API_KEY"
Response
{
"id": "job_1234567890abcdef",
"name": "llama-7b-finetune",
"status": "running",
"modelType": "llm",
"datasetId": "ds_1234567890abcdef",
"config": {
"baseModel": "meta-llama/Llama-2-7b-hf",
"strategy": "lora",
"parameters": {
"rank": 16,
"alpha": 32,
"dropout": 0.1
},
"training": {
"epochs": 3,
"batch_size": 4,
"learning_rate": 2e-4
}
},
"progress": {
"currentEpoch": 2,
"totalEpochs": 3,
"currentStep": 1247,
"totalSteps": 1875,
"percentage": 66.5,
"elapsedTime": 7200,
"remainingTime": 3600
},
"metrics": {
"currentLoss": 0.342,
"bestLoss": 0.298,
"learningRate": 1.8e-4,
"throughput": "1250 tokens/sec",
"memoryUsage": "38.2GB"
},
"resourceUsage": {
"gpuHours": 4.2,
"cost": 35.70,
"efficiency": 0.94
},
"checkpoints": [
{
"id": "ckpt_epoch_1",
"epoch": 1,
"loss": 0.456,
"createdAt": "2024-01-15T11:15:00Z",
"size": "2.3GB"
}
],
"createdAt": "2024-01-15T10:30:00Z",
"startedAt": "2024-01-15T10:35:00Z",
"estimatedCompletion": "2024-01-15T12:45:00Z"
}
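A few relationships in this payload are worth noting: percentage follows from the step counts (1247 / 1875 ≈ 66.5%), elapsedTime and remainingTime appear to be expressed in seconds (two hours elapsed, one hour remaining), and resourceUsage.cost equals gpuHours times the hourly rate (4.2 × 8.50 = 35.70).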
Stop Training Job
Stop a running training job, optionally saving the current state as a checkpoint.
curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef/stop" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"saveCheckpoint": true,
"reason": "User requested stop"
}'
Resume Training Job
Resume a stopped training job from a saved checkpoint. If checkpointId is omitted, the latest checkpoint is used.
curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef/resume" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"checkpointId": "ckpt_epoch_1"
}'
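The same stop/resume flow can be scripted against the documented endpoints. A minimal sketch in Python with the requests library:
import requests

BASE_URL = "https://api.tensorone.ai/v2/training/jobs"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
job_id = "job_1234567890abcdef"

# Stop the job, saving a checkpoint to resume from later
requests.post(f"{BASE_URL}/{job_id}/stop", headers=HEADERS,
              json={"saveCheckpoint": True, "reason": "User requested stop"}).raise_for_status()

# Later: resume from a specific saved checkpoint
requests.post(f"{BASE_URL}/{job_id}/resume", headers=HEADERS,
              json={"checkpointId": "ckpt_epoch_1"}).raise_for_status()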
SDK Examples
Python SDK
import time

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")

# Create a fine-tuning job
job = client.training.jobs.create(
    name="llama-7b-finetune",
    model_type="llm",
    dataset_id="ds_1234567890abcdef",
    config={
        "base_model": "meta-llama/Llama-2-7b-hf",
        "strategy": "lora",
        "parameters": {
            "rank": 16,
            "alpha": 32,
            "dropout": 0.1,
            "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
        },
        "training": {
            "epochs": 3,
            "batch_size": 4,
            "learning_rate": 2e-4,
            "weight_decay": 0.01
        }
    },
    gpu_type="a100",
    max_workers=2
)
print(f"Created job: {job.id}")

# Monitor training progress, polling every 30 seconds
while job.status in ["pending", "running"]:
    job = client.training.jobs.get(job.id)
    print(f"Progress: {job.progress.percentage}% - Loss: {job.metrics.current_loss}")
    time.sleep(30)

print(f"Training completed with status: {job.status}")
JavaScript SDK
import { TensorOneClient } from '@tensorone/sdk';

const client = new TensorOneClient({ apiKey: 'YOUR_API_KEY' });

// Create a training job
const job = await client.training.jobs.create({
  name: 'llama-7b-finetune',
  modelType: 'llm',
  datasetId: 'ds_1234567890abcdef',
  config: {
    baseModel: 'meta-llama/Llama-2-7b-hf',
    strategy: 'lora',
    parameters: {
      rank: 16,
      alpha: 32,
      dropout: 0.1,
      targetModules: ['q_proj', 'v_proj', 'k_proj', 'o_proj']
    },
    training: {
      epochs: 3,
      batchSize: 4,
      learningRate: 2e-4,
      weightDecay: 0.01
    }
  },
  gpuType: 'a100',
  maxWorkers: 2
});
console.log(`Created job: ${job.id}`);

// Monitor progress
const monitorJob = async (jobId) => {
  const job = await client.training.jobs.get(jobId);
  console.log(`Progress: ${job.progress.percentage}% - Loss: ${job.metrics.currentLoss}`);
  if (job.status === 'running' || job.status === 'pending') {
    setTimeout(() => monitorJob(jobId), 30000);
  } else {
    console.log(`Training completed with status: ${job.status}`);
  }
};

monitorJob(job.id);
Error Handling
Common Errors
{
"error": "INSUFFICIENT_RESOURCES",
"message": "Requested GPU type not available",
"details": {
"requestedGpuType": "h100",
"availableGpuTypes": ["a100", "v100", "rtx4090"]
}
}
{
"error": "DATASET_NOT_FOUND",
"message": "Dataset with specified ID does not exist",
"details": {
"datasetId": "ds_invalid_id"
}
}
{
"error": "CONFIGURATION_ERROR",
"message": "Invalid training configuration",
"details": {
"field": "config.training.learning_rate",
"reason": "Learning rate must be between 1e-6 and 1e-1"
}
}
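Because every error payload carries a machine-readable error code plus a details object, callers can branch on the code. A sketch in Python; the fallback-and-retry policy here is an illustrative choice, not something the API prescribes:
import requests

URL = "https://api.tensorone.ai/v2/training/jobs"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {
    "name": "llama-7b-finetune",
    "modelType": "llm",
    "datasetId": "ds_1234567890abcdef",
    "config": {"baseModel": "meta-llama/Llama-2-7b-hf", "strategy": "lora"},
    "gpuType": "h100",
}

resp = requests.post(URL, headers=HEADERS, json=payload)
if not resp.ok:
    err = resp.json()
    if err["error"] == "INSUFFICIENT_RESOURCES":
        # Retry once on the first GPU type the API reports as available
        payload["gpuType"] = err["details"]["availableGpuTypes"][0]
        resp = requests.post(URL, headers=HEADERS, json=payload)
    elif err["error"] == "CONFIGURATION_ERROR":
        raise ValueError(f'{err["details"]["field"]}: {err["details"]["reason"]}')
    else:
        resp.raise_for_status()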
Best Practices
Resource Optimization
- Choose appropriate GPU types based on model size and memory requirements
- Use gradient accumulation to increase the effective batch size on limited memory (see the sketch after this list)
- Enable mixed precision training for faster convergence and memory efficiency
- Monitor resource utilization and adjust worker count accordingly
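Gradient accumulation trades optimizer steps for memory: each update sees batch_size × gradient_accumulation_steps samples per worker. A small illustrative helper (not part of the SDK), assuming data-parallel workers:
def effective_batch_size(batch_size: int, grad_accum_steps: int, workers: int = 1) -> int:
    """Samples contributing to each optimizer update, assuming data-parallel workers."""
    return batch_size * grad_accum_steps * workers

# The LoRA example above: 4 * 8 = 32 per worker, 64 across 2 workers
print(effective_batch_size(4, 8, workers=2))  # 64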
Cost Management
- Use spot instances for non-critical training jobs to reduce costs
- Implement early stopping to prevent unnecessary training iterations (see the sketch after this list)
- Set cost alerts to monitor training expenses
- Consider using smaller models or LoRA fine-tuning for cost efficiency
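Early stopping can be driven from outside the platform by polling job metrics and calling the documented stop endpoint once the loss stops improving. A minimal sketch in Python; the patience threshold and 60-second poll interval are illustrative choices:
import time
import requests

BASE_URL = "https://api.tensorone.ai/v2/training/jobs"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
job_id = "job_1234567890abcdef"

PATIENCE = 10  # consecutive polls without improvement before stopping
best_loss = float("inf")
stale_polls = 0

while True:
    job = requests.get(f"{BASE_URL}/{job_id}", headers=HEADERS).json()
    if job["status"] not in ("pending", "running"):
        break
    if job["status"] == "running":
        loss = job["metrics"]["currentLoss"]
        if loss < best_loss:
            best_loss, stale_polls = loss, 0
        else:
            stale_polls += 1
        if stale_polls >= PATIENCE:
            requests.post(f"{BASE_URL}/{job_id}/stop", headers=HEADERS,
                          json={"saveCheckpoint": True, "reason": "Early stopping: loss plateaued"})
            break
    time.sleep(60)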
Training Stability
- Start with a conservative learning rate and increase it only if training remains stable
- Use gradient clipping to prevent exploding gradients
- Implement proper validation splits to monitor overfitting
- Save checkpoints frequently to recover from interruptions
Training jobs are billed per second of GPU usage. Jobs in pending status are not billed until they transition to running.
Large training jobs may take several minutes to provision resources. Monitor the job status and be patient during the initial setup phase.