Training jobs are the core of TensorOne’s managed training platform. They handle the complete lifecycle of model training, from resource allocation to checkpoint management.
Create Training Job
Create a new training job with specified model architecture, dataset, and training configuration.
Required Parameters
- name: Human-readable name for the training job (1-100 characters)
- modelType: Type of model to train (llm, vision, multimodal, custom)
- datasetId: ID of the dataset to use for training
- config: Training configuration object
Optional Parameters
- description: Description of the training job
- tags: Array of tags for organization
- gpuType: Preferred GPU type (a100, h100, v100, rtx4090)
- maxWorkers: Maximum number of workers (1-32, default: 1)
- priority: Job priority (high, normal, low, default: normal)
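A minimal request supplies only the four required parameters; omitted optional fields fall back to the defaults listed above (for example, one worker and normal priority). The config shown here is illustrative, since its required contents depend on the modelType:
curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "minimal-training-job",
    "modelType": "llm",
    "datasetId": "ds_1234567890abcdef",
    "config": {
      "baseModel": "meta-llama/Llama-2-7b-hf",
      "strategy": "lora"
    }
  }'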
Example Usage
Fine-tune a Language Model
curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "llama-7b-finetune",
"modelType": "llm",
"datasetId": "ds_1234567890abcdef",
"config": {
"baseModel": "meta-llama/Llama-2-7b-hf",
"strategy": "lora",
"parameters": {
"rank": 16,
"alpha": 32,
"dropout": 0.1,
"target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
},
"training": {
"epochs": 3,
"batch_size": 4,
"learning_rate": 2e-4,
"weight_decay": 0.01,
"warmup_steps": 100,
"gradient_accumulation_steps": 8
},
"optimization": {
"mixed_precision": "fp16",
"gradient_checkpointing": true,
"max_grad_norm": 1.0
}
},
"gpuType": "a100",
"maxWorkers": 2,
"priority": "high"
}'
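Two choices in this configuration are worth noting: alpha is set to twice the LoRA rank (32 vs. 16), a common starting convention, and each optimizer step sees an effective batch of batch_size × gradient_accumulation_steps = 4 × 8 = 32 samples per worker (see the helper sketch under Best Practices).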
Train a Custom Vision Model
curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "custom-object-detection",
"modelType": "vision",
"datasetId": "ds_vision_objects_001",
"config": {
"architecture": "yolov8",
"input_size": [640, 640],
"num_classes": 80,
"training": {
"epochs": 100,
"batch_size": 16,
"learning_rate": 0.01,
"momentum": 0.937,
"weight_decay": 0.0005
},
"augmentation": {
"mixup": 0.1,
"copy_paste": 0.5,
"hsv_h": 0.015,
"hsv_s": 0.7,
"hsv_v": 0.4
}
},
"gpuType": "v100",
"maxWorkers": 4
}'
Response
Returns the created training job object:
{
"id": "job_1234567890abcdef",
"name": "llama-7b-finetune",
"status": "pending",
"modelType": "llm",
"datasetId": "ds_1234567890abcdef",
"config": {
"baseModel": "meta-llama/Llama-2-7b-hf",
"strategy": "lora",
"parameters": {
"rank": 16,
"alpha": 32,
"dropout": 0.1,
"target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
}
},
"resourceAllocation": {
"gpuType": "NVIDIA A100",
"workers": 2,
"memoryPerWorker": "40GB"
},
"estimatedCost": {
"hourly": 8.50,
"total": 25.50
},
"createdAt": "2024-01-15T10:30:00Z",
"updatedAt": "2024-01-15T10:30:00Z"
}
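Note the relationship between the estimatedCost fields: at 8.50 per hour, a total of 25.50 implies a projected run of roughly three hours. Both figures are estimates; actual consumption is reported under resourceUsage once the job is running.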
List Training Jobs
Retrieve a list of training jobs for your account.
curl -X GET "https://api.tensorone.ai/v2/training/jobs" \
-H "Authorization: Bearer YOUR_API_KEY"
Query Parameters
- status: Filter by job status (pending, running, completed, failed, cancelled)
- modelType: Filter by model type (llm, vision, multimodal, custom)
- limit: Number of jobs to return (1-100, default: 50)
- offset: Number of jobs to skip for pagination (default: 0)
- sort: Field to sort by (created_at, updated_at, name)
- order: Sort direction (asc, desc, default: desc)
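For example, to fetch the ten most recently created LLM jobs that are currently running:
curl -X GET "https://api.tensorone.ai/v2/training/jobs?status=running&modelType=llm&limit=10&sort=created_at&order=desc" \
  -H "Authorization: Bearer YOUR_API_KEY"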
Response
{
"jobs": [
{
"id": "job_1234567890abcdef",
"name": "llama-7b-finetune",
"status": "running",
"modelType": "llm",
"progress": {
"currentEpoch": 2,
"totalEpochs": 3,
"currentStep": 1247,
"totalSteps": 1875,
"percentage": 66.5
},
"metrics": {
"loss": 0.342,
"learningRate": 1.8e-4,
"throughput": "1250 tokens/sec"
},
"createdAt": "2024-01-15T10:30:00Z",
"startedAt": "2024-01-15T10:35:00Z",
"estimatedCompletion": "2024-01-15T12:45:00Z"
}
],
"pagination": {
"total": 25,
"limit": 50,
"offset": 0,
"hasMore": false
}
}
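To walk the full job list, advance offset until hasMore is false. A minimal sketch in Python, using the requests library as a stand-in for whatever HTTP client you prefer:
import requests

BASE_URL = "https://api.tensorone.ai/v2/training/jobs"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Collect every job, one page at a time
jobs = []
offset = 0
while True:
    resp = requests.get(BASE_URL, headers=HEADERS, params={"limit": 50, "offset": offset})
    resp.raise_for_status()
    page = resp.json()
    jobs.extend(page["jobs"])
    if not page["pagination"]["hasMore"]:
        break
    offset += page["pagination"]["limit"]

print(f"Fetched {len(jobs)} jobs")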
Get Training Job Details
Retrieve detailed information about a specific training job.
curl -X GET "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef" \
-H "Authorization: Bearer YOUR_API_KEY"
Response
{
"id": "job_1234567890abcdef",
"name": "llama-7b-finetune",
"status": "running",
"modelType": "llm",
"datasetId": "ds_1234567890abcdef",
"config": {
"baseModel": "meta-llama/Llama-2-7b-hf",
"strategy": "lora",
"parameters": {
"rank": 16,
"alpha": 32,
"dropout": 0.1
},
"training": {
"epochs": 3,
"batch_size": 4,
"learning_rate": 2e-4
}
},
"progress": {
"currentEpoch": 2,
"totalEpochs": 3,
"currentStep": 1247,
"totalSteps": 1875,
"percentage": 66.5,
"elapsedTime": 7200,
"remainingTime": 3600
},
"metrics": {
"currentLoss": 0.342,
"bestLoss": 0.298,
"learningRate": 1.8e-4,
"throughput": "1250 tokens/sec",
"memoryUsage": "38.2GB"
},
"resourceUsage": {
"gpuHours": 4.2,
"cost": 35.70,
"efficiency": 0.94
},
"checkpoints": [
{
"id": "ckpt_epoch_1",
"epoch": 1,
"loss": 0.456,
"createdAt": "2024-01-15T11:15:00Z",
"size": "2.3GB"
}
],
"createdAt": "2024-01-15T10:30:00Z",
"startedAt": "2024-01-15T10:35:00Z",
"estimatedCompletion": "2024-01-15T12:45:00Z"
}
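A few relationships in this payload are worth noting: percentage follows from the step counts (1247 / 1875 ≈ 66.5%), elapsedTime and remainingTime appear to be expressed in seconds (two hours elapsed, one hour remaining), and resourceUsage.cost equals gpuHours times the hourly rate (4.2 × 8.50 = 35.70).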
Stop Training Job
Stop a running training job, optionally saving the current state as a checkpoint.
curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef/stop" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"saveCheckpoint": true,
"reason": "User requested stop"
}'
Resume Training Job
Resume a stopped training job from a saved checkpoint. If checkpointId is omitted, the latest checkpoint is used.
curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef/resume" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"checkpointId": "ckpt_epoch_1"
}'
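The same stop/resume flow can be scripted against the documented endpoints. A minimal sketch in Python with the requests library:
import requests

BASE_URL = "https://api.tensorone.ai/v2/training/jobs"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
job_id = "job_1234567890abcdef"

# Stop the job, saving a checkpoint to resume from later
requests.post(f"{BASE_URL}/{job_id}/stop", headers=HEADERS,
              json={"saveCheckpoint": True, "reason": "User requested stop"}).raise_for_status()

# Later: resume from a specific saved checkpoint
requests.post(f"{BASE_URL}/{job_id}/resume", headers=HEADERS,
              json={"checkpointId": "ckpt_epoch_1"}).raise_for_status()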
SDK Examples
Python SDK
import time

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")

# Create a fine-tuning job
job = client.training.jobs.create(
    name="llama-7b-finetune",
    model_type="llm",
    dataset_id="ds_1234567890abcdef",
    config={
        "base_model": "meta-llama/Llama-2-7b-hf",
        "strategy": "lora",
        "parameters": {
            "rank": 16,
            "alpha": 32,
            "dropout": 0.1,
            "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
        },
        "training": {
            "epochs": 3,
            "batch_size": 4,
            "learning_rate": 2e-4,
            "weight_decay": 0.01
        }
    },
    gpu_type="a100",
    max_workers=2
)
print(f"Created job: {job.id}")

# Monitor training progress, polling every 30 seconds
while job.status in ["pending", "running"]:
    job = client.training.jobs.get(job.id)
    print(f"Progress: {job.progress.percentage}% - Loss: {job.metrics.current_loss}")
    time.sleep(30)

print(f"Training completed with status: {job.status}")
JavaScript SDK
import { TensorOneClient } from '@tensorone/sdk';

const client = new TensorOneClient({ apiKey: 'YOUR_API_KEY' });

// Create a training job
const job = await client.training.jobs.create({
  name: 'llama-7b-finetune',
  modelType: 'llm',
  datasetId: 'ds_1234567890abcdef',
  config: {
    baseModel: 'meta-llama/Llama-2-7b-hf',
    strategy: 'lora',
    parameters: {
      rank: 16,
      alpha: 32,
      dropout: 0.1,
      targetModules: ['q_proj', 'v_proj', 'k_proj', 'o_proj']
    },
    training: {
      epochs: 3,
      batchSize: 4,
      learningRate: 2e-4,
      weightDecay: 0.01
    }
  },
  gpuType: 'a100',
  maxWorkers: 2
});
console.log(`Created job: ${job.id}`);

// Monitor progress
const monitorJob = async (jobId) => {
  const job = await client.training.jobs.get(jobId);
  console.log(`Progress: ${job.progress.percentage}% - Loss: ${job.metrics.currentLoss}`);
  if (job.status === 'running' || job.status === 'pending') {
    setTimeout(() => monitorJob(jobId), 30000);
  } else {
    console.log(`Training completed with status: ${job.status}`);
  }
};

monitorJob(job.id);
Error Handling
Common Errors
{
"error": "INSUFFICIENT_RESOURCES",
"message": "Requested GPU type not available",
"details": {
"requestedGpuType": "h100",
"availableGpuTypes": ["a100", "v100", "rtx4090"]
}
}
{
"error": "DATASET_NOT_FOUND",
"message": "Dataset with specified ID does not exist",
"details": {
"datasetId": "ds_invalid_id"
}
}
{
"error": "CONFIGURATION_ERROR",
"message": "Invalid training configuration",
"details": {
"field": "config.training.learning_rate",
"reason": "Learning rate must be between 1e-6 and 1e-1"
}
}
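Because every error payload carries a machine-readable error code plus a details object, callers can branch on the code. A sketch in Python; the fallback-and-retry policy here is an illustrative choice, not something the API prescribes:
import requests

URL = "https://api.tensorone.ai/v2/training/jobs"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {
    "name": "llama-7b-finetune",
    "modelType": "llm",
    "datasetId": "ds_1234567890abcdef",
    "config": {"baseModel": "meta-llama/Llama-2-7b-hf", "strategy": "lora"},
    "gpuType": "h100",
}

resp = requests.post(URL, headers=HEADERS, json=payload)
if not resp.ok:
    err = resp.json()
    if err["error"] == "INSUFFICIENT_RESOURCES":
        # Retry once on the first GPU type the API reports as available
        payload["gpuType"] = err["details"]["availableGpuTypes"][0]
        resp = requests.post(URL, headers=HEADERS, json=payload)
    elif err["error"] == "CONFIGURATION_ERROR":
        raise ValueError(f'{err["details"]["field"]}: {err["details"]["reason"]}')
    else:
        resp.raise_for_status()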
Best Practices
Resource Optimization
- Choose appropriate GPU types based on model size and memory requirements
- Use gradient accumulation to increase the effective batch size on limited memory (see the sketch after this list)
- Enable mixed precision training for faster convergence and memory efficiency
- Monitor resource utilization and adjust worker count accordingly
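Gradient accumulation trades optimizer steps for memory: each update sees batch_size × gradient_accumulation_steps samples per worker. A small illustrative helper (not part of the SDK), assuming data-parallel workers:
def effective_batch_size(batch_size: int, grad_accum_steps: int, workers: int = 1) -> int:
    """Samples contributing to each optimizer update, assuming data-parallel workers."""
    return batch_size * grad_accum_steps * workers

# The LoRA example above: 4 * 8 = 32 per worker, 64 across 2 workers
print(effective_batch_size(4, 8, workers=2))  # 64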
Cost Management
- Use spot instances for non-critical training jobs to reduce costs
- Implement early stopping to prevent unnecessary training iterations (see the sketch after this list)
- Set cost alerts to monitor training expenses
- Consider using smaller models or LoRA fine-tuning for cost efficiency
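Early stopping can be driven from outside the platform by polling job metrics and calling the documented stop endpoint once the loss stops improving. A minimal sketch in Python; the patience threshold and 60-second poll interval are illustrative choices:
import time
import requests

BASE_URL = "https://api.tensorone.ai/v2/training/jobs"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
job_id = "job_1234567890abcdef"

PATIENCE = 10  # consecutive polls without improvement before stopping
best_loss = float("inf")
stale_polls = 0

while True:
    job = requests.get(f"{BASE_URL}/{job_id}", headers=HEADERS).json()
    if job["status"] not in ("pending", "running"):
        break
    if job["status"] == "running":
        loss = job["metrics"]["currentLoss"]
        if loss < best_loss:
            best_loss, stale_polls = loss, 0
        else:
            stale_polls += 1
        if stale_polls >= PATIENCE:
            requests.post(f"{BASE_URL}/{job_id}/stop", headers=HEADERS,
                          json={"saveCheckpoint": True, "reason": "Early stopping: loss plateaued"})
            break
    time.sleep(60)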
Training Stability
- Start with a conservative learning rate and increase it only if training remains stable
- Use gradient clipping to prevent exploding gradients
- Implement proper validation splits to monitor overfitting
- Save checkpoints frequently to recover from interruptions
Training jobs are billed per second of GPU usage. Jobs in pending status are not billed until they transition to running.
Large training jobs may take several minutes to provision resources. Monitor the job status and be patient during the initial setup phase.