Datasets are the foundation of successful AI training. TensorOne’s dataset management system provides secure upload, validation, preprocessing, and versioning capabilities for all types of training data.

Upload Dataset

Upload a new dataset for training. Supports various formats including text, images, audio, and structured data.

Required Parameters

  • name: Human-readable name for the dataset (1-100 characters)
  • type: Dataset type (text, image, audio, video, structured, multimodal)
  • format: Data format (jsonl, csv, parquet, hdf5, zip, tar)

Optional Parameters

  • description: Description of the dataset
  • tags: Array of tags for organization
  • validation: Validation configuration object
  • preprocessing: Preprocessing configuration object
  • metadata: Additional metadata object

Example Usage

Upload Text Dataset for Language Model Training

curl -X POST "https://api.tensorone.ai/v2/training/datasets" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "custom-instruction-dataset",
    "type": "text",
    "format": "jsonl",
    "description": "Custom instruction-response pairs for fine-tuning",
    "validation": {
      "required_fields": ["instruction", "response"],
      "max_sequence_length": 2048,
      "min_examples": 100
    },
    "preprocessing": {
      "tokenizer": "meta-llama/Llama-2-7b-hf",
      "add_special_tokens": true,
      "truncation": true,
      "padding": "max_length"
    },
    "metadata": {
      "source": "customer_conversations",
      "language": "en",
      "domain": "customer_support"
    }
  }'

Upload Image Dataset for Computer Vision

curl -X POST "https://api.tensorone.ai/v2/training/datasets" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "product-classification-images",
    "type": "image",
    "format": "zip",
    "description": "Product images with category labels",
    "validation": {
      "supported_formats": ["jpg", "png", "webp"],
      "min_resolution": [224, 224],
      "max_file_size": "10MB",
      "min_examples_per_class": 50
    },
    "preprocessing": {
      "resize": [512, 512],
      "normalize": true,
      "augmentation": {
        "horizontal_flip": 0.5,
        "rotation": 15,
        "brightness": 0.2,
        "contrast": 0.2
      }
    },
    "metadata": {
      "num_classes": 25,
      "image_source": "product_catalog",
      "annotation_format": "directory_structure"
    }
  }'

Upload Multimodal Dataset

curl -X POST "https://api.tensorone.ai/v2/training/datasets" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "vision-language-pairs",
    "type": "multimodal",
    "format": "jsonl",
    "description": "Image-caption pairs for vision-language model training",
    "validation": {
      "required_fields": ["image_path", "caption"],
      "image_formats": ["jpg", "png"],
      "min_caption_length": 5,
      "max_caption_length": 200
    },
    "preprocessing": {
      "vision": {
        "resize": [336, 336],
        "normalize": true,
        "center_crop": true
      },
      "text": {
        "tokenizer": "openai/clip-vit-base-patch32",
        "max_length": 77,
        "truncation": true
      }
    }
  }'

Response

Returns the created dataset object:
{
  "id": "ds_1234567890abcdef",
  "name": "custom-instruction-dataset",
  "type": "text",
  "format": "jsonl",
  "status": "uploading",
  "uploadUrl": "https://upload.tensorone.ai/datasets/ds_1234567890abcdef",
  "uploadToken": "tok_upload_1234567890abcdef",
  "validation": {
    "required_fields": ["instruction", "response"],
    "max_sequence_length": 2048,
    "min_examples": 100
  },
  "size": {
    "bytes": 0,
    "examples": 0
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:30:00Z"
}

Upload Data to Dataset

After creating a dataset, upload your data files using the provided upload URL and token.

Upload via Direct HTTP

curl -X PUT "https://upload.tensorone.ai/datasets/ds_1234567890abcdef" \
  -H "Authorization: Bearer tok_upload_1234567890abcdef" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @training_data.jsonl

Upload Large Files (Multipart)

For files larger than 100MB, use multipart upload:
# Initiate multipart upload
curl -X POST "https://upload.tensorone.ai/datasets/ds_1234567890abcdef/multipart" \
  -H "Authorization: Bearer tok_upload_1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "filename": "large_dataset.zip",
    "size": 2147483648
  }'
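The initiate call above reserves the multipart upload; the per-part upload and completion endpoints are not shown here, so the following is only a client-side sketch of the chunking step (the 100 MB part size mirrors the threshold above; `iter_parts` is a hypothetical helper, not part of any SDK):

```python
from typing import BinaryIO, Iterator

PART_SIZE = 100 * 1024 * 1024  # 100 MB, the documented multipart threshold

def iter_parts(f: BinaryIO, part_size: int = PART_SIZE) -> Iterator[bytes]:
    """Yield successive fixed-size chunks of a file for multipart upload."""
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        yield chunk
```

Each yielded chunk would then be sent to whatever per-part URL the initiate response provides, followed by a completion request.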

Get Dataset Details

Retrieve detailed information about a specific dataset.
curl -X GET "https://api.tensorone.ai/v2/training/datasets/ds_1234567890abcdef" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response

{
  "id": "ds_1234567890abcdef",
  "name": "custom-instruction-dataset",
  "type": "text",
  "format": "jsonl",
  "status": "ready",
  "description": "Custom instruction-response pairs for fine-tuning",
  "size": {
    "bytes": 52428800,
    "examples": 10000,
    "compressed": 15728640
  },
  "schema": {
    "fields": [
      {"name": "instruction", "type": "string", "nullable": false},
      {"name": "response", "type": "string", "nullable": false},
      {"name": "context", "type": "string", "nullable": true}
    ]
  },
  "statistics": {
    "avg_instruction_length": 45.2,
    "avg_response_length": 128.7,
    "unique_instructions": 9847,
    "language_distribution": {
      "en": 0.92,
      "es": 0.05,
      "fr": 0.03
    }
  },
  "validation": {
    "status": "passed",
    "checks": [
      {"name": "required_fields", "status": "passed"},
      {"name": "sequence_length", "status": "passed"},
      {"name": "minimum_examples", "status": "passed"}
    ],
    "warnings": [
      "2 examples exceed recommended response length"
    ]
  },
  "preprocessing": {
    "tokenizer": "meta-llama/Llama-2-7b-hf",
    "total_tokens": 1567890,
    "vocab_size": 32000
  },
  "versions": [
    {
      "id": "v1",
      "createdAt": "2024-01-15T10:30:00Z",
      "size": 52428800,
      "checksum": "sha256:a1b2c3d4e5f6..."
    }
  ],
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T11:45:00Z"
}

List Datasets

Retrieve a list of datasets for your account.
curl -X GET "https://api.tensorone.ai/v2/training/datasets" \
  -H "Authorization: Bearer YOUR_API_KEY"

Query Parameters

  • type: Filter by dataset type (text, image, audio, video, structured, multimodal)
  • status: Filter by status (uploading, processing, ready, error)
  • limit: Number of datasets to return (1-100, default: 50)
  • offset: Number of datasets to skip for pagination
  • sort: Sort order (created_at, updated_at, name, size)
  • order: Sort direction (asc, desc, default: desc)
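These filters are ordinary query-string parameters, so a request URL can be assembled with the standard library. A small sketch (parameter names follow the list above; `list_url` is a hypothetical helper, not part of any SDK):

```python
from urllib.parse import urlencode

BASE = "https://api.tensorone.ai/v2/training/datasets"

def list_url(**params) -> str:
    """Build a dataset-listing URL from the query parameters above."""
    filtered = {k: v for k, v in params.items() if v is not None}
    return f"{BASE}?{urlencode(filtered)}" if filtered else BASE

url = list_url(type="text", status="ready", limit=20, offset=40, sort="created_at", order="asc")
```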

Response

{
  "datasets": [
    {
      "id": "ds_1234567890abcdef",
      "name": "custom-instruction-dataset",
      "type": "text",
      "status": "ready",
      "size": {
        "bytes": 52428800,
        "examples": 10000
      },
      "createdAt": "2024-01-15T10:30:00Z",
      "updatedAt": "2024-01-15T11:45:00Z"
    }
  ],
  "pagination": {
    "total": 15,
    "limit": 50,
    "offset": 0,
    "hasMore": false
  }
}

Update Dataset

Update dataset metadata and configuration.
curl -X PATCH "https://api.tensorone.ai/v2/training/datasets/ds_1234567890abcdef" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "updated-instruction-dataset",
    "description": "Updated description with more context",
    "tags": ["instruction-tuning", "customer-support", "v2"],
    "metadata": {
      "source": "customer_conversations",
      "language": "en",
      "domain": "customer_support",
      "quality_score": 8.5
    }
  }'

Delete Dataset

Delete a dataset and all associated data.
curl -X DELETE "https://api.tensorone.ai/v2/training/datasets/ds_1234567890abcdef" \
  -H "Authorization: Bearer YOUR_API_KEY"

Dataset Validation

Text Dataset Validation

{
  "validation": {
    "required_fields": ["instruction", "response"],
    "optional_fields": ["context", "category"],
    "max_sequence_length": 2048,
    "min_sequence_length": 10,
    "min_examples": 100,
    "max_examples": 1000000,
    "encoding": "utf-8",
    "language_detection": true
  }
}
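Validation runs server-side after upload, but catching problems locally saves a round trip. A minimal sketch that mirrors the `required_fields` and sequence-length checks above (`validate_jsonl` is a hypothetical helper; it counts characters rather than tokens, which is only an approximation of the server-side check):

```python
import json

def validate_jsonl(lines, required_fields, min_len=10, max_len=2048):
    """Return a list of (line_number, reason) problems, approximating server-side checks."""
    problems = []
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((i, "invalid JSON"))
            continue
        for field in required_fields:
            if field not in record:
                problems.append((i, f"missing field: {field}"))
                continue
            value = record[field]
            if isinstance(value, str) and not (min_len <= len(value) <= max_len):
                problems.append((i, f"{field} length out of range"))
    return problems
```

Running this over a file before upload flags the same line numbers the `VALIDATION_FAILED` error would later report.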

Image Dataset Validation

{
  "validation": {
    "supported_formats": ["jpg", "jpeg", "png", "webp"],
    "min_resolution": [224, 224],
    "max_resolution": [4096, 4096],
    "max_file_size": "10MB",
    "min_examples_per_class": 10,
    "check_corruption": true,
    "color_space": "RGB"
  }
}

SDK Examples

Python SDK

from tensorone import TensorOneClient
import time

client = TensorOneClient(api_key="YOUR_API_KEY")

# Create dataset
dataset = client.training.datasets.create(
    name="custom-instruction-dataset",
    type="text",
    format="jsonl",
    description="Custom instruction-response pairs",
    validation={
        "required_fields": ["instruction", "response"],
        "max_sequence_length": 2048,
        "min_examples": 100
    }
)

# Upload data
with open("training_data.jsonl", "rb") as f:
    client.training.datasets.upload(dataset.id, f)

# Poll until processing completes
dataset = client.training.datasets.get(dataset.id)
while dataset.status in ("uploading", "processing"):
    print(f"Processing... {dataset.status}")
    time.sleep(10)
    dataset = client.training.datasets.get(dataset.id)

print(f"Dataset ready: {dataset.size.examples} examples")

# List datasets
datasets = client.training.datasets.list(type="text", status="ready")
for ds in datasets:
    print(f"{ds.name}: {ds.size.examples} examples")

JavaScript SDK

import { TensorOneClient } from '@tensorone/sdk';
import fs from 'fs';

const client = new TensorOneClient({ apiKey: 'YOUR_API_KEY' });

// Create dataset
const dataset = await client.training.datasets.create({
  name: 'custom-instruction-dataset',
  type: 'text',
  format: 'jsonl',
  description: 'Custom instruction-response pairs',
  validation: {
    requiredFields: ['instruction', 'response'],
    maxSequenceLength: 2048,
    minExamples: 100
  }
});

// Upload data
const fileStream = fs.createReadStream('training_data.jsonl');
await client.training.datasets.upload(dataset.id, fileStream);

// Monitor processing
const waitForReady = async (datasetId) => {
  let ds = await client.training.datasets.get(datasetId);
  while (ds.status === 'uploading' || ds.status === 'processing') {
    await new Promise((resolve) => setTimeout(resolve, 10000));
    ds = await client.training.datasets.get(datasetId);
  }
  console.log(`Dataset ready: ${ds.size.examples} examples`);
};

await waitForReady(dataset.id);

Data Formats

Text Datasets

JSONL Format for Instruction Tuning

{"instruction": "What is machine learning?", "response": "Machine learning is a subset of artificial intelligence..."}
{"instruction": "Explain neural networks", "response": "Neural networks are computational models inspired by..."}

CSV Format for Classification

text,label
"This product is amazing!",positive
"Poor quality, not recommended",negative

Image Datasets

Directory Structure

dataset/
├── class1/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
├── class2/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
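Before archiving a directory-structure dataset, it is worth confirming locally that every class clears `min_examples_per_class`. A sketch matching the tree above (`count_per_class` is a hypothetical helper):

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}

def count_per_class(root):
    """Map each class directory name to its number of image files."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts
```

Any class whose count falls below the configured minimum will fail server-side validation, so it is cheaper to catch it here.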

JSONL with Annotations

{"image_path": "images/001.jpg", "label": "cat", "bbox": [10, 20, 100, 150]}
{"image_path": "images/002.jpg", "label": "dog", "bbox": [15, 25, 120, 180]}

Error Handling

Common Errors

{
  "error": "VALIDATION_FAILED",
  "message": "Dataset validation failed",
  "details": {
    "field": "instruction",
    "reason": "Required field missing in 15 examples",
    "examples": [45, 67, 89, 123, 156]
  }
}

{
  "error": "UNSUPPORTED_FORMAT",
  "message": "File format not supported",
  "details": {
    "providedFormat": "xlsx",
    "supportedFormats": ["jsonl", "csv", "parquet"]
  }
}

{
  "error": "QUOTA_EXCEEDED",
  "message": "Storage quota exceeded",
  "details": {
    "currentUsage": "50GB",
    "quotaLimit": "50GB",
    "requestedSize": "5GB"
  }
}
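Clients can branch on the machine-readable `error` code rather than the message text. A sketch, assuming the response body has already been parsed into a dict as shown above (`summarize_error` is a hypothetical helper, and the suggested remedies are illustrative):

```python
def summarize_error(body):
    """Turn a parsed error response body into a short actionable summary."""
    code = body.get("error", "UNKNOWN")
    details = body.get("details", {})
    if code == "VALIDATION_FAILED":
        return f"fix field {details.get('field')}: {details.get('reason')}"
    if code == "UNSUPPORTED_FORMAT":
        supported = ", ".join(details.get("supportedFormats", []))
        return f"convert {details.get('providedFormat')} to one of: {supported}"
    if code == "QUOTA_EXCEEDED":
        return f"free storage: using {details.get('currentUsage')} of {details.get('quotaLimit')}"
    return body.get("message", "unknown error")
```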

Best Practices

Data Quality

  • Ensure consistent formatting across all examples
  • Remove duplicates and low-quality samples
  • Balance your dataset across different classes or categories
  • Validate data integrity before uploading

Storage Optimization

  • Use compressed formats like Parquet for structured data
  • Optimize image sizes while maintaining quality
  • Remove unnecessary metadata from files
  • Consider data deduplication for large datasets
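Deduplication before upload both shrinks storage and removes repeated training signal. A minimal exact-duplicate filter keyed on a content hash (near-duplicate detection needs more machinery, e.g. MinHash; `dedupe_records` is a hypothetical helper):

```python
import hashlib
import json

def dedupe_records(records):
    """Drop exact duplicate records, keeping first-occurrence order."""
    seen = set()
    unique = []
    for record in records:
        key = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```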

Security

  • Never include sensitive information in training data
  • Use proper access controls for private datasets
  • Implement data lineage tracking for compliance
  • Regularly audit dataset contents

Dataset processing time varies with size and complexity: text datasets typically process within minutes, while large image datasets can take several hours.

Once a dataset has been used in a training job, it cannot be deleted. Create a new version if you need to make changes.