Create Dataset

Create a new dataset in Marqtune by sending a POST request to /datasets. The response contains a presigned URL to which the dataset file should be uploaded. When using the py-marqtune client, create_dataset() handles both dataset creation and the file upload automatically. When using the REST API directly, upload the file to the presigned URL yourself; the upload triggers the dataset validation and preparation tasks.
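The two REST steps described above can be sketched with Python's standard library. This is a minimal sketch rather than the official client: the helper names are hypothetical, and using an HTTP PUT for the presigned upload is an assumption (this page only says the file is uploaded to the presigned URL). The base URL, the /datasets path, and the x-api-key header come from the cURL examples on this page.

```python
import json
import urllib.request

MARQTUNE_URL = "https://marqtune.marqo.ai"  # base URL used in this page's examples


def build_create_request(api_key, body):
    """Build (but do not send) the POST /datasets request."""
    return urllib.request.Request(
        url=f"{MARQTUNE_URL}/datasets",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def build_upload_request(upload_url, file_bytes):
    """Build (but do not send) the file upload to the presigned URL.

    Assumption: the presigned URL accepts an HTTP PUT of the raw file bytes.
    """
    return urllib.request.Request(url=upload_url, data=file_bytes, method="PUT")
```

Each request can then be sent with urllib.request.urlopen(request); the uploadUrl for the second step is read from the body of the first response.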

There are two primary types of datasets in Marqtune: training and evaluation.

  • Training Dataset: Used for training machine learning models. It includes input data and corresponding attributes.
  • Evaluation Dataset: Used for evaluating the performance of trained models. It includes query data and expected results to test the model's accuracy and effectiveness.

POST /datasets

Body Parameters

Name | Type | Default value | Description
datasetType | String | training | Required. Valid dataset types are [evaluation, training].
dataSchema | Dictionary | "" | Required. Mapping of columns to data types. The dataSchema must match the schema of the input file.
queryColumn | String | "" | Required if datasetType is evaluation.
resultColumns | List of strings | "" | Required if datasetType is evaluation.
imageDownloadHeaders | String | "" | Optional. Headers for the image download. Can be used to authenticate the image downloads.
waitForCompletion | Boolean | True | Optional (py-marqtune client only). Instructs the client to wait and poll until the operation is completed.
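The parameter rules above can be checked on the client side before sending a request. The following is a minimal sketch; build_dataset_body is a hypothetical helper, not part of py-marqtune, and it passes dataSchema through unchanged in whatever shape the caller provides.

```python
def build_dataset_body(dataset_type, data_schema, query_column=None,
                       result_columns=None, image_download_headers=None):
    """Assemble the JSON body for POST /datasets, checking required fields."""
    if dataset_type not in ("training", "evaluation"):
        raise ValueError("datasetType must be 'training' or 'evaluation'")
    if not data_schema:
        raise ValueError("dataSchema is required")

    body = {"datasetType": dataset_type, "dataSchema": data_schema}

    # queryColumn and resultColumns are only required for evaluation datasets.
    if dataset_type == "evaluation":
        if not query_column or not result_columns:
            raise ValueError(
                "evaluation datasets require queryColumn and resultColumns"
            )
        body["queryColumn"] = query_column
        body["resultColumns"] = list(result_columns)

    # imageDownloadHeaders is optional, so it is only included when given.
    if image_download_headers:
        body["imageDownloadHeaders"] = image_download_headers
    return body
```

Raising early keeps a missing queryColumn from surfacing only as a 400 response from the server.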

The dataset file should be in CSV format and must follow the structure specified in the dataSchema.

Given the following dataSchema:

data_schema = {"my_image": "image_pointer", "my_text": "text", "my_scores": "score"}

The CSV file should look like this:

my_image,my_text,my_scores
path/to/image1.jpg,"This is a sample text",0.9
path/to/image2.jpg,"Another sample text",0.8
path/to/image3.jpg,"More text",0.95
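Because the CSV must follow the dataSchema, it can help to compare the file's header row with the schema keys before uploading. This is a minimal sketch using only the standard library; check_csv_header is a hypothetical helper, and whether Marqtune tolerates extra columns is not stated on this page, so the function only reports differences.

```python
import csv
import io


def check_csv_header(csv_text, data_schema):
    """Compare the CSV header row with the schema keys.

    Returns (missing, extra): schema columns absent from the CSV, and
    CSV columns absent from the schema.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    missing = [column for column in data_schema if column not in header]
    extra = [column for column in header if column not in data_schema]
    return missing, extra
```

For the schema and CSV shown above, both lists come back empty.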

Example: Training Dataset

from marqtune.client import Client
from marqtune.enums import DatasetType

url = "https://marqtune.marqo.ai"
api_key = "{api_key}"
marqtune_client = Client(url=url, api_key=api_key)

data_schema = {
    "my_image": "image_pointer",
    "my_text": "text",
    "my_scores": "score"
}

marqtune_client.create_dataset(
    file_path="path_to_file",
    dataset_name="dataset_name",
    dataset_type=DatasetType.TRAINING,
    data_schema=data_schema,
    wait_for_completion=True
)

# Create a dataset.
curl -X POST 'https://marqtune.marqo.ai/datasets' \
-H "Content-Type: application/json" \
-H 'x-api-key: {api_key}' \
-d '{
    "datasetType": "training",
    "dataSchema": [
        {
            "my_image": "image_pointer",
            "my_text": "text",
            "my_scores": "score"
        }
    ]
}'

Example: Evaluation Dataset

from marqtune.client import Client
from marqtune.enums import DatasetType

url = "https://marqtune.marqo.ai"
api_key = "{api_key}"
marqtune_client = Client(url=url, api_key=api_key)

data_schema = {
    "my_image": "image_pointer",
    "my_text": "text",
    "my_query": "text",
    "my_scores": "score" # Optional if datasetType is evaluation
}

query_column = "my_query"

result_columns = [
    "my_image_2",
    "my_text_2"
]

marqtune_client.create_dataset(
    file_path="path_to_file",
    dataset_name="dataset_name",
    dataset_type=DatasetType.EVALUATION,
    data_schema=data_schema,
    query_column=query_column,
    result_columns=result_columns,
    wait_for_completion=True
)

# Create a dataset. The my_scores column is optional when datasetType is evaluation.
curl -X POST 'https://marqtune.marqo.ai/datasets' \
-H "Content-Type: application/json" \
-H 'x-api-key: {api_key}' \
-d '{
    "datasetType": "evaluation",
    "dataSchema": [
        {
            "my_image": "image_pointer",
            "my_text": "text",
            "my_query": "text",
            "my_scores": "score"
        }
    ],
    "queryColumn": "my_query",
    "resultColumns": [
        "my_image_2",
        "my_text_2"
    ]
}'

Response: 202 (Accepted)

The dataset creation task has been created and is now waiting for the file to be uploaded.

{
    "statusCode": 202,
    "body": {
        "uploadUrl": "upload_url",
        "datasetId": "datasetId"
    }
}
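When calling the REST API directly, the two values in this response drive the next step. The following minimal sketch pulls them out of the JSON shown above; parse_create_response is a hypothetical helper, and it assumes the response always has the statusCode and body layout shown here.

```python
import json


def parse_create_response(raw_json):
    """Return (dataset_id, upload_url) from a 202 response payload."""
    payload = json.loads(raw_json)
    if payload.get("statusCode") != 202:
        raise ValueError(f"unexpected statusCode: {payload.get('statusCode')}")
    body = payload["body"]
    return body["datasetId"], body["uploadUrl"]
```

The returned upload URL is where the dataset file is sent, and the dataset ID identifies the dataset afterwards.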

Response: 400 (Bad Request)

Required parameters not present or body is incorrect.

{
    "statusCode": 400,
    "body": {
      "message": "Invalid arguments in request body or query parameters"
    }
}

Response: 400 (Invalid Request)

Request path or method is invalid.

{
    "statusCode": 400,
    "body": {
      "message": "Invalid request method"
    }
}

Response: 401 (Unauthorised)

Unauthorised. Check your API key and try again.

{
  "message": "Unauthorized."
}

Response: 500 (Internal server error)

Internal server error. Check your API key and try again.

{
  "message": "Internal server error."
}
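The error responses above come in two shapes: the 400 responses nest the message under body, while the 401 and 500 responses carry it at the top level. A minimal sketch of a helper that reads either shape follows; extract_error_message is a hypothetical name, and the sketch assumes no other response layouts exist.

```python
def extract_error_message(payload):
    """Return the error message from either response shape shown above."""
    body = payload.get("body")
    # 400 responses: {"statusCode": 400, "body": {"message": "..."}}
    if isinstance(body, dict) and "message" in body:
        return body["message"]
    # 401 and 500 responses: {"message": "..."}
    return payload.get("message", "")
```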