
Create Dataset

Create a new dataset in Marqtune by posting to /datasets. This returns a presigned URL to which the dataset file should be uploaded. When using the py-marqtune client, create_dataset() handles both the dataset creation and the file upload automatically. When using the REST API directly, upload the file to the presigned URL yourself; the upload triggers the dataset validation and preparation tasks.

There are two primary types of datasets in Marqtune: training and evaluation.

  • Training Dataset: Used for training machine learning models. It includes input data and corresponding attributes.
  • Evaluation Dataset: Used for evaluating the performance of trained models. It includes query data and expected results to test the model's accuracy and effectiveness.

Note: Column names in the dataSchema should not use the prefix marqtune__ as it is reserved for internal use.
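As an illustration of this rule, a small pre-flight check could reject reserved column names before the dataset is created. This helper is hypothetical (it is not part of py-marqtune); the server performs its own validation:

```python
# Hypothetical helper (not part of py-marqtune): reject dataSchema columns
# that use the reserved "marqtune__" prefix before creating the dataset.
RESERVED_PREFIX = "marqtune__"

def check_reserved_columns(data_schema: dict) -> None:
    bad = [col for col in data_schema if col.startswith(RESERVED_PREFIX)]
    if bad:
        raise ValueError(
            f"Column names may not use the reserved prefix {RESERVED_PREFIX!r}: {bad}"
        )
```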


POST /datasets

Body Parameters

| Name | Type | Default value | Description |
| --- | --- | --- | --- |
| datasetType | String | training | Required - Valid dataset types are [evaluation, training]. |
| dataSchema | Dictionary | "" | Required - Mapping of columns to data types. See DataSchema Details for more information. |
| queryColumn | String | "" | Required if datasetType is evaluation - Name of the column containing queries. |
| resultColumns | Array | "" | Required if datasetType is evaluation - Names of the columns that identify the expected results. |
| imageDownloadHeaders | String | "" | Optional - Headers for the image download. Can be used to authenticate the images for download. |
| waitForCompletion | Boolean | True | Optional [py-marqtune client only] - Instructs the client to continuously wait and poll until the operation is completed. |

DataSchema Details

The dataSchema object specifies the mapping of CSV column names to data types. Below are the details for the valid types:

| Type | Description |
| --- | --- |
| image_pointer | A string representing the path or URL to the image. |
| text | Can be any value, including strings or integers. |
| score | An int or a float representing a numeric score associated with the entry. |
| doc_id | A string used to identify documents (result_columns) in evaluation datasets. If left unspecified, Marqtune will generate unique ids based on sha256 hashes of the content. |
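The auto-generated ids can be thought of as content hashes. The sketch below is illustrative only: the field separator and exact hashing scheme are assumptions, not Marqtune's documented internals.

```python
import hashlib

def content_doc_id(*fields: str) -> str:
    """Illustrative content-based id: sha256 over a row's field values.
    The separator and hashing scheme are assumptions for demonstration."""
    joined = "\x1f".join(fields)  # unit separator keeps field boundaries unambiguous
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```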

Ensure that the dataSchema defined matches the structure and data types of your input CSV file.

The dataset file should be in CSV format and must follow the structure specified in the dataSchema.

Given the following dataSchema:

data_schema = {"my_image": "image_pointer", "my_text": "text", "my_scores": "score"}
The CSV file should look like this:

my_image,my_text,my_scores
path/to/image1.jpg,"This is a sample text",0.9
path/to/image2.jpg,"Another sample text",0.8
path/to/image3.jpg,"More text",0.95
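Before uploading, it can help to sanity-check the CSV against the dataSchema locally. This is a minimal sketch, not the validation Marqtune performs server-side:

```python
import csv
import io

def check_csv_against_schema(csv_text: str, data_schema: dict) -> None:
    """Raise ValueError if the CSV header is missing schema columns or a
    `score` column contains a non-numeric value. Illustrative only."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set(data_schema) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing schema columns: {sorted(missing)}")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column, column_type in data_schema.items():
            if column_type == "score":
                try:
                    float(row[column])
                except ValueError:
                    raise ValueError(
                        f"Line {line_no}: {column!r} must be numeric, got {row[column]!r}"
                    )
```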

Example: Training Dataset

from marqtune.client import Client
from marqtune.enums import DatasetType

url = "https://marqtune.marqo.ai"
api_key = "{api_key}"
marqtune_client = Client(url=url, api_key=api_key)

data_schema = {
    "my_image": "image_pointer",
    "my_text": "text",
    "my_scores": "score"
}

marqtune_client.create_dataset(
    file_path="path_to_file",
    dataset_name="dataset_name",
    dataset_type=DatasetType.TRAINING,
    data_schema=data_schema,
    wait_for_completion=True
)
# Create a training dataset.
curl -X POST 'https://marqtune.marqo.ai/datasets' \
-H "Content-Type: application/json" \
-H 'x-api-key: {api_key}' \
-d '{
    "datasetType": "training",
    "dataSchema": {
        "my_image": "image_pointer",
        "my_text": "text",
        "my_scores": "score"
    }
}'

Example: Evaluation Dataset

from marqtune.client import Client
from marqtune.enums import DatasetType

url = "https://marqtune.marqo.ai"
api_key = "{api_key}"
marqtune_client = Client(url=url, api_key=api_key)

data_schema = {
    "my_image": "image_pointer",
    "my_text": "text",
    "my_query": "text",
    "my_scores": "score",  # Optional
    "docid": "doc_id"  # Optional - uniquely identifies the document (`my_image` + `my_text` in this example)
}

query_column = "my_query"

result_columns = [
    "my_image",
    "my_text"
]

marqtune_client.create_dataset(
    file_path="path_to_file",
    dataset_name="dataset_name",
    dataset_type=DatasetType.EVALUATION,
    data_schema=data_schema,
    query_column=query_column,
    result_columns=result_columns,
    wait_for_completion=True
)
# Create an evaluation dataset. Note that JSON does not allow comments:
# `my_scores` and `docid` are optional columns.
curl -X POST 'https://marqtune.marqo.ai/datasets' \
-H "Content-Type: application/json" \
-H 'x-api-key: {api_key}' \
-d '{
    "datasetType": "evaluation",
    "dataSchema": {
        "my_image": "image_pointer",
        "my_text": "text",
        "my_query": "text",
        "my_scores": "score",
        "docid": "doc_id"
    },
    "queryColumn": "my_query",
    "resultColumns": [
        "my_image",
        "my_text"
    ]
}'

Response: 202 Accepted

The dataset creation task has been created and is now waiting for the file to be uploaded.

    {
        "statusCode": 202,
        "body": {
            "uploadUrl": "upload_url",
            "datasetId": "datasetId"
        }
    }
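When using the REST API directly, the file itself still needs to be sent to the returned uploadUrl. A minimal sketch, assuming the presigned URL accepts an HTTP PUT of the raw CSV bytes (typical for S3-style presigned URLs; check the actual response for details):

```python
import urllib.request

def build_upload_request(upload_url: str, file_bytes: bytes) -> urllib.request.Request:
    """Build a PUT request sending the dataset file to the presigned URL.
    The PUT method and Content-Type here are assumptions about the presigned
    URL's expectations, not documented Marqtune behaviour."""
    return urllib.request.Request(
        upload_url,
        data=file_bytes,
        method="PUT",
        headers={"Content-Type": "text/csv"},
    )

# To perform the upload:
# with open("dataset.csv", "rb") as f:
#     urllib.request.urlopen(build_upload_request(upload_url, f.read()))
```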

Response: 400 (Bad Request)

Required parameters are missing or the request body is malformed.

{
    "statusCode": 400,
    "body": {
      "message": "Invalid arguments in request body or query parameters"
    }
}

Response: 400 (Invalid Request)

Request path or method is invalid.

{
    "statusCode": 400,
    "body": {
      "message": "Invalid request method"
    }
}

Response: 401 (Unauthorised)

Unauthorised. Check your API key and try again.

{
  "message": "Unauthorized."
}

Response: 500 (Internal server error)

Internal server error. Try the request again; if the problem persists, contact Marqo support.

{
  "message": "Internal server error."
}