Create Dataset

Create a new dataset in Marqtune by sending a POST request to /datasets. The response contains a presigned URL to which the dataset file should be uploaded. When using the py-marqtune client, create_dataset() handles both dataset creation and file upload automatically. When using the REST API directly, upload the file to the presigned URL; the upload triggers the dataset validation and preparation tasks.
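For REST usage, the two-step flow described above can be sketched with the Python standard library. This is a minimal sketch, not the py-marqtune implementation: the helper names are hypothetical, and the use of HTTP PUT for the presigned upload is an assumption (the method is not stated on this page). The endpoint, the x-api-key header, and the uploadUrl/datasetId response fields are taken from the examples on this page.

```python
import json
import urllib.request

BASE_URL = "https://marqtune.marqo.ai"


def build_create_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /datasets request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/datasets",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def build_upload_request(upload_url: str, csv_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) the file upload to the presigned URL.

    Assumption: the presigned URL accepts an HTTP PUT of the raw file bytes.
    """
    return urllib.request.Request(url=upload_url, data=csv_bytes, method="PUT")


def create_and_upload(api_key: str, body: dict, csv_path: str) -> str:
    """Create the dataset, upload the CSV, and return the dataset ID."""
    with urllib.request.urlopen(build_create_request(api_key, body)) as resp:
        payload = json.load(resp)
    # Assumes the 202 response wraps its fields in a "body" object,
    # as in the documented response example.
    info = payload["body"]
    with open(csv_path, "rb") as f:
        upload = build_upload_request(info["uploadUrl"], f.read())
    with urllib.request.urlopen(upload):
        pass  # Per the text above, the upload triggers validation and preparation.
    return info["datasetId"]
```

Splitting request construction from sending keeps the network calls in one place, so the request shape can be inspected without contacting the service.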

There are two primary types of datasets in Marqtune: training and evaluation.

Training Dataset: Used for training machine learning models. It includes input data and corresponding attributes.
Evaluation Dataset: Used for evaluating the performance of trained models. It includes query data and expected results to test the model's accuracy and effectiveness.

Note: Column names in the dataSchema should not use the prefix marqtune__ as it is reserved for internal use.
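The reserved-prefix rule can be checked locally before sending a request. The sketch below is a hypothetical client-side helper, not part of py-marqtune; it only inspects the column names of a dataSchema mapping.

```python
RESERVED_PREFIX = "marqtune__"


def find_reserved_columns(data_schema: dict) -> list:
    """Return the column names that start with the reserved prefix."""
    return [name for name in data_schema if name.startswith(RESERVED_PREFIX)]


schema = {"my_image": "image_pointer", "marqtune__id": "text"}
offending = find_reserved_columns(schema)
if offending:
    print(f"Rename these columns before uploading: {offending}")
```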

POST /datasets

Body Parameters

Name	Type	Default value	Description
`datasetType`	String	`training`	Required - Valid dataset types are [evaluation, training].
`dataSchema`	Dictionary	`""`	Required - Mapping of columns to data types. See DataSchema Details for more information.
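`queryColumn`	String	`""`	Required if datasetType is evaluation. Name of the column that holds the queries.
`resultColumns`	List of strings	`""`	Required if datasetType is evaluation. Names of the columns that make up the expected results.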
`imageDownloadHeaders`	String	`""`	Optional - Headers for the image download. Can be used to authenticate the images for download.
`normalizeUrls`	Boolean	`True`	Optional - Normalizes URLs in the dataset by converting them into a consistent, standardized format (e.g., encoding special characters, removing redundant elements). Use this when your dataset contains raw or unprocessed URLs. Set to `False` if the URLs in your dataset are already preprocessed or encoded to avoid unnecessary transformations.
`waitForCompletion`	Boolean	`True`	Optional (py-marqtune client only) - Instructs the client to poll until the operation has completed.
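To illustrate what URL normalization of the kind described for normalizeUrls can look like, here is a minimal sketch using the standard library. It is an assumption-laden illustration, not Marqtune's actual algorithm: the function name is hypothetical, and the specific choices (lower-casing the host, percent-encoding unsafe path and query characters, leaving existing percent escapes alone) are ours.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Percent-encode unsafe characters and lower-case the scheme and host."""
    parts = urlsplit(url.strip())
    # "%" is kept safe so that already-encoded URLs are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, query, parts.fragment)
    )


print(normalize_url("https://Example.com/my image.jpg"))
```

An already-encoded URL passes through this sketch unchanged, which mirrors the case where setting normalizeUrls to `False` avoids unnecessary transformations.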

DataSchema Details

The dataSchema object specifies the mapping of CSV column names to data types. Below are the details for the valid types:

Type	Description
`image_pointer`	A string representing the path or URL to the image.
`text`	Can be any value, including strings or integers.
`score`	An int or a float representing a numeric score associated with the entry.
`doc_id`	A string used to identify documents (`result_columns`) in evaluation datasets. If left unspecified, Marqtune generates unique IDs based on SHA-256 hashes of the content.
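As an illustration of content-based IDs, the sketch below derives a stable identifier from a document's field values with SHA-256. The exact scheme Marqtune uses is not specified on this page, so the function name, the field ordering, and the length-prefix encoding are all assumptions made for the example.

```python
import hashlib


def content_doc_id(*fields: str) -> str:
    """Derive a deterministic ID from the content of a document's fields."""
    digest = hashlib.sha256()
    for field in fields:
        data = field.encode("utf-8")
        # A length prefix keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


print(content_doc_id("path/to/image1.jpg", "This is a sample text"))
```

Because the ID depends only on content, two rows with identical result columns map to the same document, which is the usual reason to hash content rather than assign row numbers.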

The dataset file must be in CSV format, and its columns and data types must match the dataSchema.

Given the following dataSchema:

data_schema = {"my_image": "image_pointer", "my_text": "text", "my_scores": "score"}

The CSV file should look like this:

my_image,my_text,my_scores
path/to/image1.jpg,"This is a sample text",0.9
path/to/image2.jpg,"Another sample text",0.8
path/to/image3.jpg,"More text",0.95
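A local pre-upload check can catch mismatches between a CSV file and its dataSchema early. The following is a minimal sketch under stated assumptions: the helper name is hypothetical, the checks (known types, all schema columns present in the header, numeric `score` values) are a subset chosen for illustration, and Marqtune's own validation may differ.

```python
import csv
import io

VALID_TYPES = {"image_pointer", "text", "score", "doc_id"}


def validate_csv(csv_text: str, data_schema: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for column, col_type in data_schema.items():
        if col_type not in VALID_TYPES:
            problems.append(f"unknown type {col_type!r} for column {column!r}")
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in data_schema if c not in header]
    if missing:
        problems.append(f"columns missing from CSV: {missing}")
        return problems
    score_columns = [c for c, t in data_schema.items() if t == "score"]
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column in score_columns:
            try:
                float(row[column])
            except (TypeError, ValueError):
                problems.append(
                    f"row {row_number}: {column!r} is not numeric: {row[column]!r}"
                )
    return problems


sample = (
    "my_image,my_text,my_scores\n"
    'path/to/image1.jpg,"This is a sample text",0.9\n'
    'path/to/image2.jpg,"Another sample text",0.8\n'
)
schema = {"my_image": "image_pointer", "my_text": "text", "my_scores": "score"}
print(validate_csv(sample, schema))
```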

Example: Training Dataset

Python

from marqtune.client import Client
from marqtune.enums import DatasetType

# Replace {api_key} with your Marqtune API key.
url = "https://marqtune.marqo.ai"
api_key = "{api_key}"
marqtune_client = Client(url=url, api_key=api_key)

data_schema = {
    "my_image": "image_pointer",
    "my_text": "text",
    "my_scores": "score"
}

marqtune_client.create_dataset(
    file_path="path_to_file",
    dataset_name="dataset_name",
    dataset_type=DatasetType.TRAINING,
    data_schema=data_schema,
    wait_for_completion=True
)

cURL

# Create a training dataset.
curl -X POST 'https://marqtune.marqo.ai/datasets' \
-H 'Content-Type: application/json' \
-H 'x-api-key: {api_key}' \
-d '{
    "datasetType": "training",
    "dataSchema": [
        {
            "my_image": "image_pointer",
            "my_text": "text",
            "my_scores": "score"
        }
    ]
}'

Example: Evaluation Dataset

Python

from marqtune.client import Client
from marqtune.enums import DatasetType

# Replace {api_key} with your Marqtune API key.
url = "https://marqtune.marqo.ai"
api_key = "{api_key}"
marqtune_client = Client(url=url, api_key=api_key)

data_schema = {
    "my_image": "image_pointer",
    "my_text": "text",
    "my_query": "text",
    "my_scores": "score",  # Optional
    "docid": "doc_id",  # Optional - uniquely identifies the document (`my_image` + `my_text` in this example)
}

query_column = "my_query"

# Result columns must be columns defined in data_schema.
result_columns = [
    "my_image",
    "my_text"
]

marqtune_client.create_dataset(
    file_path="path_to_file",
    dataset_name="dataset_name",
    dataset_type=DatasetType.EVALUATION,
    data_schema=data_schema,
    query_column=query_column,
    result_columns=result_columns,
    wait_for_completion=True
)

cURL

# Create an evaluation dataset.
# "my_scores" is optional. "docid" is optional and uniquely identifies
# the document (`my_image` + `my_text` in this example).
curl -X POST 'https://marqtune.marqo.ai/datasets' \
-H 'Content-Type: application/json' \
-H 'x-api-key: {api_key}' \
-d '{
    "datasetType": "evaluation",
    "dataSchema": [
        {
            "my_image": "image_pointer",
            "my_text": "text",
            "my_query": "text",
            "my_scores": "score",
            "docid": "doc_id"
        }
    ],
    "queryColumn": "my_query",
    "resultColumns": [
        "my_image",
        "my_text"
    ]
}'
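When calling the REST API directly, it can help to assemble and sanity-check the request body in code before posting it. The sketch below is a hypothetical helper, not part of py-marqtune. It assumes that dataSchema is sent wrapped in a list, as in the cURL examples on this page, and that queryColumn and resultColumns must name columns present in the schema.

```python
import json


def build_dataset_body(dataset_type, data_schema, query_column=None, result_columns=None):
    """Assemble a POST /datasets body, raising ValueError on obvious mistakes."""
    if dataset_type not in ("training", "evaluation"):
        raise ValueError(f"invalid datasetType: {dataset_type!r}")
    # Assumption: the schema mapping is wrapped in a list, as in the cURL examples.
    body = {"datasetType": dataset_type, "dataSchema": [data_schema]}
    if dataset_type == "evaluation":
        if not query_column or not result_columns:
            raise ValueError("evaluation datasets need queryColumn and resultColumns")
        unknown = [c for c in [query_column, *result_columns] if c not in data_schema]
        if unknown:
            raise ValueError(f"columns not in dataSchema: {unknown}")
        body["queryColumn"] = query_column
        body["resultColumns"] = list(result_columns)
    return body


schema = {"my_image": "image_pointer", "my_text": "text", "my_query": "text"}
body = build_dataset_body("evaluation", schema, "my_query", ["my_image", "my_text"])
print(json.dumps(body, indent=4))
```

Serializing with json.dumps also avoids the hand-written JSON mistakes (missing commas, stray comments) that are easy to make in a shell command.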

Response: `202 Accepted`

The dataset creation task has been created and is waiting for the file to be uploaded.

{
    "statusCode": 202,
    "body": {
        "uploadUrl": "upload_url",
        "datasetId": "datasetId"
    }
}

Response: `400 (Bad Request)`

Required parameters not present or body is incorrect.

{
    "statusCode": 400,
    "body": {
      "message": "Invalid arguments in request body or query parameters"
    }
}

Response: `400 (Invalid Request)`

Request path or method is invalid.

{
    "statusCode": 400,
    "body": {
      "message": "Invalid request method"
    }
}

Response: `401 (Unauthorised)`

Unauthorised. Check your API key and try again.

{
  "message": "Unauthorized."
}

Response: `500 (Internal server error)`

Internal server error. Try the request again.

{
  "message": "Internal server error."
}
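The error responses above come in two shapes: the 400 responses nest the message under "body", while the 401 and 500 responses carry it at the top level. A client can handle both with a small helper. This is a minimal sketch with a hypothetical function name; it assumes the response has already been parsed from JSON into a dict.

```python
def extract_error_message(payload: dict) -> str:
    """Return the error message from either documented response shape."""
    body = payload.get("body")
    if isinstance(body, dict) and "message" in body:
        return body["message"]  # 400-style: {"statusCode": ..., "body": {"message": ...}}
    return payload.get("message", "unknown error")  # 401/500-style: {"message": ...}


bad_request = {"statusCode": 400, "body": {"message": "Invalid request method"}}
unauthorized = {"message": "Unauthorized."}
print(extract_error_message(bad_request))
print(extract_error_message(unauthorized))
```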
