Create Dataset
Create a new dataset in Marqtune by posting to /datasets. This returns a presigned URL to which the dataset file should be uploaded. When using the py-marqtune client, create_dataset() handles both dataset creation and file upload automatically. For REST API usage, upload the file to the presigned URL; the upload triggers the dataset validation and preparation tasks.
There are two primary types of datasets in Marqtune: training and evaluation.
- Training Dataset: Used for training machine learning models. It includes input data and corresponding attributes.
- Evaluation Dataset: Used for evaluating the performance of trained models. It includes query data and expected results to test the model's accuracy and effectiveness.
Note: Column names in the dataSchema should not use the prefix marqtune__ as it is reserved for internal use.
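A client-side check for the reserved prefix can catch this before a request is sent. The sketch below is illustrative only: `find_reserved_columns` is a hypothetical helper and not part of py-marqtune.

```python
RESERVED_PREFIX = "marqtune__"  # reserved for internal use, per the note above


def find_reserved_columns(data_schema):
    """Return the column names in data_schema that start with the reserved prefix."""
    return [name for name in data_schema if name.startswith(RESERVED_PREFIX)]


schema = {"my_image": "image_pointer", "marqtune__id": "text"}
offending = find_reserved_columns(schema)
if offending:
    print(f"Rename these columns before creating the dataset: {offending}")
```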
POST /datasets
Body Parameters
Name | Type | Default value | Description |
---|---|---|---|
datasetType | String | training | Required - Valid dataset types are [evaluation, training]. |
dataSchema | Dictionary | "" | Required - Mapping of columns to data types. See DataSchema Details for more information. |
queryColumn | String | "" | Required - if datasetType is evaluation. |
resultColumns | String | "" | Required - if datasetType is evaluation. |
imageDownloadHeaders | String | "" | Optional - Headers for the image download. Can be used to authenticate the images for download. |
normalizeUrls | Boolean | True | Optional - Normalizes URLs in the dataset by converting them into a consistent, standardized format (e.g., encoding special characters, removing redundant elements). Use this when your dataset contains raw or unprocessed URLs. Set to False if the URLs in your dataset are already preprocessed or encoded to avoid unnecessary transformations. |
waitForCompletion | Boolean | True | Optional [py-marqtune client only] - Instructs the client to continuously wait and poll until the operation is completed. |
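The required and conditional parameters in the table can be checked locally before posting. The following is a minimal sketch, assuming the request body is plain JSON keyed by the parameter names above. `build_dataset_body` is a hypothetical helper, not part of py-marqtune, and the server remains the authority on validation. waitForCompletion is omitted because it applies to the client only.

```python
import json

VALID_DATASET_TYPES = {"training", "evaluation"}


def build_dataset_body(dataset_type, data_schema, query_column=None,
                       result_columns=None, normalize_urls=True):
    """Assemble a POST /datasets body, enforcing the rules from the table."""
    if dataset_type not in VALID_DATASET_TYPES:
        raise ValueError(f"datasetType must be one of {sorted(VALID_DATASET_TYPES)}")
    if not data_schema:
        raise ValueError("dataSchema is required")

    body = {
        "datasetType": dataset_type,
        "dataSchema": data_schema,  # passed through exactly as given
        "normalizeUrls": normalize_urls,
    }
    if dataset_type == "evaluation":
        # queryColumn and resultColumns are required for evaluation datasets.
        if not query_column or not result_columns:
            raise ValueError("evaluation datasets need queryColumn and resultColumns")
        body["queryColumn"] = query_column
        body["resultColumns"] = list(result_columns)
    return body


body = build_dataset_body("training", {"my_text": "text", "my_scores": "score"})
print(json.dumps(body, indent=2))
```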
DataSchema Details
The dataSchema object specifies the mapping of CSV column names to data types. Below are the details for the valid types:
Type | Description |
---|---|
image_pointer | A string representing the path or URL to the image. |
text | Can be any value, including strings or integers. |
score | An int or a float representing a numeric score associated with the entry. |
doc_id | A string used to identify documents (result_columns) in evaluation datasets. If left unspecified, Marqtune will generate unique ids based on sha256 hashes of the content. |
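To illustrate what a content-derived id looks like, the sketch below hashes the document fields with sha256. This is not Marqtune's actual procedure: which fields it hashes and how it combines them is not specified here, and `content_doc_id` and the separator are assumptions made purely for illustration.

```python
import hashlib


def content_doc_id(*fields):
    """Derive a deterministic id from document content (illustrative only)."""
    # Join fields with a separator unlikely to appear in the data, so that
    # ("ab", "c") and ("a", "bc") do not produce the same id.
    joined = "\x1f".join(str(field) for field in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


print(content_doc_id("path/to/image1.jpg", "This is a sample text"))
```

If your documents already have stable identifiers, supplying them in a doc_id column avoids depending on content hashes at all.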
Ensure that the dataSchema defined matches the structure and data types of your input CSV file.
The dataset file should be in CSV format and must follow the structure specified in the dataSchema.
Given the following dataSchema:
data_schema = {"my_image": "image_pointer", "my_text": "text", "my_scores": "score"}
the CSV file should look like this:
my_image,my_text,my_scores
path/to/image1.jpg,"This is a sample text",0.9
path/to/image2.jpg,"Another sample text",0.8
path/to/image3.jpg,"More text",0.95
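A quick local check that a CSV matches its dataSchema can save a failed validation run. The sketch below is a rough pre-flight check under stated assumptions: it only verifies that every schema column appears in the header and that score columns parse as numbers. `check_csv_against_schema` is a hypothetical helper, and Marqtune's own validation may apply further rules.

```python
import csv
import io


def check_csv_against_schema(csv_text, data_schema):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []

    missing = [col for col in data_schema if col not in header]
    if missing:
        problems.append(f"missing columns: {missing}")

    # Only score columns have a checkable type here; text accepts any value.
    score_cols = [col for col, kind in data_schema.items()
                  if kind == "score" and col in header]
    for row in reader:
        for col in score_cols:
            try:
                float(row[col])
            except (TypeError, ValueError):
                problems.append(
                    f"line {reader.line_num}: {col}={row[col]!r} is not numeric")
    return problems


schema = {"my_image": "image_pointer", "my_text": "text", "my_scores": "score"}
sample = (
    "my_image,my_text,my_scores\n"
    'path/to/image1.jpg,"This is a sample text",0.9\n'
    'path/to/image2.jpg,"Another sample text",0.8\n'
)
print(check_csv_against_schema(sample, schema))
```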
Example: Training Dataset
from marqtune.client import Client
from marqtune.enums import DatasetType

url = "https://marqtune.marqo.ai"
api_key = "{api_key}"
marqtune_client = Client(url=url, api_key=api_key)

# Map each CSV column to its data type.
data_schema = {
    "my_image": "image_pointer",
    "my_text": "text",
    "my_scores": "score",
}

# Creates the dataset and uploads the file in one call.
marqtune_client.create_dataset(
    file_path="path_to_file",
    dataset_name="dataset_name",
    dataset_type=DatasetType.TRAINING,
    data_schema=data_schema,
    wait_for_completion=True,
)
# Create a dataset.
curl -X POST 'https://marqtune.marqo.ai/datasets' \
  -H "Content-Type: application/json" \
  -H 'x-api-key: {api_key}' \
  -d '{
    "datasetType": "training",
    "dataSchema": [
      {
        "my_image": "image_pointer",
        "my_text": "text",
        "my_scores": "score"
      }
    ]
  }'
Example: Evaluation Dataset
from marqtune.client import Client
from marqtune.enums import DatasetType

url = "https://marqtune.marqo.ai"
api_key = "{api_key}"
marqtune_client = Client(url=url, api_key=api_key)

data_schema = {
    "my_image": "image_pointer",
    "my_text": "text",
    "my_query": "text",
    "my_scores": "score",  # Optional
    "docid": "doc_id",  # Optional - uniquely identifies the document (`my_image` + `my_text` in this example)
}
query_column = "my_query"
result_columns = [
    "my_image",
    "my_text",
]

marqtune_client.create_dataset(
    file_path="path_to_file",
    dataset_name="dataset_name",
    dataset_type=DatasetType.EVALUATION,
    data_schema=data_schema,
    query_column=query_column,
    result_columns=result_columns,
    wait_for_completion=True,
)
# Create a dataset.
# In the dataSchema, "my_scores" is optional when datasetType is evaluation.
# "docid" is optional and uniquely identifies the document (`my_image` + `my_text` in this example).
curl -X POST 'https://marqtune.marqo.ai/datasets' \
  -H "Content-Type: application/json" \
  -H 'x-api-key: {api_key}' \
  -d '{
    "datasetType": "evaluation",
    "dataSchema": [
      {
        "my_image": "image_pointer",
        "my_text": "text",
        "my_query": "text",
        "my_scores": "score",
        "docid": "doc_id"
      }
    ],
    "queryColumn": "my_query",
    "resultColumns": [
      "my_image",
      "my_text"
    ]
  }'
Response: 202 (Accepted)
The dataset creation task has been created and is now waiting for the file to be uploaded.
{
  "statusCode": 202,
  "body": {
    "uploadUrl": "upload_url",
    "datasetId": "datasetId"
  }
}
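For REST usage, the next step after a 202 is to send the CSV file to the returned uploadUrl. The sketch below shows one way to do that with the Python standard library. The HTTP method for the upload is not stated here; PUT is assumed because it is common for presigned URLs. The example URL and both helper names are hypothetical.

```python
import urllib.request


def extract_upload_target(response_json):
    """Pull datasetId and uploadUrl out of a 202 response shaped like the one above."""
    body = response_json["body"]
    return body["datasetId"], body["uploadUrl"]


def build_upload_request(upload_url, csv_bytes):
    """Prepare (but do not send) the upload request. PUT is an assumption."""
    return urllib.request.Request(upload_url, data=csv_bytes, method="PUT")


# Placeholder response; a real one comes from POST /datasets.
response = {
    "statusCode": 202,
    "body": {"uploadUrl": "https://example.com/upload", "datasetId": "datasetId"},
}
dataset_id, upload_url = extract_upload_target(response)
request = build_upload_request(upload_url, b"my_text,my_scores\nhello,0.9\n")
print(dataset_id, request.get_method(), request.full_url)
# To actually send the file: urllib.request.urlopen(request)
```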
Response: 400 (Bad Request)
Required parameters not present or body is incorrect.
{
  "statusCode": 400,
  "body": {
    "message": "Invalid arguments in request body or query parameters"
  }
}
Response: 400 (Invalid Request)
Request path or method is invalid.
{
  "statusCode": 400,
  "body": {
    "message": "Invalid request method"
  }
}
Response: 401 (Unauthorised)
Unauthorised. Check your API key and try again.
{
  "message": "Unauthorized."
}
Response: 500 (Internal server error)
Internal server error. Check your API key and try again.
{
  "message": "Internal server error."
}
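The error responses above come in two shapes: the 400 responses nest the message under body, while the 401 and 500 responses carry it at the top level. A client can handle both with a small helper. This is a sketch based only on the sample payloads shown here; `error_message` is a hypothetical name.

```python
def error_message(payload):
    """Return the error message from either response shape, or None if absent."""
    body = payload.get("body")
    if isinstance(body, dict):
        # 400-style payload: {"statusCode": ..., "body": {"message": ...}}
        return body.get("message")
    # 401/500-style payload: {"message": ...}
    return payload.get("message")


print(error_message({"statusCode": 400, "body": {"message": "Invalid request method"}}))
print(error_message({"message": "Unauthorized."}))
```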