Text Search with Marqo
Introduction
This guide will walk you through using Marqo to index and search a dataset from Simple Wikipedia. We'll break down the process step by step to make it easy to understand and follow along.
First, select your platform: Marqo Cloud or open source Marqo. Both are covered below, starting with Marqo Cloud.
Marqo Cloud
This section walks you through indexing and searching the Simple Wikipedia text dataset with Marqo Cloud.
Full code: text_search_cloud.py, included in the Full Code section at the bottom of this page.
If you have any questions or need help, visit our Community and ask in the get-help channel.
Getting Started
Before we begin, there are a few preliminary steps to ensure you have everything needed for this demo.
Step 1: Download the Dataset
First, download the Simple Wikipedia dataset. You can find it here: Simple Wikipedia Dataset.
Step 2: Obtain Marqo Cloud API Key
Next, we need to obtain our Marqo Cloud API key. For more information on how to obtain this, see our article on finding your Marqo API key. Once you have it, replace your_api_key in text_search_cloud.py with your actual key:
api_key = "your_api_key"
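If you prefer not to hard-code the key, you can read it from an environment variable instead. This is an optional sketch; the variable name MARQO_API_KEY is just an example and not something the script requires.
import os

# Optional: read the API key from an environment variable instead of hard-coding it.
# MARQO_API_KEY is an example name; use whichever variable name you prefer.
api_key = os.environ.get("MARQO_API_KEY", "your_api_key")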
Code Walkthrough
Let's now dive into the code. The script is broken down into several steps to make it easier to understand and manage.
Step 3: Import and Helper Functions
Before we start indexing, we need to set up our environment with the necessary imports and helper functions.
from marqo import Client
import json
import math
import numpy as np
import copy
import pprint
# Reads a JSON file and returns its content as a dictionary.
def read_json(filename: str) -> dict:
with open(filename, "r", encoding="utf-8") as f:
data = json.load(f)
return data
# Removes '- Wikipedia' from the title for better matching
def clean_data(data: dict) -> dict:
data["title"] = data["title"].replace("- Wikipedia", "")
# Convert docDate to string
data["docDate"] = str(data["docDate"])
return data
# Split larger documents
def split_big_docs(data, field="content", char_len=5e4):
new_data = []
for dat in data:
content = dat[field]
N = len(content)
if N >= char_len:
n_chunks = math.ceil(N / char_len)
new_content = np.array_split(list(content), n_chunks)
for _content in new_content:
new_dat = copy.deepcopy(dat)
new_dat[field] = "".join(_content)
new_data.append(new_dat)
else:
new_data.append(dat)
return new_data
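If you want to see what split_big_docs does before running it on the real dataset, here is a small sanity check on a made-up document with a deliberately tiny char_len. The toy values are purely illustrative.
# Sanity check on a toy document: 12 characters split with char_len=5
toy_docs = [{"title": "Toy", "content": "x" * 12}]
chunks = split_big_docs(toy_docs, field="content", char_len=5)
print(len(chunks))                           # 3 chunks (ceil(12 / 5))
print([len(c["content"]) for c in chunks])   # roughly equal sizes, e.g. [4, 4, 4]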
Step 4: Load the Data
After setting up our imports and helper functions, the next step is to load our dataset.
# Load dataset file - Change this to where your 'simplewiki.json' is located
dataset_file = "./starter-guides/text-search/simplewiki.json"
# Get the data
data = read_json(dataset_file)
# Clean up the title
data = [clean_data(d) for d in data]
data = split_big_docs(data)
# Take the first 100 entries of the dataset
N = 100  # Number of entries to take from the dataset
subset_data = data[:N]
print(f"loaded data with {len(data)} entries")
print(f"creating subset with {len(subset_data)} entries")
Step 5: Index the Data with Marqo
With our data loaded, we can now create an index in Marqo and add our documents to it.
# Replace this with your API Key
api_key = "your_api_key"
# Name your index
index_name = "text-search-cloud"
# Set up the Client
client = Client("https://api.marqo.ai", api_key=api_key)
# We create the index. Note if it already exists an error will occur
# as you cannot overwrite an existing index. For this reason, we delete
# any existing index
try:
client.delete_index(index_name)
except:
pass
# Create index
client.create_index(index_name, model="e5-base-v2")
# Add the subset of data to the index
responses = client.index(index_name).add_documents(
    subset_data, client_batch_size=50, tensor_fields=["title", "content"]
)
# Optionally take a look at the responses
pprint.pprint(responses)
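If you want a quick programmatic check rather than reading the raw responses, you can scan them for failed documents. Treat this as a sketch: with client_batch_size set, the Python client returns one response per batch, and the exact response schema (the 'errors' flag and per-item 'error' field assumed below) can differ between Marqo versions.
# Sketch: scan the batch responses for documents that failed to index.
# Assumes each batch response has an 'errors' flag and an 'items' list,
# where failed items carry an 'error' field - check your Marqo version's schema.
for batch_response in responses:
    if batch_response.get("errors"):
        for item in batch_response.get("items", []):
            if item.get("error"):
                print(f"Document {item.get('_id')} failed: {item['error']}")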
Step 6: Searching with Marqo
Now that our data is indexed, we can perform searches on it.
# Create a query
query = "what is air made of?"
# Obtain results for this query from the Marqo index
results = client.index(index_name).search(query)
# We can check the results - let's look at the top hit
pprint.pprint(results["hits"][0])
# We also get highlighting which tells us why this article was returned
pprint.pprint(results["hits"][0]["_highlights"])
# We can also use lexical search instead of tensor search
results = client.index(index_name).search(query, search_method="LEXICAL")
# We can check the lexical results - let's look at the top hit
pprint.pprint(results["hits"][0])
The top hit for the question "what is air made of?" returns:
Air refers to the Earth's atmosphere. Air is a mixture of many gases and tiny dust particles. It is the clear gas in which living things live and breathe. It has an indefinite shape and volume. It has mass and weight, because it is matter. The weight of air creates atmospheric pressure. There is no air in outer space.
Air is a mixture of about 78% of nitrogen, 21% of oxygen, 0.9% of argon, 0.04% of carbon dioxide, and very small amounts of other gases.
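You are not limited to the top hit. As a rough sketch, the search method also accepts a limit parameter, and each hit carries a relevance score in _score, so you can print a small leaderboard of results. The score scale depends on the search method used.
# Sketch: fetch a few hits at once and print their scores and titles.
results = client.index(index_name).search(query, limit=3)
for hit in results["hits"]:
    print(f"{hit['_score']:.4f}  {hit['title']}")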
Full Code
text_search_cloud.py
#####################################################
### STEP 1. Setting Up
#####################################################
# 1. Sign Up to Marqo Cloud: https://cloud.marqo.ai/authentication/register/
# 2. Get a Marqo API Key: https://www.marqo.ai/blog/finding-my-marqo-api-key
#####################################################
### STEP 2. Import and Define any Helper Functions
#####################################################
from marqo import Client
import json
import math
import numpy as np
import copy
import pprint
def read_json(filename: str) -> dict:
"""
Reads a JSON file and returns its content as a dictionary.
Args:
filename (str): The path to the JSON file.
Returns:
dict: The content of the JSON file as a dictionary.
"""
with open(filename, 'r', encoding='utf-8') as f:
data = json.load(f)
return data
def clean_data(data: dict) -> dict:
"""
Cleans the data by removing '- Wikipedia' from the title and converting docDate to a string.
Args:
data (dict): The input data dictionary with keys 'title' and 'docDate'.
Returns:
dict: The cleaned data dictionary.
"""
data['title'] = data['title'].replace('- Wikipedia', '')
data["docDate"] = str(data["docDate"])
return data
def split_big_docs(data, field='content', char_len=5e4):
"""
Splits large documents into smaller chunks based on a specified character length.
Args:
data (list): A list of dictionaries, each containing a 'content' field or specified field.
field (str, optional): The field name to check for length. Default is 'content'.
char_len (float, optional): The maximum character length for each chunk. Default is 5e4.
Returns:
list: A list of dictionaries, each containing a chunked version of the original content.
"""
new_data = []
for dat in data:
content = dat[field]
N = len(content)
if N >= char_len:
n_chunks = math.ceil(N / char_len)
new_content = np.array_split(list(content), n_chunks)
for _content in new_content:
new_dat = copy.deepcopy(dat)
new_dat[field] = ''.join(_content)
new_data.append(new_dat)
else:
new_data.append(dat)
return new_data
#####################################################
### STEP 3. Load the Data
#####################################################
# Load dataset file
# Change this to where your 'simplewiki.json' is located
dataset_file = "./starter-guides/text-search/simplewiki.json"
# Get the data
data = read_json(dataset_file)
# Clean up the title
data = [clean_data(d) for d in data]
data = split_big_docs(data)
# Take the first 100 entries of the dataset
N = 100  # Number of entries to take from the dataset
subset_data = data[:N]
print(f"loaded data with {len(data)} entries")
print(f"creating subset with {len(subset_data)} entries")
#####################################################
### STEP 4. Index Some Data with Marqo
#####################################################
# Replace this with your API Key
api_key = "your_api_key"
# Name your index
index_name = 'text-search-cloud'
# Set up the Client
client = Client(
"https://api.marqo.ai",
api_key=api_key
)
# We create the index. Note if it already exists an error will occur
# as you cannot overwrite an existing index. For this reason, we delete
# any existing index
try:
client.delete_index(index_name)
except:
pass
# Create index
client.create_index(
index_name,
model='e5-base-v2'
)
# Add the subset of data to the index
responses = client.index(index_name).add_documents(
subset_data,
client_batch_size=50,
tensor_fields=["title", "content"]
)
# Optionally take a look at the responses
# pprint.pprint(responses)
#####################################################
### STEP 5. Search with Marqo
#####################################################
# Create a query
query = 'what is air made of?'
# Obtain results for this query from the Marqo index
results = client.index(index_name).search(query)
# We can check the results - let's look at the top hit
pprint.pprint(results['hits'][0])
# We also get highlighting which tells us why this article was returned
pprint.pprint(results['hits'][0]['_highlights'])
# We use lexical search instead of tensor search
results = client.index(index_name).search(query, search_method='LEXICAL')
# We can check the lexical results - let's look at the top hit
pprint.pprint(results['hits'][0])
Marqo Open Source
This section walks you through indexing and searching the same Simple Wikipedia text dataset with open source Marqo running locally.
Full code: Text Search with Marqo Open Source Code
If you have any questions or need help, visit our Community and ask in the get-help channel.
Getting Started
Before we begin, there are a few preliminary steps to ensure you have everything needed for this demo.
Step 1: Download the Dataset
First, download the Simple Wikipedia dataset. You can find it here: Simple Wikipedia Dataset.
Make sure to place it into the directory where you will perform your coding.
Step 2: Start Marqo
Next, we need to get Marqo up and running. You can do this by executing the following commands in your terminal:
docker pull marqoai/marqo:latest
docker rm -f marqo
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
For more detailed instructions, check the Installation Guide.
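Before running the demo script, you may want to confirm that the container is actually reachable. A minimal sketch, assuming Marqo is listening on the default port 8882, is to list the existing indexes with the Python client; if the server is not up, this call raises a connection error.
from marqo import Client

# Minimal connectivity check against a local Marqo instance.
client = Client("http://localhost:8882")
print(client.get_indexes())  # should return the (possibly empty) list of indexes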
Step 3: Run the Demo Script
Once Marqo is running, you can execute the text_search_open_source.py script. You can find this script on our GitHub here or at the bottom of this page.
python3 text_search_open_source.py
Note: Indexing can take some time depending on your computer.
Code Walkthrough
Let's dive into the code. The script is broken down into several steps to make it easier to understand and manage.
Step 1: Start Marqo
This step assumes you have started the Marqo server as described above and in the Installation section.
Step 2: Import and Helper Functions
Before we start indexing, we need to set up our environment with the necessary imports and helper functions.
from marqo import Client
import json
import math
import numpy as np
import copy
import pprint
# Reads a JSON file and returns its content as a dictionary.
def read_json(filename: str) -> dict:
with open(filename, "r", encoding="utf-8") as f:
data = json.load(f)
return data
# Removes '- Wikipedia' from the title for better matching
def clean_data(data: dict) -> dict:
data["title"] = data["title"].replace("- Wikipedia", "")
# Convert docDate to string
data["docDate"] = str(data["docDate"])
return data
# Split larger documents
def split_big_docs(data, field="content", char_len=5e4):
new_data = []
for dat in data:
content = dat[field]
N = len(content)
if N >= char_len:
n_chunks = math.ceil(N / char_len)
new_content = np.array_split(list(content), n_chunks)
for _content in new_content:
new_dat = copy.deepcopy(dat)
new_dat[field] = "".join(_content)
new_data.append(new_dat)
else:
new_data.append(dat)
return new_data
Step 3: Load the Data
After setting up our imports and helper functions, the next step is to load our dataset.
# Load dataset file - Change this to where your 'simplewiki.json' is located
dataset_file = "./starter-guides/text-search/simplewiki.json"
# Get the data
data = read_json(dataset_file)
# Clean up the title
data = [clean_data(d) for d in data]
data = split_big_docs(data)
# Take the first 100 entries of the dataset
N = 100  # Number of entries to take from the dataset
subset_data = data[:N]
print(f"loaded data with {len(data)} entries")
print(f"creating subset with {len(subset_data)} entries")
Step 4: Index the Data with Marqo
With our data loaded, we can now create an index in Marqo and add our documents to it.
# Name your index
index_name = "text-search-open-source"
# Set up the Client
client = Client("http://localhost:8882")
# We create the index. Note if it already exists an error will occur
# as you cannot overwrite an existing index. For this reason, we delete
# any existing index
try:
client.delete_index(index_name)
except:
pass
# Create index
client.create_index(index_name, model="e5-base-v2")
# Add the subset of data to the index
responses = client.index(index_name).add_documents(
    subset_data, client_batch_size=50, tensor_fields=["title", "content"]
)
# Optionally take a look at the responses
pprint.pprint(responses)
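As an optional sanity check, you can ask the index how many documents it now contains. This is a sketch using the client's get_stats() call; the exact field names in the returned dictionary may vary between Marqo versions.
# Optional: confirm the document count after indexing.
stats = client.index(index_name).get_stats()
print(stats)  # expect a document count close to len(subset_data)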
Step 5: Searching with Marqo
Now that our data is indexed, we can perform searches on it.
# Create a query
query = "what is air made of?"
# Obtain results for this query from the Marqo index
results = client.index(index_name).search(query)
# We can check the results - let's look at the top hit
pprint.pprint(results["hits"][0])
# We also get highlighting which tells us why this article was returned
pprint.pprint(results["hits"][0]["_highlights"])
# We can also use lexical search instead of tensor search
results = client.index(index_name).search(query, search_method="LEXICAL")
# We can check the lexical results - let's look at the top hit
pprint.pprint(results["hits"][0])
The top hit for the question "what is air made of?" returns:
Air refers to the Earth's atmosphere. Air is a mixture of many gases and tiny dust particles. It is the clear gas in which living things live and breathe. It has an indefinite shape and volume. It has mass and weight, because it is matter. The weight of air creates atmospheric pressure. There is no air in outer space.
Air is a mixture of about 78% of nitrogen, 21% of oxygen, 0.9% of argon, 0.04% of carbon dioxide, and very small amounts of other gases.
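Since the script runs both a tensor search and a lexical search, a small follow-up sketch is to put the two top hits side by side; for some queries they will differ, which is a quick way to see how semantic and keyword matching behave.
# Sketch: compare the top hit from tensor search and lexical search.
tensor_results = client.index(index_name).search(query)
lexical_results = client.index(index_name).search(query, search_method="LEXICAL")
print("Tensor top hit: ", tensor_results["hits"][0]["title"])
print("Lexical top hit:", lexical_results["hits"][0]["title"])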
Full Code
GitHub Code: Text Search with Marqo Open Source Code
text_search_open_source.py
#####################################################
### STEP 1. Start Marqo
#####################################################
# 1. Marqo requires Docker. To install Docker go to Docker
# Docs and install for your operating system (Mac, Windows, Linux).
# 2. Once Docker is installed, you can use it to run Marqo.
# First, open the Docker application and then head to your
# terminal and enter the following:
"""
docker pull marqoai/marqo:latest
docker rm -f marqo
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
"""
#####################################################
### STEP 2. Import and Define any Helper Functions
#####################################################
from marqo import Client
import json
import math
import numpy as np
import copy
import pprint
def read_json(filename: str) -> dict:
"""
Reads a JSON file and returns its content as a dictionary.
Args:
filename (str): The path to the JSON file.
Returns:
dict: The content of the JSON file as a dictionary.
"""
with open(filename, 'r', encoding='utf-8') as f:
data = json.load(f)
return data
def clean_data(data: dict) -> dict:
"""
Cleans the data by removing '- Wikipedia' from the title and converting docDate to a string.
Args:
data (dict): The input data dictionary with keys 'title' and 'docDate'.
Returns:
dict: The cleaned data dictionary.
"""
data['title'] = data['title'].replace('- Wikipedia', '')
data["docDate"] = str(data["docDate"])
return data
def split_big_docs(data, field='content', char_len=5e4):
"""
Splits large documents into smaller chunks based on a specified character length.
Args:
data (list): A list of dictionaries, each containing a 'content' field or specified field.
field (str, optional): The field name to check for length. Default is 'content'.
char_len (float, optional): The maximum character length for each chunk. Default is 5e4.
Returns:
list: A list of dictionaries, each containing a chunked version of the original content.
"""
new_data = []
for dat in data:
content = dat[field]
N = len(content)
if N >= char_len:
n_chunks = math.ceil(N / char_len)
new_content = np.array_split(list(content), n_chunks)
for _content in new_content:
new_dat = copy.deepcopy(dat)
new_dat[field] = ''.join(_content)
new_data.append(new_dat)
else:
new_data.append(dat)
return new_data
#####################################################
### STEP 3. Load the Data
#####################################################
# Load dataset file
# Change this to where your 'simplewiki.json' is located
dataset_file = "./starter-guides/text-search/simplewiki.json"
# Get the data
data = read_json(dataset_file)
# Clean up the title
data = [clean_data(d) for d in data]
data = split_big_docs(data)
# Take the first 100 entries of the dataset
N = 100  # Number of entries to take from the dataset
subset_data = data[:N]
print(f"loaded data with {len(data)} entries")
print(f"creating subset with {len(subset_data)} entries")
#####################################################
### STEP 4. Index Some Data with Marqo
#####################################################
# Name your index
index_name = 'text-search-open-source'
# Set up the Client
client = Client("http://localhost:8882")
# We create the index. Note if it already exists an error will occur
# as you cannot overwrite an existing index. For this reason, we delete
# any existing index
try:
client.delete_index(index_name)
except:
pass
# Create index
client.create_index(
index_name,
model='e5-base-v2'
)
# Add the subset of data to the index
responses = client.index(index_name).add_documents(
subset_data,
client_batch_size=50,
tensor_fields=["title", "content"]
)
# Optionally take a look at the responses
# pprint.pprint(responses)
#####################################################
### STEP 5. Search with Marqo
#####################################################
# Create a query
query = 'what is air made of?'
# Obtain results for this query from the Marqo index
results = client.index(index_name).search(query)
# We can check the results - let's look at the top hit
pprint.pprint(results['hits'][0])
# We also get highlighting which tells us why this article was returned
pprint.pprint(results['hits'][0]['_highlights'])
# We use lexical search instead of tensor search
results = client.index(index_name).search(query, search_method='LEXICAL')
# We can check the lexical results - let's look at the top hit
pprint.pprint(results['hits'][0])