Indexing a Large Text File with Marqo
In cases where you want queries to return only a specific part of a document, it can make sense to deconstruct documents into components before indexing.
Understanding the Term 'Document' in Marqo
In the context of Marqo, a 'document' is an entry that is indexed and can be a variety of things: an image, an image-text pair, multiple paragraphs, a single sentence, etc. This can differ from the traditional sense of the word.
For this example, we will treat each sentence of a large text file as an individual 'document'. We'll refer to the original text file as 'source material' and the indexed entries as 'documents'.
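To make this concrete, here is a small illustrative sketch of what a few Marqo 'documents' could look like as Python dicts (the field names Title and Content are just our own choices for this example, not required names):
# Each dict below is one Marqo 'document'; field names are chosen for this example
example_documents = [
    {"Content": "A single sentence can be a document."},
    {"Content": "Two sentences grouped together. They form one searchable unit."},
    {"Title": "Chapter I", "Content": "A longer passage stored alongside a separate title field."},
]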
Document Size Considerations
Marqo defaults to a size limit of 100,000 bytes per document. Although this limit is adjustable, querying large documents may not be as efficient as querying smaller, component parts.
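As a rough, illustrative check against that limit (assuming the default of 100,000 bytes and approximating the document's size by its JSON-encoded length), you could verify a document before indexing:
import json

MAX_DOC_BYTES = 100_000  # Marqo's default per-document limit; adjustable in your deployment

def is_within_limit(doc: dict) -> bool:
    # Approximate the indexed size by the JSON-encoded byte length of the document
    return len(json.dumps(doc).encode("utf-8")) <= MAX_DOC_BYTES

print(is_within_limit({"Content": "A short sentence."}))  # True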
Walkthrough: Indexing a Large Text File
First, select your platform. The walkthrough below covers Marqo Cloud first; the open-source (Docker) steps follow after it.
Step 1: Get Marqo Cloud API Key
First, we need to obtain our Marqo Cloud API key. For more information on how to obtain this, visit our article. Once you have it, replace your_api_key with your actual API key:
api_key = "your_api_key"
Step 2: Retrieve and Process the Source Material
First, we gather the text we want to index. In this example, we're using "Alice in Wonderland" from Project Gutenberg.
import urllib.request
from nltk.tokenize import sent_tokenize
from nltk.util import ngrams
from typing import List
import nltk
nltk.download("punkt")
# Get the source material as a string
source_material = ""
for line in urllib.request.urlopen("https://www.gutenberg.org/cache/epub/11/pg11.txt"):
    source_material += line.decode("utf-8") + " "
# Define a function to process the text into documents
def process_text(text: str, n: int) -> List[str]:
    # Simplify whitespace and remove underscores (used in Gutenberg for formatting)
    text = " ".join(text.split()).replace("_", "")
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    # Join each n-gram of sentences into one document (sentences already end with punctuation)
    return [" ".join(gram) for gram in ngrams(sentences, n)]
# Process the source material into documents consisting of 2 sentences each
documents = process_text(source_material, 2)
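It can be worth sanity-checking the result before indexing; for example, a quick look at the document count and a sample chunk:
# Quick sanity check on the processed documents
print(f"Created {len(documents)} documents")
print(documents[0])   # first two-sentence chunk
print(documents[-1])  # last two-sentence chunk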
Step 3: Initialize the Marqo Client
Before indexing, initialize the Marqo client by specifying the Marqo Cloud URL and your API key.
import marqo
# Initialize the Marqo client
mq = marqo.Client(url="https://api.marqo.ai", api_key=api_key)
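To confirm the client can reach Marqo Cloud before indexing, you can list your existing indexes (a minimal check using the client's get_indexes method):
# List existing indexes to confirm the client is connected
print(mq.get_indexes())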
Step 4: Index the Documents
Now, we'll index the documents we've created. Because each document is only a small part of the source material, searches can return just the relevant passage.
print("Indexing, please wait...")
# Create an index called 'broken-down-source-material'
mq.create_index("broken-down-source-material")
# Add the documents to the index
mq.index("broken-down-source-material").add_documents(
[
{"Content": text, "_id": f"Alice_in_Wonderland_{idx}"}
for idx, text in enumerate(documents)
],
client_batch_size=64,
tensor_fields=["Content"],
)
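Once add_documents returns, you can confirm how many documents landed in the index; the Python client exposes an index stats call for this (exact response fields may vary by Marqo version):
# Confirm the number of indexed documents
stats = mq.index("broken-down-source-material").get_stats()
print(stats)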
Step 5: Search the Indexed Documents
Finally, let's search the index to find relevant documents. Here, we're looking for documents that mention something a caterpillar said.
import pprint
# Search the index for documents matching our query
results = mq.index("broken-down-source-material").search(
    q="I am after the things that are said by a Caterpillar", limit=2
)
# Print out the search results
pprint.pprint(results)
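The result is a dict whose hits list contains one entry per matching document. As a minimal sketch (assuming the default result fields), you could pull out just the content and relevance score of the best match:
# Inspect the best-matching document and its relevance score
top_hit = results["hits"][0]
print(top_hit["Content"])
print(top_hit["_score"])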
You can modify the script to fit your data and search requirements.
Step 1: Start Marqo
- To begin, run the Marqo Docker container. For more detailed instructions, see the getting started guide.
docker rm -f marqo; docker run --name marqo -it -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:2.0.0
- Run the indexing_a_large_text_file.py file, which can be found here. Please note that indexing may take some time depending on your system:
python3 indexing_a_large_text_file.py
Step 2: Retrieve and Process the Source Material
First, we gather the text we want to index. In this example, we're using "Alice in Wonderland" from Project Gutenberg.
import urllib.request
from nltk.tokenize import sent_tokenize
from nltk.util import ngrams
from typing import List
import nltk
nltk.download("punkt")
# Get the source material as a string
source_material = ""
for line in urllib.request.urlopen("https://www.gutenberg.org/cache/epub/11/pg11.txt"):
    source_material += line.decode("utf-8") + " "
# Define a function to process the text into documents
def process_text(text: str, n: int) -> List[str]:
    # Simplify whitespace and remove underscores (used in Gutenberg for formatting)
    text = " ".join(text.split()).replace("_", "")
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    # Join each n-gram of sentences into one document (sentences already end with punctuation)
    return [" ".join(gram) for gram in ngrams(sentences, n)]
# Process the source material into documents consisting of 2 sentences each
documents = process_text(source_material, 2)
Step 3: Initialize the Marqo Client
Before indexing, initialize the Marqo client by specifying the URL where Marqo is running.
import marqo
# Initialize the Marqo client
mq = marqo.Client(url="http://localhost:8882")
Step 4: Index the Documents
Now, we'll index the documents we've created. Because each document is only a small part of the source material, searches can return just the relevant passage.
print("Indexing, please wait...")
# Create an index called 'broken-down-source-material'
mq.create_index("broken-down-source-material")
# Add the documents to the index
mq.index("broken-down-source-material").add_documents(
[
{"Content": text, "_id": f"Alice_in_Wonderland_{idx}"}
for idx, text in enumerate(documents)
],
client_batch_size=64,
tensor_fields=["Content"],
)
Step 5: Search the Indexed Documents
Finally, let's search the index to find relevant documents. Here, we're looking for documents that mention something a caterpillar said.
import pprint
# Search the index for documents matching our query
results = mq.index("broken-down-source-material").search(
    q="I am after the things that are said by a Caterpillar", limit=2
)
# Print out the search results
pprint.pprint(results)
You can modify the script to fit your data and search requirements.
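When you are done experimenting, you can remove the index so it doesn't linger; a minimal cleanup step using the Python client:
# Remove the index once you are finished with it
mq.index("broken-down-source-material").delete()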