
Indexing a Large Text File

In cases where you want queries to return only a specific part of a document, it can make sense to deconstruct documents into smaller components before indexing.


Understanding the Term 'Document' in Marqo

In the context of Marqo, a 'document' is an entry that is indexed and can be a variety of things: an image, an image-text pair, multiple paragraphs, a single sentence, and so on. This can differ from the traditional sense of the word.

For this example, we will treat each sentence of a large text file as an individual 'document'. We'll refer to the original text file as 'source material' and the indexed entries as 'documents'.
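
For instance, once the source material has been split up, each indexed 'document' is simply a small dictionary of fields. The sentences below are illustrative placeholders rather than the exact output of the script:

# Two illustrative 'documents' derived from the source material
example_documents = [
    {"_id": "Alice_in_Wonderland_0", "Content": "Alice was beginning to get very tired of sitting by her sister on the bank."},
    {"_id": "Alice_in_Wonderland_1", "Content": "So she was considering in her own mind whether the pleasure of making a daisy-chain would be worth the trouble of getting up."},
]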

Document Size Considerations

Marqo defaults to a size limit of 100,000 bytes per document. Although this limit is adjustable, querying large documents may not be as efficient as querying smaller component parts.
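
If you do need to index larger documents, the limit is a server-side setting rather than something in the Python client. In recent Marqo versions it is controlled by the MARQO_MAX_DOC_BYTES environment variable passed to the container; treat the exact variable name and value below as an assumption to verify against your Marqo version:

# Assumed: raise the per-document limit to roughly 200 KB when starting the container
docker rm -f marqo;docker run --name marqo -it -p 8882:8882 -e MARQO_MAX_DOC_BYTES=200000 --add-host host.docker.internal:host-gateway marqoai/marqo:2.0.0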

Walkthrough: Indexing a Large Text File

Starting Up Marqo

  1. To begin, run the Marqo Docker container:

    docker rm -f marqo;docker run --name marqo -it -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:2.0.0
    
    For more detailed instructions, see the getting started guide.

  2. Run the indexing_a_large_text_file.py script, which can be found here. Please note that indexing may take some time depending on your system:

    python3 indexing_a_large_text_file.py
    

The Indexing Script Explained

In this section, we'll take a deeper dive into the indexing_a_large_text_file.py script. The script is broken down into four main steps to make it easier to follow along.

Step 1: Retrieve and Process the Source Material

First, we gather the text we want to index. In this example, we're using "Alice in Wonderland" from Project Gutenberg.

import urllib.request
from nltk.tokenize import sent_tokenize
from nltk.util import ngrams
from typing import List
import nltk

nltk.download("punkt")

# Get the source material as a string
source_material = ""
for line in urllib.request.urlopen("https://www.gutenberg.org/cache/epub/11/pg11.txt"):
    source_material += line.decode("utf-8") + " "


# Define a function to process the text into documents
def process_text(text: str, n: int) -> List[str]:
    # Simplify whitespace and remove underscores (used in Gutenberg for formatting)
    text = " ".join(text.split()).replace("_", "")
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    # Create n-grams of sentences to form our documents
    return [". ".join(gram) for gram in ngrams(sentences, n)]


# Process the source material into documents consisting of 2 sentences each
documents = process_text(source_material, 2)
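
To get a feel for what process_text produces, you can inspect the resulting list; each 'document' is just a short string containing two consecutive sentences:

# Quick inspection of the processed documents (values depend on the source text)
print(f"Number of documents: {len(documents)}")
print(documents[0])  # the first two-sentence chunk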

Step 2: Initialize the Marqo Client

Before indexing, initialize the Marqo client by specifying the URL where Marqo is running.

import marqo

# Initialize the Marqo client
mq = marqo.Client(url="http://localhost:8882")
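
As an optional sanity check, you can list the existing indexes; an exception here usually means Marqo is not reachable at the URL above:

# Optional connectivity check
print(mq.get_indexes())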

Step 3: Index the Documents

Now, we'll index the documents we've created. Each document is a small part of the source material, making it more searchable.

print("Indexing, please wait...")

# Create an index called 'broken-down-source-material'
mq.create_index("broken-down-source-material")

# Add the documents to the index
mq.index("broken-down-source-material").add_documents(
    [
        {"Content": text, "_id": f"Alice_in_Wonderland_{idx}"}
        for idx, text in enumerate(documents)
    ],
    client_batch_size=64,
    tensor_fields=["Content"],
)
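
Once add_documents returns, you can confirm how many entries made it into the index. The get_stats call below is a small sanity check; the exact fields in the returned dictionary depend on your Marqo version:

# Optional: check the number of documents in the index
stats = mq.index("broken-down-source-material").get_stats()
print(stats)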

Step 4: Search the Indexed Documents

Finally, let's search the index to find relevant documents. Here, we're looking for documents that mention something a caterpillar said.

import pprint

# Search the index for documents matching our query
results = mq.index("broken-down-source-material").search(
    q="I am after the things that are said by a Caterpillar", limit=2
)

# Print out the search results
pprint.pprint(results)
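
pprint shows the full raw response. If you only want the matched text, the response is a dictionary whose "hits" list contains each matching document along with its _id and _score; a minimal sketch of reading it:

# Print just the matched content, its id, and its relevance score
for hit in results["hits"]:
    print(f'{hit["_id"]} (score: {hit["_score"]:.3f})')
    print(hit["Content"])
    print()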

You can modify the script to fit your data and search requirements.

Full Code

indexing_a_large_text_file.py
"""
Note that this example requires the nltk library to be installed
"""

import marqo
import pprint
import urllib.request
from nltk.tokenize import sent_tokenize
from nltk.util import ngrams

from typing import List
import nltk

nltk.download("punkt")


#####################################################
### STEP 1. Get and process the data
#####################################################

print("Processing source material...")

# get all of Alice in Wonderland and put it into a string
source_material = ""
for line in urllib.request.urlopen("https://www.gutenberg.org/cache/epub/11/pg11.txt"):
    source_material += line.decode("utf-8") + " "

# check the size in bytes; this should print a notification that the text would be too long by default
material_bytes = len(source_material.encode("utf-8"))
if material_bytes > 100000:
    print(
        f"This document is {material_bytes} bytes which is larger than the default limit of 100000 bytes."
    )


def process_text(text: str, n: int) -> List[str]:
    """Simple text processing that converts source material into strings with n sentences (documents).

    Args:
        text (str): The text you wish to process
        n (int): The size of the ngrams

    Returns:
        List[str]: A list of strings with n sentences
    """
    # replace all white space with a single space
    text = " ".join(text.split())
    text = text.replace(
        "_", ""
    )  # Underscores are used as a formatting indicator in Gutenberg txt files
    sentences: List[str] = sent_tokenize(text)

    # return a list of string with n sentences
    return [". ".join(gram) for gram in ngrams(sentences, n)]


# convert the source material into a set of documents that are groups of 2 sentences
# there are many ways you could do this; how you do it will depend heavily on your source
# material and your usage of the results
# you can adjust n to see how it changes the outcomes
documents = process_text(source_material, 2)

#####################################################
### STEP 2. Start Marqo
#####################################################

# Follow the instructions here https://github.com/marqo-ai/marqo/tree/2.0.0

#####################################################
### STEP 3. Index the data
#####################################################

# NOTE: This step may take some time depending on your hardware. If you have CUDA available,
# you may adjust the code to use it as described here: https://docs.marqo.ai/0.0.16/using_marqo_with_a_gpu/#cuda

mq = marqo.Client(url="http://localhost:8882")

print("Indexing, please wait...")

mq.create_index("broken-down-source-material")

mq.index("broken-down-source-material").add_documents(
    [
        # by providing our own id for each entry we can track down
        # the location in the origin material if desired
        {"Content": text, "_id": f"Alice_in_Wonderland_{idx}"}
        for idx, text in enumerate(documents)
    ],
    client_batch_size=64,
    tensor_fields=["Content"],
)

#####################################################
### STEP 4. Search the data
#####################################################

results = mq.index("broken-down-source-material").search(
    q="I am after the things that are said by a Caterpillar", limit=2
)

pprint.pprint(results)

print("Done.")

# delete the index if done
# mq.delete_index("broken-down-source-material")