
Indexing a Large Text File

In cases where you want queries to return only a specific part of a document, it can make sense to deconstruct documents into components before indexing.

Disambiguating the Term 'Document'

Before we continue with this example we must disambiguate the term document. In Marqo, a document refers to a thing that gets indexed; this may be an image, an image and text pair, multiple paragraphs of text, a single sentence, or some other combination of the above.

For some use cases this can differ from the common usage of the word 'document'. This is the case for this example, where a large 'document' (a text file) is broken into sentences and each group of sentences is indexed as a separate document.

From here on, we will refer to the input text file as 'source material' and the entries that we put into the index as 'documents'.

Document Size Limits

By default, Marqo imposes a size limit of 100,000 bytes per document. While this limit can be adjusted, querying large documents is often less useful than querying parts of them.
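A quick way to check whether a piece of text would exceed the default limit before indexing. Note the limit applies to bytes, so the text is encoded first rather than counting characters (the helper name below is made up for illustration):

```python
def exceeds_default_limit(text: str, limit_bytes: int = 100_000) -> bool:
    # Marqo's limit is measured in bytes, so a multi-byte character
    # counts for more than one; encode before measuring
    return len(text.encode("utf-8")) > limit_bytes

print(exceeds_default_limit("a short sentence"))   # False
print(exceeds_default_limit("x" * 100_001))        # True
```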

Running the example

  1. Run Marqo:

    docker rm -f marqo;docker run --name marqo -it --privileged -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:latest   
    
    For more detailed instructions, check the getting started guide.

  2. Run the indexing_a_large_text_file.py script with the following command (note that indexing can take some time depending on your machine):

    python3 indexing_a_large_text_file.py
    

The full code is below, with examples of how to use the Python client.
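To illustrate the grouping approach used in the script, here is a minimal sketch of forming overlapping two-sentence windows with nltk's ngrams (the sentence strings are made up):

```python
from nltk.util import ngrams

sentences = ["Alice was beginning to get very tired.",
             "She peeped into the book.",
             "It had no pictures."]

# each window overlaps its neighbour by one sentence, so a query that
# matches across a sentence boundary can still land inside some document
windows = [' '.join(gram) for gram in ngrams(sentences, 2)]
for w in windows:
    print(w)
```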

Code

indexing_a_large_text_file.py
'''
Note that this example requires the nltk library (and its punkt tokenizer data) to be installed
'''

import marqo 
import pprint
import urllib.request
from nltk.tokenize import sent_tokenize
from nltk.util import ngrams

from typing import List 


#####################################################
### STEP 1. Get and process the data
#####################################################

print("Processing source material...")

# get all of Alice in Wonderland and put it into a string
source_material = ""
for line in urllib.request.urlopen("https://www.gutenberg.org/cache/epub/11/pg11.txt"):
    source_material += line.decode("utf-8") + " "

# check the size; this should print a notification that this text would be too large by default
# note the limit is in bytes, so we measure the encoded length rather than the character count
doc_bytes = len(source_material.encode("utf-8"))
if doc_bytes > 100000:
    print(f"This document is {doc_bytes} bytes which is larger than the default limit of 100000 bytes.")

def process_text(text: str, n: int) -> List[str]:
    """Simple text processing that converts source material into strings with n sentences (documents).

    Args:
        text (str): The text you wish to process
        n (int): The size of the ngrams

    Returns:
        List[str]: A list of strings with n sentences
    """
    # replace all white space with a single space
    text = " ".join(text.split())
    text = text.replace("_", "") # Underscores are used as a formatting indicator in Gutenberg txt files
    sentences: List[str] = sent_tokenize(text)

    # return a list of strings with n sentences each; the tokenized sentences
    # keep their own punctuation, so join with a space rather than '. '
    return [' '.join(gram) for gram in ngrams(sentences, n)]

# convert the source material into a set of documents that are groups of 2 sentences
# there are many ways you could do this; the best approach will depend heavily on your source
# material and your usage of the results
# you can adjust n to see how it changes the outcomes
documents = process_text(source_material, 2)

#####################################################
### STEP 2. Start Marqo
#####################################################

# Follow the instructions here https://github.com/marqo-ai/marqo

#####################################################
### STEP 3. Index the data
#####################################################

# NOTE: This step may take some time depending on your hardware. If you have CUDA available,
# you can adjust the code to use it as described here https://docs.marqo.ai/0.0.16/using_marqo_with_a_gpu/#cuda

mq = marqo.Client(url='http://localhost:8882')

print("Indexing, please wait...")

mq.create_index("broken-down-source-material")

mq.index("broken-down-source-material").add_documents(
    [
        # by providing our own id for each entry we can trace a result
        # back to its location in the source material if desired
        {'Content': text, '_id': f"Alice_in_Wonderland_{idx}"} for idx, text in enumerate(documents)
    ], 
    client_batch_size=64
)

#####################################################
### STEP 4. Search the data
#####################################################

results = mq.index("broken-down-source-material").search(
    q="I am after the things that are said by a Caterpillar", 
    limit=2
)

pprint.pprint(results)

print("Done.")

# delete the index if done
# mq.delete_index("broken-down-source-material")
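Each hit returned by search carries the document's fields alongside metadata such as `_id` and `_score`. A sketch of pulling out just the matched text, using a hypothetical response dict shaped like the Marqo Python client's output (the real values would come from the `search` call above):

```python
# a made-up search response for illustration; in practice this comes from
# results = mq.index("broken-down-source-material").search(q="...", limit=2)
results = {
    "hits": [
        {"Content": "Who are YOU? said the Caterpillar.",
         "_id": "Alice_in_Wonderland_412",
         "_score": 0.81},
        {"Content": "The Caterpillar and Alice looked at each other for some time in silence.",
         "_id": "Alice_in_Wonderland_407",
         "_score": 0.78},
    ]
}

# the _id we assigned during indexing lets us trace each hit
# back to its position in the source material
for hit in results["hits"]:
    print(f"{hit['_id']} (score {hit['_score']:.2f}): {hit['Content'][:50]}")
```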