GPT/LLM Question and Answering


Introduction

This guide will walk you through setting up a question-answering system using Marqo. We'll break down the code into smaller sections and explain each step to make the process more approachable.

Prerequisites

Before we begin, ensure you have the following:

  1. Docker installed on your machine.
  2. An API key for the LLM you wish to use (e.g., OpenAI).

Getting Started

Step 1: Clone the Repository

First, clone the Marqo repository. The code we'll be working with is in the examples/GPT-examples folder:

git clone --branch 2.0.0 https://github.com/marqo-ai/marqo.git

Step 2: Run Marqo

Next, set up Marqo using Docker:

docker rm -f marqo
docker pull marqoai/marqo:2.0.0
docker run --name marqo -it -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:2.0.0

For more detailed instructions, check the getting started guide.
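
Once the container is running, you can confirm Marqo is reachable before moving on. A minimal check from Python, assuming the requests package and the default port mapping used above:

import requests

# The root endpoint should return a small JSON welcome message once Marqo is ready;
# a connection error usually just means the container needs a few more seconds
print(requests.get("http://localhost:8882").json())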

Step 3: Set Up Your API Key

Obtain your API key and set it as an environment variable:

export OPENAI_API_KEY="your-api-key-here"
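
If you prefer a .env file (the full script below loads one via load_dotenv), a quick sanity check that the key is actually visible to Python:

import os

# Fail fast if the key was neither exported nor loaded from a .env file
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"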

Walkthrough Guide

Step 1: Set Up the Marqo Client and Index

Let's set up the Marqo client and create an index for our documents:

from marqo import Client

mq = Client()
index_name = "iron-docs"

# Optionally delete the index if it already exists
try:
    mq.index(index_name).delete()
except Exception:
    pass  # the index did not exist yet

# Create the index with custom settings
index_settings = {
    "model": "flax-sentence-embeddings/all_datasets_v4_MiniLM-L6",
    "normalizeEmbeddings": True,
    "textPreprocessing": {
        "splitLength": 3,
        "splitOverlap": 1,
        "splitMethod": "sentence"
    },
}
mq.create_index(index_name, settings_dict=index_settings)
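
The textPreprocessing settings control how each document is chunked before embedding: every chunk is three sentences long, and consecutive chunks share one sentence so context isn't lost at chunk boundaries. A rough illustration of the windowing idea (not Marqo's internal code):

# Illustration only: sentence windowing with splitLength=3, splitOverlap=1
sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
split_length, split_overlap = 3, 1
step = split_length - split_overlap
print([sentences[i:i + split_length] for i in range(0, len(sentences), step)])
# [['S1.', 'S2.', 'S3.'], ['S3.', 'S4.', 'S5.'], ['S5.']]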

Step 2: Load and Prepare Data

Load your data and prepare it for indexing:

from utilities import load_data

df = load_data()
documents = df.to_dict(orient='records')
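
load_data comes from the example's utilities module and returns a pandas DataFrame of product documentation. If you want to run the walkthrough against your own data, any list of dictionaries works, as long as each record carries the cleaned_text field that gets embedded in the next step (the records below are made-up placeholders):

# Hypothetical stand-in for load_data(): each record needs a "cleaned_text" field
documents = [
    {"_id": "doc-1", "cleaned_text": "The rated voltage of the appliance is 220-240 V."},
    {"_id": "doc-2", "cleaned_text": "Always unplug the appliance before filling the tank."},
]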

Step 3: Index the Data

Index the documents into Marqo:

# Index the documents
indexing = mq.index(index_name).add_documents(documents, tensor_fields=["cleaned_text"], client_batch_size=64)
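
With client_batch_size set, the client sends the documents in batches of 64 and returns one response per batch. It's worth checking for per-document failures before searching; a minimal check (exact response fields can vary between Marqo versions):

# Each batch response carries an "errors" flag plus per-document "items"
for batch in indexing:
    if batch.get("errors"):
        print("Some documents failed to index:", batch)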

Step 4: Perform a Search Query

Perform a search query to find relevant documents:

# Try a generic search
q = "what is the rated voltage"
results = mq.index(index_name).search(q)
print(results['hits'][0])
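
Each hit contains the original document fields plus search metadata. The most useful parts here are _highlights, which points at the best-matching chunk rather than the whole document, and _score, the relevance score (field names as of Marqo 2.x):

top_hit = results['hits'][0]
print(top_hit.get('_score'))       # relevance score for the hit
print(top_hit.get('_highlights'))  # the chunk(s) that best matched the query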

Step 5: Enhance Search with an LLM Chain

Use a LangChain LLMChain to turn the search results into a conversational answer:

from langchain_openai import OpenAI
from langchain.docstore.document import Document
from langchain.chains import LLMChain
from utilities import extract_text_from_highlights, qna_prompt

highlights, texts = extract_text_from_highlights(results, token_limit=150)
docs = [Document(page_content=f"Source [{ind}]:" + t) for ind, t in enumerate(texts)]
llm = OpenAI(temperature=0.9)
chain_qa = LLMChain(llm=llm, prompt=qna_prompt())
llm_results = chain_qa.invoke({"summaries": docs, "question": results['query']}, return_only_outputs=True)
print(llm_results['text'])
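
qna_prompt is defined in the example's utilities module and expects the summaries and question variables used above. The real template ships with the repository; purely for orientation, a simplified sketch of such a prompt might look like this:

from langchain.prompts import PromptTemplate

# Simplified stand-in for qna_prompt(); the actual template in utilities.py differs
template = (
    "Answer the question using only the numbered sources below, "
    "and cite them like [0].\n\n"
    "Sources:\n{summaries}\n\nQuestion: {question}\nAnswer:"
)
prompt = PromptTemplate(template=template, input_variables=["summaries", "question"])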

Step 6: Score and Present References

Finally, score the references and present them to the user:

from utilities import predict_ce, get_sorted_inds
import numpy as np
import pandas as pd

score_threshold = 0.20
top_k = 3
scores = predict_ce(llm_results['text'], texts)
inds = get_sorted_inds(scores)
scores = scores.cpu().numpy()
scores = [np.round(s[0], 2) for s in scores]
references = [(str(scores[i]), texts[i]) for i in inds[:top_k] if scores[i] > score_threshold]
df_ref = pd.DataFrame(references, columns=['score','sources'])
print(df_ref)
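
predict_ce scores each candidate source against the generated answer with a cross-encoder, so only sources the answer actually draws on survive the threshold. A rough sketch of the idea, assuming sentence-transformers and a generic MS MARCO checkpoint (the helper in utilities.py may use a different model; the .cpu().numpy() call above suggests it returns a torch tensor):

from sentence_transformers import CrossEncoder

# Illustration of cross-encoder scoring; not the exact helper from utilities.py
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(llm_results['text'], t) for t in texts]
raw_scores = model.predict(pairs)  # one relevance score per (answer, source) pair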

Full Code

product_q_n_a.py
from marqo import Client
import pandas as pd
import numpy as np

from langchain_openai import OpenAI
from langchain.docstore.document import Document
from langchain.chains import LLMChain

from dotenv import load_dotenv

from utilities import (
    load_data,
    extract_text_from_highlights,
    qna_prompt,
    predict_ce,
    get_sorted_inds
)

load_dotenv()

if __name__ == "__main__":

    #############################################################
    #       STEP 0: Install Marqo
    #############################################################

    # run the following docker commands from the terminal to start marqo
    # docker rm -f marqo
    # docker pull marqoai/marqo:2.0.0
    # docker run --name marqo -it -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:2.0.0

    #############################################################
    #       STEP 1: Setup Marqo
    #############################################################

    mq = Client()
    index_name = "iron-docs"

    # (optionally) delete the index if it already exists
    try:
        mq.index(index_name).delete()
    except Exception:
        pass  # the index did not exist yet

    # We can set specific settings for the index; if omitted, sensible defaults are used
    index_settings = {
        "model": "flax-sentence-embeddings/all_datasets_v4_MiniLM-L6",
        "normalizeEmbeddings": True,
        "textPreprocessing": {
            "splitLength": 3,
            "splitOverlap": 1,
            "splitMethod": "sentence"
        },
    }

    # create the index with custom settings
    mq.create_index(index_name, settings_dict=index_settings)

    #############################################################
    #       STEP 2: Load the data
    #############################################################

    df = load_data()

    # turn the data into a dict for indexing
    documents = df.to_dict(orient='records')

    #############################################################
    #       STEP 3: Index the data
    #############################################################

    # index the documents
    indexing = mq.index(index_name).add_documents(documents, tensor_fields=["cleaned_text"], client_batch_size=64)

    #############################################################
    #       STEP 4: Search the data
    #############################################################

    # try a generic search
    q = "what is the rated voltage"

    results = mq.index(index_name).search(q)
    print(results['hits'][0])

    #############################################################
    #       STEP 5: Make it chatty
    #############################################################

    highlights, texts = extract_text_from_highlights(results, token_limit=150)
    docs = [Document(page_content=f"Source [{ind}]:" + t) for ind, t in enumerate(texts)]
    llm = OpenAI(temperature=0.9)
    chain_qa = LLMChain(llm=llm, prompt=qna_prompt())
    llm_results = chain_qa.invoke({"summaries": docs, "question": results['query']}, return_only_outputs=True)
    print(llm_results['text'])

    #############################################################
    #       STEP 6: Score the references
    #############################################################

    score_threshold = 0.20
    top_k = 3
    scores = predict_ce(llm_results['text'], texts)
    inds = get_sorted_inds(scores)
    scores = scores.cpu().numpy()
    scores = [np.round(s[0], 2) for s in scores]
    references = [(str(scores[i]), texts[i]) for i in inds[:top_k] if scores[i] > score_threshold]
    df_ref = pd.DataFrame(references, columns=['score', 'sources'])
    print(df_ref)