Generative Search: Question Answering
Introduction
This guide will walk you through setting up a question-answering system using Marqo. We'll break down the code into smaller sections and explain each step to make the process more approachable.
Prerequisites
Before we begin, ensure you have the following:
- Docker installed on your machine.
- An API key for the LLM you wish to use (e.g., OpenAI).
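- Python 3 with the packages this example imports (marqo, pandas, numpy, langchain, langchain-openai, python-dotenv). The utilities module used throughout is provided in the examples/GPT-examples folder of the repository.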
Getting Started
Step 1: Clone the Repository
First, clone the Marqo repository. The code we'll be working with is in the examples/GPT-examples folder:
git clone --branch 2.0.0 https://github.com/marqo-ai/marqo.git
Step 2: Run Marqo
Next, set up Marqo using Docker:
docker rm -f marqo
docker pull marqoai/marqo:2.0.0
docker run --name marqo -it -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:2.0.0
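Once the container is running, you can verify that Marqo is reachable. A minimal check, assuming the default port 8882:

import urllib.request

# Marqo listens on port 8882 by default; any successful response means the server is up
print(urllib.request.urlopen("http://localhost:8882").read().decode())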
For more detailed instructions, check the getting started guide.
Step 3: Set Up Your API Key
Obtain your API key and set it as an environment variable:
export OPENAI_API_KEY="your-api-key-here"
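Alternatively, keep the key in a .env file in the working directory; the full script at the end of this guide loads it that way:

from dotenv import load_dotenv

# reads OPENAI_API_KEY=your-api-key-here from a local .env file into the environment
load_dotenv()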
Walkthrough Guide
Step 1: Set Up the Marqo Client and Index
Let's set up the Marqo client and create an index for our documents:
from marqo import Client
mq = Client()
index_name = "iron-docs"
# Optionally delete the index if it already exists
try:
    mq.index(index_name).delete()
except:
    pass
# Create the index with custom settings
index_settings = {
    "model": "e5-base-v2",
    "normalizeEmbeddings": True,
    "textPreprocessing": {
        "splitLength": 3,
        "splitOverlap": 1,
        "splitMethod": "sentence"
    },
}
mq.create_index(index_name, settings_dict=index_settings)
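The textPreprocessing settings control how each document is chunked into passages before embedding: every chunk is three sentences long, and consecutive chunks share one sentence. For example, a document with sentences s1 through s5 would be split into the chunks (s1, s2, s3) and (s3, s4, s5), so context is preserved across chunk boundaries.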
Step 2: Load and Prepare Data
Load your data and prepare it for indexing (run this from the examples/GPT-examples folder so that the utilities module can be imported):
from utilities import load_data
df = load_data()
documents = df.to_dict(orient='records')
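Each row of the dataframe becomes a flat dictionary. The only hard requirement for this example is a cleaned_text field, which is the field we embed in the next step. A minimal, hypothetical illustration of the expected shape (field names other than cleaned_text depend on the dataset):

# hypothetical example of a record produced by df.to_dict(orient='records')
documents = [
    {"_id": "doc-0", "cleaned_text": "The rated voltage of the appliance is 220-240 V."},
]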
Step 3: Index the Data
Index the documents into Marqo:
# Index the documents
indexing = mq.index(index_name).add_documents(documents, tensor_fields=["cleaned_text"], client_batch_size=64)
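Because client_batch_size is set, add_documents returns one response per client-side batch. It is worth confirming that nothing failed; a small sanity check, assuming the response format of the 2.x Python client:

# each batch response carries an "errors" flag and per-document statuses
for batch_response in indexing:
    if batch_response.get("errors"):
        print("Some documents failed to index:", batch_response)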
Step 4: Perform a Search Query
Perform a search query to find relevant documents:
# Try a generic search
q = "what is the rated voltage"
results = mq.index(index_name).search(q)
print(results['hits'][0])
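Each hit contains the stored document fields plus search metadata such as _score and _highlights; the highlights are what we hand to the LLM in the next step. Abbreviated, and with hypothetical values, a hit looks roughly like:

{
    "_id": "doc-0",
    "cleaned_text": "...",
    "_score": 0.83,
    "_highlights": [{"cleaned_text": "the rated voltage is ..."}]
}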
Step 5: Enhance Search with LLM Chain
Use an LLM chain to turn the top search results into a conversational answer:
from langchain_openai import OpenAI
from langchain.docstore.document import Document
from langchain.chains import LLMChain
from utilities import extract_text_from_highlights, qna_prompt
highlights, texts = extract_text_from_highlights(results, token_limit=150)
docs = [Document(page_content=f"Source [{ind}]:" + t) for ind, t in enumerate(texts)]
llm = OpenAI(temperature=0.9)
chain_qa = LLMChain(llm=llm, prompt=qna_prompt())
llm_results = chain_qa.invoke({"summaries": docs, "question": results['query']}, return_only_outputs=True)
print(llm_results['text'])
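The qna_prompt() helper lives in utilities; whatever its exact wording, it must be a prompt template whose input variables match the summaries and question keys passed to invoke. A simplified sketch of an equivalent template (not the exact prompt from the repo):

from langchain.prompts import PromptTemplate

# a stand-in for utilities.qna_prompt(); the repo's actual prompt text differs
template = (
    "Answer the question using only the numbered sources below. "
    "If the answer is not in the sources, say you do not know.\n\n"
    "Sources:\n{summaries}\n\nQuestion: {question}\nAnswer:"
)
prompt = PromptTemplate(template=template, input_variables=["summaries", "question"])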
Step 6: Score and Present References
Finally, score the references and present them to the user:
from utilities import predict_ce, get_sorted_inds
import numpy as np
import pandas as pd
score_threshold = 0.20
top_k = 3
scores = predict_ce(llm_results['text'], texts)
inds = get_sorted_inds(scores)
scores = scores.cpu().numpy()
scores = [np.round(s[0], 2) for s in scores]
references = [(str(np.round(scores[i], 2)), texts[i]) for i in inds[:top_k] if scores[i] > score_threshold]
df_ref = pd.DataFrame(references, columns=['score','sources'])
print(df_ref)
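Here predict_ce (also from utilities) scores each (answer, source) pair with a cross-encoder, which is why the scores come back as a torch tensor that we move to the CPU; sources that better support the generated answer score higher. A rough sketch of such a scorer using sentence-transformers, which may differ from the repo's implementation:

from sentence_transformers import CrossEncoder

def score_references(answer, texts, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    # higher score = the source text better supports the generated answer
    model = CrossEncoder(model_name)
    return model.predict([(answer, t) for t in texts], convert_to_tensor=True)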
Full Code
product_q_n_a.py
from marqo import Client
import pandas as pd
import numpy as np
from langchain_openai import OpenAI
from langchain.docstore.document import Document
from langchain.chains import LLMChain
from dotenv import load_dotenv
from utilities import (
    load_data,
    extract_text_from_highlights,
    qna_prompt,
    predict_ce,
    get_sorted_inds
)
load_dotenv()
if __name__ == "__main__":

    #############################################################
    # STEP 0: Install Marqo
    #############################################################

    # run the following docker commands from the terminal to start marqo
    # docker rm -f marqo
    # docker pull marqoai/marqo:2.0.0
    # docker run --name marqo -it -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:2.0.0

    #############################################################
    # STEP 1: Setup Marqo
    #############################################################

    mq = Client()
    index_name = "iron-docs"

    # (optionally) delete the index if it already exists
    try:
        mq.index(index_name).delete()
    except:
        pass

    # we can set some specific settings for the index. if they are not provided, sensible defaults are used
    index_settings = {
        "model": "e5-base-v2",
        "normalizeEmbeddings": True,
        "textPreprocessing": {
            "splitLength": 3,
            "splitOverlap": 1,
            "splitMethod": "sentence"
        },
    }

    # create the index with custom settings
    mq.create_index(index_name, settings_dict=index_settings)

    #############################################################
    # STEP 2: Load the data
    #############################################################

    df = load_data()

    # turn the data into a dict for indexing
    documents = df.to_dict(orient='records')

    #############################################################
    # STEP 3: Index the data
    #############################################################

    # index the documents
    indexing = mq.index(index_name).add_documents(documents, tensor_fields=["cleaned_text"], client_batch_size=64)

    #############################################################
    # STEP 4: Search the data
    #############################################################

    # try a generic search
    q = "what is the rated voltage"
    results = mq.index(index_name).search(q)
    print(results['hits'][0])

    #############################################################
    # STEP 5: Make it chatty
    #############################################################

    highlights, texts = extract_text_from_highlights(results, token_limit=150)
    docs = [Document(page_content=f"Source [{ind}]:" + t) for ind, t in enumerate(texts)]
    llm = OpenAI(temperature=0.9)
    chain_qa = LLMChain(llm=llm, prompt=qna_prompt())
    llm_results = chain_qa.invoke({"summaries": docs, "question": results['query']}, return_only_outputs=True)
    print(llm_results['text'])

    #############################################################
    # STEP 6: Score the references
    #############################################################

    score_threshold = 0.20
    top_k = 3
    scores = predict_ce(llm_results['text'], texts)
    inds = get_sorted_inds(scores)
    scores = scores.cpu().numpy()
    scores = [np.round(s[0], 2) for s in scores]
    references = [(str(np.round(scores[i], 2)), texts[i]) for i in inds[:top_k] if scores[i] > score_threshold]
    df_ref = pd.DataFrame(references, columns=['score', 'sources'])
    print(df_ref)