Using Structured Data Sets as a Knowledge Source in AWS Bedrock KnowledgeBase (Respect the Structure!)
In an earlier post, I took AWS Bedrock for a test drive, in an attempt to put the building blocks of Agentic and RAG use-cases together. The examples in that post used PDF files as data sources for the KnowledgeBase, and that works with a reasonable chunking strategy backed by the AWS OpenSearch vector store. How would a structured dataset such as CSV/JSON work with the default KnowledgeBase implementation? Will a cookie-cutter chunking strategy work well for structured data? This post explores the approach of using CSV/JSON (or any other structured type) as a source of knowledge in a RAG use-case (or any other pattern where a KnowledgeBase is used as a retriever). Specifically, I want to explore the meaning of “structure” for a RAG solution. Should the structure be somehow preserved during ingestion? Is there value in relying on each cohesive record in structured data when creating embeddings, or meta-data for those embeddings?
Supporting source code is available in this repo.
A knowledge source such as a PDF or text file, when used in an ingestion pipeline, can be chunked up in different ways:
- Using a fixed chunk length in tokens, with an optional token overlap, to create embedding records.
- Treating each page as an embedding record.
- Using recursive splitters to treat paragraphs as the source of embedding records.
In each of these strategies there is some semblance of “structure”, but it is all subjective and doesn’t inherently take its definition and values from the PDF itself. Typical meta-data in such cases consists of Source, AuthorName, PageNumber, Date and ChunkNumber, which any good document loader and splitter framework provides out of the box.
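To make the first of these strategies concrete, here is a minimal, framework-free sketch of fixed-length chunking with overlap. The word-based split, the function name and the sizes are illustrative only; Bedrock’s default strategy does this at the token level.

def chunk_fixed_length(text: str, chunk_size: int = 300, overlap: int = 30) -> list[str]:
    """Split text into fixed-size word chunks, with an overlap between
    consecutive chunks. Illustrative: real chunkers count tokens, not words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks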
Let’s take the case of a CSV, a structured data format. Using CSV as a RAG data-ingestion source should be commonplace. For example:
- Dumps from well-known tools such as ServiceNow, Jira, DynamoDB etc.
- Database/Warehouse output.
- Structured output from LLMs
Here is an example CSV dataset consisting of user reviews for a restaurant (present in /data/RawReviews_1k.csv in the repo). This dataset has been taken from the ‘Amazon Fine Food Reviews’ Kaggle dataset.
This file contains a well-defined set of fields — Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
While prompting/searching this dataset, I want to be able to:
- R1: Trace LLM outputs to specific user reviews
- R2: Use metadata filters to improve RAG, using values from some of the fields
- R3: Create vector embeddings for only some relevant fields
We will first explore the default, out-of-the-box path of using this CSV to create a KnowledgeBase. With AWS Bedrock KnowledgeBase, it is possible to provide a static meta-data file alongside each object in the source bucket. For example, if you have a file called A.csv, the metadata file must be named A.csv.metadata.json; the meta-data fields specified in it are used by the KnowledgeBase to create meta-data in the OpenSearch vector store. This metadata remains the same for all the “chunks” created from that file. So one option is to create multiple input CSV files, each grouping records for which the meta-data is consistent. For example, all 2011 reviews for “Misha Grill” can have this supporting meta-data file in the source bucket:
{
    "metadataAttributes": {
        "location": "Misha Grill",
        "year": 2011
    }
}
In the first attempt, I will use “RawReviews_1k.csv” as one single object in the source bucket:
and use the default chunking strategy (300-token chunks)
and use the console-provided AWS Bedrock KnowledgeBase playground. You can also run the RAG conversations using the simple Python script provided in the repo.
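As an aside, a RAG conversation equivalent to the console playground can also be driven with the retrieve_and_generate API of the bedrock-agent-runtime client. This is a minimal sketch, not the repo script itself; the knowledge base ID and model ARN below are placeholders to substitute with your own:

import boto3

client = boto3.client('bedrock-agent-runtime')

KB_ID = '<your-knowledge-base-id>'                                    # placeholder
MODEL_ARN = 'arn:aws:bedrock:us-east-1::foundation-model/<model-id>'  # placeholder

response = client.retrieve_and_generate(
    input={'text': 'Provide the best of chicken reviews where user has used positive adjectives in the review'},
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': KB_ID,
            'modelArn': MODEL_ARN,
        },
    },
)
print(response['output']['text'])           # the generated answer
for citation in response['citations']:      # the source chunks behind it
    for ref in citation['retrievedReferences']:
        print(ref['location'])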
Prompt: “Provide the best of chicken reviews where user has used positive adjectives in the review”
Output:
Evaluation:
- R1: Trace LLM outputs to specific user reviews → [NOT MET]: The only meta-data provided is the source file, and multiple reviews have been packed into the default chunk size of 300 tokens.
- R2: Use metadata filters to improve RAG, using values from some of the fields → [NOT MET]: There is no meta-data to filter with.
- R3: Create vector embeddings for only some relevant fields → [NOT MET]: The source chunk reveals that embeddings have been created from all fields.
The output is reasonable, and the LLM summarises it in a naturally conversational format.
Now, let’s use this KnowledgeBase only as a retriever (ref: src/kb_csvdataRAG.py):
import boto3

kbi = '<your-knowledge-base-id>'  # the Knowledge Base ID

default_retrieval_config = {
    "vectorSearchConfiguration": {
        "numberOfResults": 5
    }
}

def as_br_kb_retriever():
    client = boto3.client('bedrock-agent-runtime')
    while True:
        q_ = input("(q to quit): ")
        if q_ == 'q':
            break
        res = client.retrieve(
            knowledgeBaseId=kbi,
            retrievalConfiguration=default_retrieval_config,
            retrievalQuery={"text": q_}
        )
        # Print each retrieved chunk with its content and meta-data.
        for index, doc in enumerate(res['retrievalResults'], start=1):
            print("-" * 150)
            print(f"[{index}] {doc['content']}")
            print(f"{doc['metadata']}")
For a simple retrieval (without an LLM to summarise), the output is not very intuitive.
Now, to meet R1, R2 and R3 within the limitations of the AWS Bedrock knowledge source, one possible option was mentioned before, i.e. to create multiple datasets in such a way that a *.metadata.json can be supplied with each set where it remains consistent. What does that mean in the case of this CSV, where I need to choose the fields for embeddings and meta-data from the source file itself? It means adding a pre-processor to the ingestion pipeline that parses the main CSV and, driven by a provided definitions file, creates a CSV and a meta-data output file for each record. The main CSV in this case has about 1k records, so I will create 1k output files, load them into the source S3 bucket and then sync the knowledge base. The temporary 1k files can be cleaned up afterwards (as part of the ingestion pipeline).
Refer to the definitions file provided — data/knowledgebasemetadatprep.json:
{
    "csv": {
        "embeddingattributes": [
            "ProfileName",
            "Summary",
            "Text"
        ],
        "metadataattributes": [
            "ProductId",
            "UserId",
            "Time",
            "Id",
            "Score"
        ],
        "index_id": [
            "ProductId",
            "UserId",
            "Id"
        ]
    }
}
This is the processing file specific to the CSV presented in this example. It is used by a pre-processor script that creates:
- one CSV file for each record, containing only the “embeddingattributes”
- one *.csv.metadata.json file for each CSV produced, containing the “metadataattributes”
The concatenated index_id fields are used to create the unique file names.
For example:
B000G6RYNE_A1E2Y40TTPL8YK_464.csv
B000G6RYNE_A1E2Y40TTPL8YK_464.csv.metadata.json
So kb_csvdataprep.py creates these outputs in the /data/outputs folder and then uploads them to the KnowledgeBase source S3 bucket.
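Here is a minimal sketch of what such a pre-processor could look like; the actual kb_csvdataprep.py in the repo may differ, the bucket name is a placeholder, and the _cast helper is my own assumption (numeric meta-data fields such as Score and Time need to be numbers for range filters to work):

import csv
import json
from pathlib import Path

import boto3

DEFS = json.loads(Path('data/knowledgebasemetadatprep.json').read_text())['csv']
OUT_DIR = Path('data/outputs')
OUT_DIR.mkdir(parents=True, exist_ok=True)
BUCKET = '<your-kb-source-bucket>'  # placeholder

def _cast(value):
    # Cast numeric strings to int so range filters (e.g. on Score/Time) work.
    try:
        return int(value)
    except ValueError:
        return value

with open('data/RawReviews_1k.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        # Unique file name from the concatenated index_id fields.
        stem = '_'.join(row[k] for k in DEFS['index_id'])

        # One single-record CSV containing only the embedding attributes.
        with open(OUT_DIR / f'{stem}.csv', 'w', newline='', encoding='utf-8') as out:
            writer = csv.DictWriter(out, fieldnames=DEFS['embeddingattributes'])
            writer.writeheader()
            writer.writerow({k: row[k] for k in DEFS['embeddingattributes']})

        # The sidecar file Bedrock expects: <name>.csv.metadata.json
        meta = {'metadataAttributes': {k: _cast(row[k]) for k in DEFS['metadataattributes']}}
        (OUT_DIR / f'{stem}.csv.metadata.json').write_text(json.dumps(meta))

# Upload every generated file to the KnowledgeBase source bucket.
s3 = boto3.client('s3')
for path in OUT_DIR.iterdir():
    s3.upload_file(str(path), BUCKET, path.name)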
The KnowledgeBase data source for this S3 bucket is defined with the “No chunking” strategy, which means that each .csv file is itself a chunk and no further chunking is needed. In case the contents of the embedding fields in the CSV are huge, a chunking strategy can be adopted; the meta-data for the chunks would remain the same, but more fine-grained documents would be created in the vector DB.
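A data source with this strategy can be defined with boto3 along these lines; this is a sketch, with the knowledge base ID, data source name and bucket ARN as placeholders:

import boto3

agent = boto3.client('bedrock-agent')

response = agent.create_data_source(
    knowledgeBaseId='<your-knowledge-base-id>',  # placeholder
    name='csv-reviews-source',                   # placeholder
    dataSourceConfiguration={
        'type': 'S3',
        's3Configuration': {'bucketArn': 'arn:aws:s3:::<your-kb-source-bucket>'},
    },
    # 'NONE' tells Bedrock that each uploaded file is already a single chunk.
    vectorIngestionConfiguration={
        'chunkingConfiguration': {'chunkingStrategy': 'NONE'}
    },
)
print(response['dataSource']['dataSourceId'])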
After the KnowledgeBase pipeline is run, the sync operation on the data source pulls an equal number of CSV and meta-data files, and OpenSearch creates the index with the defined meta-data, so all custom-defined meta-data fields become filterable.
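The sync itself can also be triggered programmatically as an ingestion job; a minimal sketch, again with placeholder IDs:

import time

import boto3

agent = boto3.client('bedrock-agent')

KB_ID = '<your-knowledge-base-id>'   # placeholder
DS_ID = '<your-data-source-id>'      # placeholder

# Kick off the sync and poll until it reaches a terminal state.
job = agent.start_ingestion_job(knowledgeBaseId=KB_ID, dataSourceId=DS_ID)['ingestionJob']
while job['status'] not in ('COMPLETE', 'FAILED'):
    time.sleep(10)
    job = agent.get_ingestion_job(
        knowledgeBaseId=KB_ID,
        dataSourceId=DS_ID,
        ingestionJobId=job['ingestionJobId'],
    )['ingestionJob']
print(job['status'], job.get('statistics'))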
We have now been able to create meta-data from the source CSV itself.
Filterability is an important optimisation in a RAG use-case: it makes the vector database output more relevant, in order to augment the LLM with more specific context.
Let’s test this KnowledgeBase with the same prompt, with additional meta-data filters.
good_score_retrieval_config = {
    "vectorSearchConfiguration": {
        "numberOfResults": 10,
        "filter": {
            "andAll": [
                {
                    "greaterThanOrEquals": {
                        "key": "Score",
                        "value": 3
                    }
                },
                {
                    "greaterThan": {
                        "key": "Time",
                        "value": 1273152420
                    }
                }
            ]
        }
    }
}

def as_br_kb_retriever():
    client = boto3.client('bedrock-agent-runtime')
    while True:
        q_ = input("(q to quit): ")
        if q_ == 'q':
            break
        res = client.retrieve(
            knowledgeBaseId=kbi,
            retrievalConfiguration=good_score_retrieval_config,
            retrievalQuery={"text": q_}
        )
        # Print each retrieved chunk with its content and meta-data.
        for index, doc in enumerate(res['retrievalResults'], start=1):
            print("-" * 150)
            print(f"[{index}] {doc['content']}")
            print(f"{doc['metadata']}")
This retriever uses two additional meta-data filters: 1) filter by the Score field, and 2) only reviews after a given timestamp. These improve the quality of both retrieval and LLM generation.
Prompt: “Provide the best of chicken reviews where user has used positive adjectives in the review”
Output (Retrieve):
Output (Retrieve & Generate):
Evaluation:
- R1: Trace LLM outputs to specific user reviews → [MET]: Each review is now part of a distinct meta-data record.
- R2: Use metadata filters to improve RAG, using values from some of the fields → [MET]: All meta-data defined with “metadataattributes” can be used as custom filters before vector matching.
- R3: Create vector embeddings for only some relevant fields → [MET]: Only the fields defined in “embeddingattributes” land as embeddings.
So, dealing with structured data for RAG needs more attention than simple, raw chunking, in order to leverage the inherent regularity of the data. This provides several advantages:
- Control over the use of attributes as embeddings vs. meta-data
- Improved RAG output with meta-data filters
- Better traceability and citations
This post describes handling a structured dataset with AWS Bedrock, but the approach is relevant for any other structured type, with any other implementation of a RAG ingestion pipeline, and with any other vector datastore.
Thank You.