Graph RAG on Movie Reviews from Rotten Tomatoes¶
This notebook presents a basic case study for using graph RAG techniques to combine the power of retrieval-augmented generation (RAG) with knowledge graphs based on datasets that are linked to one another in a natural way.
In particular, we use the GraphRetriever implementation in LangChain. For more information, see the open-source Graph RAG project on GitHub.
The Dataset¶
The website Rotten Tomatoes has published a large dataset of movie reviews. The dataset includes two CSV files containing:
- the movie reviews, and
- information about the movies referenced in those reviews
The Challenge¶
In this case study, the challenge is to build a system that allows users to search movie review content using arbitrary prompts, and then return the top reviews together with the full information about the reviewed movies.
The Strategy¶
First, we build a standard RAG system for querying the movie reviews, which are embedded and stored in a vector database. It is important to note that in this step, we store the embedded reviews together with metadata that is necessary for traversing the knowledge graph and linking reviews with the movie data.
Second, we use a GraphRetriever that is configured specifically to:
- retrieve relevant movie reviews via standard RAG,
- traverse the knowledge graph edges to the relevant movies, and
- return the full movie data together with each movie review.
In this implementation, the metadata is the basis for the knowledge graph, and the mechanics of graph traversal are specified as part of the GraphRetriever configuration. In this way, changing the configuration of the GraphRetriever changes how graph edges are defined and how the implied knowledge graph is traversed. There is no need to modify the dataset or rebuild the knowledge graph beyond specifying a new GraphRetriever configuration.
See below for how to build this graph RAG system.
# install the required packages
%pip install \
dotenv \
pandas \
langchain_openai \
langchain-graph-retriever \
langchain-astradb
Environment Setup¶
This notebook uses the APIs for OpenAI and Astra DB.
NOTE: the environment variables for Astra DB are not required if running only the code with the small data sample below, but are required for the code below that works with the full dataset.
You can get an OpenAI API key here. And, more information about using the OpenAI API in Python can be found here.
Here are the instructions to set up a free Astra serverless database.
To connect to these services within this notebook, the following environment variables are required (or optional, as noted):
- OPENAI_API_KEY: Your OpenAI API key.
- ASTRA_DB_API_ENDPOINT: The Astra DB API endpoint.
- ASTRA_DB_APPLICATION_TOKEN: The Astra DB Application token.
- ASTRA_DB_KEYSPACE: Optional. If defined, specifies the Astra DB keyspace. If not defined, the default keyspace is used.
- LANGCHAIN_API_KEY: Optional. If defined, enables LangSmith tracing.
If running this notebook in Colab, configure these environment variables as Colab Secrets.
If running this notebook locally, make sure you have a .env file containing all of the required variables, and then use the dotenv package as below to load environment variables from that file. More details on dotenv can be found here.
from dotenv import load_dotenv
# load environment variables from the .env file
load_dotenv()
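After loading the environment, it can help to confirm that the expected variables are actually set before calling any API. The following is a minimal sketch, not part of the original notebook: the helper name `missing_env_vars` is hypothetical, and the variable names are the ones listed above.

```python
import os

# Variable names taken from the list above. The Astra DB variables are only
# needed when working with the full dataset.
REQUIRED_VARS = ["OPENAI_API_KEY"]
ASTRA_VARS = ["ASTRA_DB_API_ENDPOINT", "ASTRA_DB_APPLICATION_TOKEN"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS + ASTRA_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All expected environment variables are set.")
```

If only the small data sample is used, checking `REQUIRED_VARS` alone is enough.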
Loading the data¶
The Rotten Tomatoes dataset described above consists of two CSV files:
- rotten_tomatoes_movie_reviews.csv -- the movie reviews
- rotten_tomatoes_movies.csv -- information about the movies referenced in those reviews
Below, we first give a small sample dataset contained in this notebook, so that you can try this implementation of graph RAG without needing to download and process the full dataset from files.
Or, you can skip loading this data sample and proceed directly to "Loading the full dataset from file" below.
Loading a small data sample¶
Below is a sample dataset that is coded into this notebook as string objects and then read into pandas dataframes using StringIO.
import pandas as pd
from io import StringIO
reviews_data_string = """
id,reviewId,creationDate,criticName,isTopCritic,originalScore,reviewState,publicatioName,reviewText,scoreSentiment,reviewUrl
addams_family,2644238,2019-11-10,James Kendrick,False,3/4,fresh,Q Network Film Desk,captures the family's droll humor with just the right mixture of morbidity and genuine care,POSITIVE,http://www.qnetwork.com/review/4178
addams_family,2509777,2018-09-12,John Ferguson,False,4/5,fresh,Radio Times,A witty family comedy that has enough sly humour to keep adults chuckling throughout.,POSITIVE,https://www.radiotimes.com/film/fj8hmt/the-addams-family/
addams_family,26216,2000-01-01,Rita Kempley,True,,fresh,Washington Post,"More than merely a sequel of the TV series, the film is a compendium of paterfamilias Charles Addams's macabre drawings, a resurrection of the cartoonist's body of work. For family friends, it would seem a viewing is de rigueur mortis.",POSITIVE,http://www.washingtonpost.com/wp-srv/style/longterm/movies/videos/theaddamsfamilypg13kempley_a0a280.htm
the_addams_family_2019,2699537,2020-06-27,Damond Fudge,False,,fresh,"KCCI (Des Moines, IA)","As was proven by the 1992-93 cartoon series, animation is the perfect medium for this creepy, kooky family, allowing more outlandish escapades",POSITIVE,https://www.kcci.com/article/movie-review-the-addams-family/29443537
the_addams_family_2019,2662133,2020-01-21,Ryan Silberstein,False,,fresh,Cinema76,"This origin casts the Addams family as an immigrant story, and the film leans so hard into the theme of accepting those different from us and valuing diversity over conformity,",POSITIVE,https://www.cinema76.com/home/2019/10/11/the-addams-family-is-a-fun-update-to-an-iconic-american-clan
the_addams_family_2019,2661356,2020-01-17,Jennifer Heaton,False,5.5/10,rotten,Alternative Lens,...The film's simplistic and episodic plot put a major dampener on what could have been a welcome breath of fresh air for family animation.,NEGATIVE,https://altfilmlens.wordpress.com/2020/01/17/my-end-of-year-surplus-review-extravaganza-thing-2019/
the_addams_family_2,102657551,2022-02-16,Mat Brunet,False,4/10,rotten,AniMat's Review (YouTube),The Addams Family 2 repeats what the first movie accomplished by taking the popular family and turning them into one of the most boringly generic kids films in recent years.,NEGATIVE,https://www.youtube.com/watch?v=G9deslxPDwI
the_addams_family_2,2832101,2021-10-15,Sandie Angulo Chen,False,3/5,fresh,Common Sense Media,This serviceable animated sequel focuses on Wednesday's feelings of alienation and benefits from the family's kid-friendly jokes and road trip adventures.,POSITIVE,https://www.commonsensemedia.org/movie-reviews/the-addams-family-2
the_addams_family_2,2829939,2021-10-08,Emily Breen,False,2/5,rotten,HeyUGuys,"Lifeless and flat, doing a disservice to the family name and the talent who voice them. WIthout glamour, wit or a hint of a soul. A void. Avoid.",NEGATIVE,https://www.heyuguys.com/the-addams-family-2-review/
addams_family_values,102735159,2022-09-22,Sean P. Means,False,3/4,fresh,Salt Lake Tribune,Addams Family Values is a ghoulishly fun time. It would have been a real howl if the producers weren't too scared to go out on a limb in this twisted family tree.,POSITIVE,https://www.newspapers.com/clip/110004014/addams-family-values/
addams_family_values,102734540,2022-09-21,Jami Bernard,True,3.5/4,fresh,New York Daily News,"The title is apt. Using those morbidly sensual cartoon characters as pawns, the new movie Addams Family Values launches a witty assault on those with fixed ideas about what constitutes a loving family. ",POSITIVE,https://www.newspapers.com/clip/109964753/addams-family-values/
addams_family_values,102734521,2022-09-21,Jeff Simon,False,3/4,fresh,Buffalo News,"Addams Family Values has its moments -- rather a lot of them, in fact. You knew that just from the title, which is a nice way of turning Charles Addams' family of ghouls, monsters and vampires loose on Dan Quayle.",POSITIVE,https://buffalonews.com/news/quirky-values-the-addams-family-returns-with-a-bouncing-baby/article_2aafde74-da6c-5fa7-924a-76bb1a906d9c.html
"""
movies_data_string = """
id,title,audienceScore,tomatoMeter,rating,ratingContents,releaseDateTheaters,releaseDateStreaming,runtimeMinutes,genre,originalLanguage,director,writer,boxOffice,distributor,soundMix
addams_family,The Addams Family,66,67,,,1991-11-22,2005-08-18,99,Comedy,English,Barry Sonnenfeld,"Charles Addams,Caroline Thompson,Larry Wilson",$111.3M,Paramount Pictures,"Surround, Dolby SR"
the_addams_family_2019,The Addams Family,69,45,PG,"['Some Action', 'Macabre and Suggestive Humor']",2019-10-11,2019-10-11,87,"Kids & family, Comedy, Animation",English,"Conrad Vernon,Greg Tiernan","Matt Lieberman,Erica Rivinoja",$673.0K,Metro-Goldwyn-Mayer,Dolby Atmos
the_addams_family_2,The Addams Family 2,69,28,PG,"['Macabre and Rude Humor', 'Language', 'Violence']",2021-10-01,2021-10-01,93,"Kids & family, Comedy, Adventure, Animation",English,"Greg Tiernan,Conrad Vernon","Dan Hernandez,Benji Samit,Ben Queen,Susanna Fogel",$56.5M,Metro-Goldwyn-Mayer,
addams_family_reunion,Addams Family Reunion,33,,,,,,92,Comedy,English,Dave Payne,,,,
addams_family_values,Addams Family Values,63,75,,,1993-11-19,2003-08-05,93,Comedy,English,Barry Sonnenfeld,Paul Rudnick,$45.7M,"Argentina Video Home, Paramount Pictures","Surround, Dolby Digital"
"""
reviews_all = pd.read_csv(StringIO(reviews_data_string))
movies_all = pd.read_csv(StringIO(movies_data_string))
Pre-processing the data¶
First, we rename one column in each of the two dataframes so that we can use them later to build a knowledge graph.
# rename the id columns to more informative and useful names
reviews_data = reviews_all.rename(columns={"id": "reviewed_movie_id"})
movies_data = movies_all.rename(columns={"id": "movie_id"})
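Because the renamed columns will later define the graph edges, it may be worth checking that every review points at a known movie. The sketch below uses plain Python lists with IDs copied from the small data sample; the helper name `unmatched_ids` is an assumption for illustration and is not part of the notebook.

```python
def unmatched_ids(source_ids, target_ids):
    """Return the sorted source IDs that have no match among the target IDs."""
    return sorted(set(source_ids) - set(target_ids))


# IDs copied from the small data sample above
review_movie_ids = [
    "addams_family",
    "the_addams_family_2019",
    "the_addams_family_2",
    "addams_family_values",
]
movie_ids = [
    "addams_family",
    "the_addams_family_2019",
    "the_addams_family_2",
    "addams_family_reunion",
    "addams_family_values",
]

# reviews that point at a movie missing from the movie table
print(unmatched_ids(review_movie_ids, movie_ids))  # []
# movies that have no reviews in the sample
print(unmatched_ids(movie_ids, review_movie_ids))  # ['addams_family_reunion']
```

With the dataframes, the same helper should accept `reviews_data["reviewed_movie_id"]` and `movies_data["movie_id"]` directly, since both are iterable columns.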
Create the vector store, with embedding¶
Next, for the small data sample, we create an InMemoryVectorStore from LangChain using OpenAIEmbeddings() to embed the documents.
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
# create the vector store
vectorstore = InMemoryVectorStore(OpenAIEmbeddings())
Loading the full dataset from file¶
Before running this code, make sure you have downloaded (and extracted) the dataset from the link provided above. The data files should be in your working directory, or you will need to change the file paths below to match the locations of your files.
See the top of this notebook for links and information about the datasets.
import pandas as pd
# Change this to the path where you stored the data files. See the top of this
# notebook for links and information about the datasets.
DATA_PATH = "../../../../datasets/"
# read the datasets from CSV files
reviews_all = pd.read_csv(DATA_PATH + "rotten_tomatoes_movie_reviews.csv")
movies_all = pd.read_csv(DATA_PATH + "rotten_tomatoes_movies.csv")
print("Data is loaded from CSV.")
Pre-processing the data¶
First, we rename one column in each of the two dataframes so that we can use them later to build a knowledge graph.
# rename the id columns to more informative and useful names
reviews_all = reviews_all.rename(columns={"id": "reviewed_movie_id"})
movies_all = movies_all.rename(columns={"id": "movie_id"})
Next, let's have a look at the movies that have the most reviews, and take a subset of the reviews to save time in this demo.
# Here, we limit our dataset to the movies with the most reviews. This is simply
# to save data processing and loading time while testing things in this notebook.
N_TOP_MOVIES = 10
most_reviewed_movies = reviews_all["reviewed_movie_id"].value_counts()[:N_TOP_MOVIES]
most_reviewed_movies
# subset the data to only reviews and movies corresponding to the most reviewed movies
reviews_data = reviews_all[
reviews_all["reviewed_movie_id"].isin(most_reviewed_movies.index)
]
movies_data = movies_all[movies_all["movie_id"].isin(most_reviewed_movies.index)]
Create the vector store, with embedding¶
Next, for the full dataset, we create an AstraDBVectorStore from LangChain using OpenAIEmbeddings() to embed the documents.
from langchain_astradb import AstraDBVectorStore
from langchain_openai import OpenAIEmbeddings
COLLECTION = "movie_reviews_rotten_tomatoes"
vectorstore = AstraDBVectorStore(
embedding=OpenAIEmbeddings(),
collection_name=COLLECTION,
pre_delete_collection=True,
)
Convert data to Document objects and store them¶
Next, we convert both movies and movie reviews into LangChain Document objects. The content of each document, which is embedded into vectors, is configured to be the movie review text (for review documents) or the movie title (for movie documents). All remaining information is saved as metadata on each document.
Note that to save time in this demo, we limit the dataset to include only the movies that have the most reviews.
from langchain_core.documents import Document
documents = []
# convert each movie into a LangChain document
for index, row in movies_data.iterrows():
    content = str(row["title"])
    metadata = row.fillna("").astype(str).to_dict()
    metadata["doc_type"] = "movie_info"
    document = Document(page_content=content, metadata=metadata)
    documents.append(document)
# convert each movie review into a LangChain document
for index, row in reviews_data.iterrows():
    content = str(row["reviewText"])
    metadata = row.drop("reviewText").fillna("").astype(str).to_dict()
    metadata["doc_type"] = "movie_review"
    document = Document(page_content=content, metadata=metadata)
    documents.append(document)
# check the total number of documents
print("There are", len(documents), "total Documents")
# let's inspect the structure of a document
from pprint import pprint
pprint(documents[0].metadata)
# add documents to the store
vectorstore.add_documents(documents)
# NOTE: this may take some minutes to load many documents
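When loading many documents, one option is to add them in smaller batches so that progress is visible while the load runs. The helper below is a generic sketch under that assumption; the name `iter_batches` is hypothetical, and it is shown here on a plain list rather than on the actual documents.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# demonstrate on a small list of placeholder items
example_items = list(range(7))
for batch in iter_batches(example_items, 3):
    print(len(batch), batch)
# 3 [0, 1, 2]
# 3 [3, 4, 5]
# 1 [6]
```

In this notebook, each batch of `documents` could be passed to `vectorstore.add_documents(batch)` in place of the single call above, printing a running count after each batch.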
Setting up the GraphRetriever¶
The GraphRetriever operates on top of the vector store, using document metadata to traverse the implicit knowledge graph as defined by the edges parameter in the GraphRetriever configuration.
Edges are specified as directed pairs of metadata fields. In the example below, the edge configuration
edges = [("reviewed_movie_id", "movie_id")]
specifies that there is a directed graph edge between two documents whenever the reviewed_movie_id of the first document matches the movie_id of the second, and graph traversal proceeds along these directed edges. In this case, all of our edges lead from a document containing a movie review to a document containing information about the movie.
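To make the edge definition concrete, here is a toy sketch of metadata matching using plain dictionaries in place of LangChain documents. It only illustrates the matching rule described above; it is not how the GraphRetriever is implemented, and the function name `outgoing_edges` is hypothetical.

```python
def outgoing_edges(doc, candidates, edges):
    """Return the candidates reachable from doc via (source_field, target_field) pairs.

    A candidate is linked when its target_field equals doc's source_field.
    """
    linked = []
    for source_field, target_field in edges:
        value = doc.get(source_field)
        if value is None:
            # this document has no value for the source field, so no edge starts here
            continue
        linked.extend(c for c in candidates if c.get(target_field) == value)
    return linked


edges = [("reviewed_movie_id", "movie_id")]

# metadata values copied from the small data sample
review = {"doc_type": "movie_review", "reviewed_movie_id": "addams_family"}
movies = [
    {"doc_type": "movie_info", "movie_id": "addams_family", "title": "The Addams Family"},
    {"doc_type": "movie_info", "movie_id": "addams_family_values", "title": "Addams Family Values"},
]

# the review links to exactly one movie
print([m["title"] for m in outgoing_edges(review, movies, edges)])  # ['The Addams Family']
# the edge is directed: a movie has no reviewed_movie_id, so nothing leads back
print(outgoing_edges(movies[0], [review], edges))  # []
```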
The strategy parameter of the GraphRetriever configuration determines how the graph is traversed, starting with the initial documents retrieved and proceeding along the directed edges to adjacent documents.
In the example below, the configuration
strategy=Eager(start_k=10, adjacent_k=10, select_k=100, max_depth=1)
uses the following steps:
- it initially retrieves start_k=10 documents using pure vector search,
- then traverses graph edges from the initial documents to adjacent documents (a max of adjacent_k),
- it repeats traversal from the new documents until reaching max_depth=1,
- it returns both the initial documents and documents retrieved during traversal, up to a maximum of select_k documents.
Note that in this simple example, each movie review has a graph edge leading to exactly one movie, so each initial document (a movie review) should have one edge to traverse to another document (a movie) at a depth of 1. Each movie document has no outgoing edges to traverse, so the traversal would not proceed beyond depth 1 regardless of the value of max_depth. We demonstrate deeper and more complex strategies in other examples.
For more details, see the documentation on GraphRetriever strategy.
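The traversal steps above can be pictured with a small toy model. This sketch is an assumption-laden simplification, not the library's Eager implementation: it uses plain dictionaries, takes the starting documents as given instead of running a vector search, and omits `adjacent_k` and similarity ranking. The function name `eager_traverse` and the `id` field are hypothetical.

```python
def eager_traverse(start_docs, all_docs, edges, select_k, max_depth):
    """Toy breadth-first traversal along metadata edges from the start documents."""
    selected = list(start_docs)
    seen = {doc["id"] for doc in start_docs}
    frontier = list(start_docs)
    for _ in range(max_depth):
        next_frontier = []
        for doc in frontier:
            for source_field, target_field in edges:
                value = doc.get(source_field)
                if value is None:
                    continue  # no outgoing edge of this kind
                for candidate in all_docs:
                    if candidate.get(target_field) == value and candidate["id"] not in seen:
                        seen.add(candidate["id"])
                        next_frontier.append(candidate)
        if not next_frontier:
            break  # nothing new was reached, so deeper levels are empty too
        selected.extend(next_frontier)
        frontier = next_frontier
    # initial documents come first, then traversed ones, capped at select_k
    return selected[:select_k]


edges = [("reviewed_movie_id", "movie_id")]
docs = [
    {"id": "r1", "reviewed_movie_id": "addams_family"},
    {"id": "r2", "reviewed_movie_id": "addams_family"},
    {"id": "r3", "reviewed_movie_id": "addams_family_values"},
    {"id": "m1", "movie_id": "addams_family"},
    {"id": "m2", "movie_id": "addams_family_values"},
]
start = docs[:2]  # stand-in for the documents returned by vector search

result = eager_traverse(start, docs, edges, select_k=100, max_depth=1)
print([d["id"] for d in result])  # ['r1', 'r2', 'm1']

# movies have no outgoing edges, so a larger max_depth changes nothing
deeper = eager_traverse(start, docs, edges, select_k=100, max_depth=5)
print([d["id"] for d in deeper])  # ['r1', 'r2', 'm1']
```

In this toy model, a `select_k` smaller than the number of starting documents would leave no room for the movie documents, which is one reason to keep `select_k` comfortably above `start_k`.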
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever
retriever = GraphRetriever(
store=vectorstore,
edges=[("reviewed_movie_id", "movie_id")],
strategy=Eager(start_k=10, adjacent_k=10, select_k=100, max_depth=1),
)
INITIAL_PROMPT_TEXT = "What are some good family movies?"
# INITIAL_PROMPT_TEXT = "What are some recommendations of exciting action movies?"
# INITIAL_PROMPT_TEXT = "What are some classic movies with amazing cinematography?"
# invoke the query
query_results = retriever.invoke(INITIAL_PROMPT_TEXT)
# print the raw retrieved results
for result in query_results:
    print(result.metadata["doc_type"], ": ", result.page_content)
    print(result.metadata)
    print()
Compile Graph RAG results¶
Now that we have completed graph retrieval, we can reformat the text and metadata in the results, so we can pass them to an LLM---via an augmented prompt---and generate a response to the initial prompt question.
# collect the movie info for each film retrieved
compiled_results = {}
for result in query_results:
    if result.metadata["doc_type"] == "movie_info":
        movie_id = result.metadata["movie_id"]
        movie_title = result.metadata["title"]
        compiled_results[movie_id] = {
            "movie_id": movie_id,
            "movie_title": movie_title,
            "reviews": {},
        }
# go through the results a second time, collecting the retrieved reviews for
# each of the movies
for result in query_results:
    if result.metadata["doc_type"] == "movie_review":
        reviewed_movie_id = result.metadata["reviewed_movie_id"]
        # skip reviews whose movie document was not among the results
        if reviewed_movie_id not in compiled_results:
            continue
        review_id = result.metadata["reviewId"]
        review_text = result.page_content
        compiled_results[reviewed_movie_id]["reviews"][review_id] = review_text
# compile the retrieved movies and reviews into a string that we can pass to an
# LLM in an augmented prompt
formatted_text = ""
for movie_id, movie_entry in compiled_results.items():
    formatted_text += "\n\n Movie Title: "
    formatted_text += movie_entry["movie_title"]
    formatted_text += "\n Movie ID: "
    formatted_text += movie_entry["movie_id"]
    for review_id, review_text in movie_entry["reviews"].items():
        formatted_text += "\n Review: "
        formatted_text += review_text
print(formatted_text)
Get an AI summary of results¶
Here, using the formatted_text from above, we set up a prompt template, and then pass it the retrieved movie reviews along with the original query text to be answered.
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pprint import pprint
MODEL = ChatOpenAI(model="gpt-4o", temperature=0)
VECTOR_ANSWER_PROMPT = PromptTemplate.from_template("""
A list of Movie Reviews appears below. Please answer the Initial Prompt text
(below) using only the listed Movie Reviews.
Please include all movies that might be helpful to someone looking for movie
recommendations.
Initial Prompt:
{initial_prompt}
Movie Reviews:
{movie_reviews}
""")
formatted_prompt = VECTOR_ANSWER_PROMPT.format(
initial_prompt=INITIAL_PROMPT_TEXT,
movie_reviews=formatted_text,
)
result = MODEL.invoke(formatted_prompt)
# print(formatted_prompt)
print(result.content)