Graph RAG on Movie Reviews from Rotten Tomatoes¶

This notebook presents a basic case study for using graph RAG techniques to combine the power of retrieval-augmented generation (RAG) with knowledge graphs based on datasets that are linked to one another in a natural way.

In particular, we use the GraphRetriever implementation in LangChain. For more information, see the open-source Graph RAG project on GitHub.

The Dataset¶

The website Rotten Tomatoes has published a large dataset of movie reviews. The dataset includes two CSV files containing:

the movie reviews, and
information about the movies referenced in those reviews

The Challenge¶

In this case study, the challenge is to build a system that allows users to search movie review content using arbitrary prompts, and then return the top reviews together with the full information about the reviewed movies.

The Strategy¶

First, we build a standard RAG system for querying the movie reviews, which are embedded and stored in a vector database. It is important to note that in this step, we store the embedded reviews together with metadata that is necessary for traversing the knowledge graph and linking reviews with the movie data.

Second, we use a GraphRetriever that is configured specifically to:

retrieve relevant movie reviews via standard RAG,
traverse the knowledge graph edges to the relevant movies, and
return the full movie data together with each movie review.

In this implementation, the metadata is the basis for the knowledge graph, and the mechanics of graph traversal are specified as part of the GraphRetriever. As a result, changing the GraphRetriever configuration changes both how graph edges are defined and how the implied knowledge graph is traversed. There is no need to modify the dataset or rebuild the knowledge graph; specifying a new GraphRetriever configuration is enough.
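
To make the idea of metadata-defined edges concrete, here is a minimal, library-free sketch. It is not the GraphRetriever implementation: the `doc_id` field and the `implied_edges` helper are hypothetical names used only for illustration, and documents are represented as plain dictionaries. It shows that the same stored metadata yields a different graph when only the edge specification changes.

```python
# Hypothetical stand-ins for stored documents: plain dicts of metadata.
# The "doc_id" field is an assumption added for illustration only.
docs = [
    {"doc_id": "r1", "doc_type": "movie_review", "reviewed_movie_id": "addams_family"},
    {"doc_id": "r2", "doc_type": "movie_review", "reviewed_movie_id": "the_addams_family_2"},
    {"doc_id": "m1", "doc_type": "movie_info", "movie_id": "addams_family"},
    {"doc_id": "m2", "doc_type": "movie_info", "movie_id": "the_addams_family_2"},
]


def implied_edges(docs, edge_spec):
    """Return (source_id, target_id) pairs implied by a (source_field, target_field) spec."""
    source_field, target_field = edge_spec
    edges = []
    for source in docs:
        value = source.get(source_field)
        if value is None:
            continue  # this document has no outgoing edge under this spec
        for target in docs:
            if target is not source and target.get(target_field) == value:
                edges.append((source["doc_id"], target["doc_id"]))
    return edges


# Same data, two different edge specifications, two different graphs.
review_to_movie = implied_edges(docs, ("reviewed_movie_id", "movie_id"))
movie_to_review = implied_edges(docs, ("movie_id", "reviewed_movie_id"))
print(review_to_movie)
print(movie_to_review)
```

Reversing the pair reverses the direction of every edge, without touching the stored documents.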

See below for how to build this graph RAG system.

# install the required packages
# (the `dotenv` module imported below is provided by the python-dotenv package)
%pip install \
        python-dotenv \
        pandas \
        langchain_openai \
        langchain-graph-retriever \
        langchain-astradb

Environment Setup¶

This notebook uses the APIs for OpenAI and Astra DB.

NOTE: the environment variables for Astra DB are not required if running only the code with the small data sample below, but are required for the code below that works with the full dataset.

You can get an OpenAI API key here. And, more information about using the OpenAI API in Python can be found here.

Here are the instructions to set up a free Astra serverless database.

To connect to these services within this notebook, the following environment variables are required (or optional, as noted):

OPENAI_API_KEY: Your OpenAI API key.
ASTRA_DB_API_ENDPOINT: The Astra DB API endpoint.
ASTRA_DB_APPLICATION_TOKEN: The Astra DB Application token.
ASTRA_DB_KEYSPACE: Optional. If defined, will specify the Astra DB keyspace. If not defined, will use the default.
LANGCHAIN_API_KEY: Optional. If defined, will enable LangSmith tracing.

If running this notebook in Colab, configure these environment variables as Colab Secrets.

If running this notebook locally, make sure you have a .env file containing all of the required variables, and then use the dotenv package as below to load environment variables from that file. More details on dotenv can be found here.

from dotenv import load_dotenv

# load environment variables from the .env file
load_dotenv()
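
After loading the environment, it can help to confirm that the required variables are actually present before calling any service. The following is a minimal sketch using only the standard library; the `missing_variables` helper is a hypothetical name, and the variable lists simply mirror the ones described above. Recall that the Astra DB variables are only needed for the full dataset.

```python
import os

# Variable names taken from the list above; adjust if you only run the small sample.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_APPLICATION_TOKEN",
]


def missing_variables(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_variables(REQUIRED_VARS)
if missing:
    print("Missing required environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```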

Loading the data¶

The Rotten Tomatoes dataset of movie reviews consists of two CSV files:

rotten_tomatoes_movie_reviews.csv -- the movie reviews
rotten_tomatoes_movies.csv -- information about the movies referenced in those reviews

Below, we first give a small sample dataset contained in this notebook, so that you can try this implementation of graph RAG without needing to download and process the full dataset from files.

Or, you can skip loading this data sample and proceed directly to "Loading the full dataset from file" below.

Loading a small data sample¶

Below is a sample dataset that is coded into this notebook as string objects and then read into pandas dataframes using StringIO.
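
The StringIO pattern can be seen in isolation with the standard library's csv module. This is only a sketch of the same idea without pandas: the two rows are excerpted from the movie sample used in this notebook, and StringIO wraps the string in a file-like object so that a CSV reader can consume it. Note how the quoted field containing a comma is kept as a single value.

```python
import csv
from io import StringIO

# Two rows excerpted from the movie sample; the quoted field contains a comma.
sample_csv = '''id,title,director
the_addams_family_2019,The Addams Family,"Conrad Vernon,Greg Tiernan"
addams_family_values,Addams Family Values,Barry Sonnenfeld
'''

# StringIO makes the in-memory string look like an open text file.
rows = list(csv.DictReader(StringIO(sample_csv)))

for row in rows:
    print(row["id"], "|", row["director"])
```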

import pandas as pd
from io import StringIO

reviews_data_string = """
id,reviewId,creationDate,criticName,isTopCritic,originalScore,reviewState,publicatioName,reviewText,scoreSentiment,reviewUrl
addams_family,2644238,2019-11-10,James Kendrick,False,3/4,fresh,Q Network Film Desk,captures the family's droll humor with just the right mixture of morbidity and genuine care,POSITIVE,http://www.qnetwork.com/review/4178
addams_family,2509777,2018-09-12,John Ferguson,False,4/5,fresh,Radio Times,A witty family comedy that has enough sly humour to keep adults chuckling throughout.,POSITIVE,https://www.radiotimes.com/film/fj8hmt/the-addams-family/
addams_family,26216,2000-01-01,Rita Kempley,True,,fresh,Washington Post,"More than merely a sequel of the TV series, the film is a compendium of paterfamilias Charles Addams's macabre drawings, a resurrection of the cartoonist's body of work. For family friends, it would seem a viewing is de rigueur mortis.",POSITIVE,http://www.washingtonpost.com/wp-srv/style/longterm/movies/videos/theaddamsfamilypg13kempley_a0a280.htm
the_addams_family_2019,2699537,2020-06-27,Damond Fudge,False,,fresh,"KCCI (Des Moines, IA)","As was proven by the 1992-93 cartoon series, animation is the perfect medium for this creepy, kooky family, allowing more outlandish escapades",POSITIVE,https://www.kcci.com/article/movie-review-the-addams-family/29443537
the_addams_family_2019,2662133,2020-01-21,Ryan Silberstein,False,,fresh,Cinema76,"This origin casts the Addams family as an immigrant story, and the film leans so hard into the theme of accepting those different from us and valuing diversity over conformity,",POSITIVE,https://www.cinema76.com/home/2019/10/11/the-addams-family-is-a-fun-update-to-an-iconic-american-clan
the_addams_family_2019,2661356,2020-01-17,Jennifer Heaton,False,5.5/10,rotten,Alternative Lens,...The film's simplistic and episodic plot put a major dampener on what could have been a welcome breath of fresh air for family animation.,NEGATIVE,https://altfilmlens.wordpress.com/2020/01/17/my-end-of-year-surplus-review-extravaganza-thing-2019/
the_addams_family_2,102657551,2022-02-16,Mat Brunet,False,4/10,rotten,AniMat's Review (YouTube),The Addams Family 2 repeats what the first movie accomplished by taking the popular family and turning them into one of the most boringly generic kids films in recent years.,NEGATIVE,https://www.youtube.com/watch?v=G9deslxPDwI
the_addams_family_2,2832101,2021-10-15,Sandie Angulo Chen,False,3/5,fresh,Common Sense Media,This serviceable animated sequel focuses on Wednesday's feelings of alienation and benefits from the family's kid-friendly jokes and road trip adventures.,POSITIVE,https://www.commonsensemedia.org/movie-reviews/the-addams-family-2
the_addams_family_2,2829939,2021-10-08,Emily Breen,False,2/5,rotten,HeyUGuys,"Lifeless and flat, doing a disservice to the family name and the talent who voice them. WIthout glamour, wit or a hint of a soul. A void. Avoid.",NEGATIVE,https://www.heyuguys.com/the-addams-family-2-review/
addams_family_values,102735159,2022-09-22,Sean P. Means,False,3/4,fresh,Salt Lake Tribune,Addams Family Values is a ghoulishly fun time. It would have been a real howl if the producers weren't too scared to go out on a limb in this twisted family tree.,POSITIVE,https://www.newspapers.com/clip/110004014/addams-family-values/
addams_family_values,102734540,2022-09-21,Jami Bernard,True,3.5/4,fresh,New York Daily News,"The title is apt. Using those morbidly sensual cartoon characters as pawns, the new movie Addams Family Values launches a witty assault on those with fixed ideas about what constitutes a loving family. ",POSITIVE,https://www.newspapers.com/clip/109964753/addams-family-values/
addams_family_values,102734521,2022-09-21,Jeff Simon,False,3/4,fresh,Buffalo News,"Addams Family Values has its moments -- rather a lot of them, in fact. You knew that just from the title, which is a nice way of turning Charles Addams' family of ghouls, monsters and vampires loose on Dan Quayle.",POSITIVE,https://buffalonews.com/news/quirky-values-the-addams-family-returns-with-a-bouncing-baby/article_2aafde74-da6c-5fa7-924a-76bb1a906d9c.html
"""

movies_data_string = """
id,title,audienceScore,tomatoMeter,rating,ratingContents,releaseDateTheaters,releaseDateStreaming,runtimeMinutes,genre,originalLanguage,director,writer,boxOffice,distributor,soundMix
addams_family,The Addams Family,66,67,,,1991-11-22,2005-08-18,99,Comedy,English,Barry Sonnenfeld,"Charles Addams,Caroline Thompson,Larry Wilson",$111.3M,Paramount Pictures,"Surround, Dolby SR"
the_addams_family_2019,The Addams Family,69,45,PG,"['Some Action', 'Macabre and Suggestive Humor']",2019-10-11,2019-10-11,87,"Kids & family, Comedy, Animation",English,"Conrad Vernon,Greg Tiernan","Matt Lieberman,Erica Rivinoja",$673.0K,Metro-Goldwyn-Mayer,Dolby Atmos
the_addams_family_2,The Addams Family 2,69,28,PG,"['Macabre and Rude Humor', 'Language', 'Violence']",2021-10-01,2021-10-01,93,"Kids & family, Comedy, Adventure, Animation",English,"Greg Tiernan,Conrad Vernon","Dan Hernandez,Benji Samit,Ben Queen,Susanna Fogel",$56.5M,Metro-Goldwyn-Mayer,
addams_family_reunion,Addams Family Reunion,33,,,,,,92,Comedy,English,Dave Payne,,,,
addams_family_values,Addams Family Values,63,75,,,1993-11-19,2003-08-05,93,Comedy,English,Barry Sonnenfeld,Paul Rudnick,$45.7M,"Argentina Video Home, Paramount Pictures","Surround, Dolby Digital"
"""

reviews_all = pd.read_csv(StringIO(reviews_data_string))
movies_all = pd.read_csv(StringIO(movies_data_string))

Pre-processing the data¶

First, we rename one column in each of the two dataframes so that we can use them later to build a knowledge graph.

# rename the id columns to more informative and useful names
reviews_data = reviews_all.rename(columns={"id": "reviewed_movie_id"})
movies_data = movies_all.rename(columns={"id": "movie_id"})

Create the vector store, with embedding¶

Next, for the small data sample, we create an InMemoryVectorStore from LangChain using OpenAIEmbeddings() to embed the documents.

from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# create the vector store
vectorstore = InMemoryVectorStore(OpenAIEmbeddings())

Loading the full dataset from file¶

Before running this code, make sure you have downloaded (and extracted) the dataset from the link provided above. The data files should be in your working directory; otherwise, change the file paths below to match the locations of your files.

See the top of this notebook for links and information about the datasets.

import pandas as pd

# Change this to the path where you stored the data files. See the top of this
# notebook for links and information about the datasets.
DATA_PATH = "../../../../datasets/"

# read the datasets from CSV files
reviews_all = pd.read_csv(DATA_PATH + "rotten_tomatoes_movie_reviews.csv")
movies_all = pd.read_csv(DATA_PATH + "rotten_tomatoes_movies.csv")

print("Data is loaded from CSV.")

Pre-processing the data¶

First, we rename one column in each of the two dataframes so that we can use them later to build a knowledge graph.

# rename the id columns to more informative and useful names
reviews_all = reviews_all.rename(columns={"id": "reviewed_movie_id"})
movies_all = movies_all.rename(columns={"id": "movie_id"})

Next, let's have a look at the movies that have the most reviews, and take a subset of the reviews to save time in this demo.

# Here, we limit our dataset to the movies with the most reviews. This is simply
# to save data processing and loading time while testing things in this notebook.
N_TOP_MOVIES = 10
most_reviewed_movies = reviews_all["reviewed_movie_id"].value_counts()[:N_TOP_MOVIES]

most_reviewed_movies

# subset the data to only reviews and movies corresponding to the most reviewed movies
reviews_data = reviews_all[
    reviews_all["reviewed_movie_id"].isin(most_reviewed_movies.index)
]
movies_data = movies_all[movies_all["movie_id"].isin(most_reviewed_movies.index)]

Create the vector store, with embedding¶

Next, for the full dataset, we create an AstraDBVectorStore from LangChain using OpenAIEmbeddings() to embed the documents.

from langchain_astradb import AstraDBVectorStore
from langchain_openai import OpenAIEmbeddings

COLLECTION = "movie_reviews_rotten_tomatoes"
vectorstore = AstraDBVectorStore(
    embedding=OpenAIEmbeddings(),
    collection_name=COLLECTION,
    pre_delete_collection=True,
)

Convert data to `Document` objects and store them¶

Next, we convert both movies and movie reviews into LangChain Document objects. The content of each document---which is embedded into vectors---is configured to be the movie review text (for review documents) or the movie title (for movie documents). All remaining information is saved as metadata on each document.

Note that to save time in this demo, we limit the dataset to include only the movies that have the most reviews.

from langchain_core.documents import Document

documents = []

# convert each movie into a LangChain document; the movie title is the
# embedded content, and all columns are kept as string metadata
for index, row in movies_data.iterrows():
    content = str(row["title"])
    metadata = row.fillna("").astype(str).to_dict()
    metadata["doc_type"] = "movie_info"
    document = Document(page_content=content, metadata=metadata)
    documents.append(document)


# convert each movie review into a LangChain document; the review text is the
# embedded content, and the remaining columns are kept as string metadata
for index, row in reviews_data.iterrows():
    content = str(row["reviewText"])
    metadata = row.drop("reviewText").fillna("").astype(str).to_dict()
    metadata["doc_type"] = "movie_review"
    document = Document(page_content=content, metadata=metadata)
    documents.append(document)


# check the total number of documents
print("There are", len(documents), "total Documents")

# let's inspect the structure of a document
from pprint import pprint

pprint(documents[0].metadata)

# add documents to the store
vectorstore.add_documents(documents)

# NOTE: this may take some minutes to load many documents

Setting up the GraphRetriever¶

The GraphRetriever operates on top of the vector store, using document metadata to traverse the implicit knowledge graph as defined by the edges parameter in GraphRetriever configuration.

Edges are specified as directed pairs of metadata fields. In the example below, the edge configuration

edges = [("reviewed_movie_id", "movie_id")]

specifies that there is a directed graph edge between two documents whenever the reviewed_movie_id of the first document matches the movie_id of the second---and graph traversal proceeds along these directed edges. In this case, all of our edges lead from a document containing a movie review to a document containing information about the movie.

The strategy parameter of the GraphRetriever configuration determines how the graph is traversed, starting with the initial documents retrieved and proceeding along the directed edges to adjacent documents.

In the example below, the configuration

strategy=Eager(start_k=10,
               adjacent_k=10,
               select_k=100,
               max_depth=1)

uses the following steps:

it initially retrieves start_k=10 documents using pure vector search,
then traverses graph edges from the initial documents to adjacent documents (a max of adjacent_k),
it repeats traversal from the new documents until reaching max_depth=1,
it returns both the initial documents and documents retrieved during traversal, up to a maximum of select_k documents.

Note that in this simple example, each movie review has a graph edge leading to exactly one movie, so each initial document (a movie review) should have one edge to traverse to another document (a movie) at a depth of 1. Also, each movie document has no outgoing edges to traverse, so the traversal would not proceed beyond depth 1 regardless of the value of max_depth. We demonstrate deeper and more complex strategies in other examples.
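
The traversal steps described above can be sketched in plain Python. This is a simplified illustration and not the library's Eager implementation: the `eager_traversal` function, the `out_edges` dictionary, and the document IDs are all hypothetical, the initial vector search is assumed to have already produced `start_ids`, and `adjacent_k` is treated here as a cap on new documents per depth level, which is a simplifying assumption.

```python
def eager_traversal(start_ids, out_edges, adjacent_k, select_k, max_depth):
    """Simplified breadth-first sketch of an eager traversal.

    start_ids: IDs from the initial vector search (already limited to start_k).
    out_edges: dict mapping a document ID to the IDs its directed edges point to.
    """
    selected = list(start_ids)
    seen = set(start_ids)
    frontier = list(start_ids)
    for _depth in range(max_depth):
        next_frontier = []
        for doc_id in frontier:
            for neighbor in out_edges.get(doc_id, []):
                # add each unseen neighbor once, up to adjacent_k per level
                if neighbor not in seen and len(next_frontier) < adjacent_k:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        if not next_frontier:
            break  # nothing left to traverse, whatever max_depth allows
        selected.extend(next_frontier)
        frontier = next_frontier
    # return the initial and traversed documents, capped at select_k
    return selected[:select_k]


# Reviews point to movies; movies have no outgoing edges.
out_edges = {
    "review_1": ["movie_a"],
    "review_2": ["movie_a"],
    "review_3": ["movie_b"],
}
start_ids = ["review_1", "review_2", "review_3"]

shallow = eager_traversal(start_ids, out_edges, adjacent_k=10, select_k=100, max_depth=1)
deep = eager_traversal(start_ids, out_edges, adjacent_k=10, select_k=100, max_depth=5)
print(shallow)
print(shallow == deep)
```

Because movie documents have no outgoing edges, raising max_depth does not change the result in this sketch. A small select_k, by contrast, can cut off the movie documents entirely, since the initial reviews come first in the returned list.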

For more details, see the documentation on GraphRetriever strategy.

from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

retriever = GraphRetriever(
    store=vectorstore,
    edges=[("reviewed_movie_id", "movie_id")],
    strategy=Eager(start_k=10, adjacent_k=10, select_k=100, max_depth=1),
)

INITIAL_PROMPT_TEXT = "What are some good family movies?"
# INITIAL_PROMPT_TEXT = "What are some recommendations of exciting action movies?"
# INITIAL_PROMPT_TEXT = "What are some classic movies with amazing cinematography?"


# invoke the query
query_results = retriever.invoke(INITIAL_PROMPT_TEXT)

# print the raw retrieved results
for result in query_results:
    print(result.metadata["doc_type"], ": ", result.page_content)
    print(result.metadata)
    print()

Compile Graph RAG results¶

Now that we have completed graph retrieval, we can reformat the text and metadata in the results, so we can pass them to an LLM---via an augmented prompt---and generate a response to the initial prompt question.

# collect the movie info for each film retrieved
compiled_results = {}
for result in query_results:
    if result.metadata["doc_type"] == "movie_info":
        movie_id = result.metadata["movie_id"]
        movie_title = result.metadata["title"]
        compiled_results[movie_id] = {
            "movie_id": movie_id,
            "movie_title": movie_title,
            "reviews": {},
        }

# go through the results a second time, collecting the retrieved reviews for
# each of the movies
for result in query_results:
    if result.metadata["doc_type"] == "movie_review":
        reviewed_movie_id = result.metadata["reviewed_movie_id"]
        review_id = result.metadata["reviewId"]
        review_text = result.page_content
        # skip reviews whose movie document was not retrieved (for example, if
        # select_k truncated the results); direct indexing would raise KeyError
        if reviewed_movie_id in compiled_results:
            compiled_results[reviewed_movie_id]["reviews"][review_id] = review_text


# compile the retrieved movies and reviews into a string that we can pass to an
# LLM in an augmented prompt
formatted_text = ""
for movie_id, movie in compiled_results.items():
    formatted_text += "\n\n Movie Title: "
    formatted_text += movie["movie_title"]
    formatted_text += "\n Movie ID: "
    formatted_text += movie["movie_id"]
    for review_id, review_text in movie["reviews"].items():
        formatted_text += "\n Review: "
        formatted_text += review_text


print(formatted_text)

Get an AI summary of results¶

Here, using the formatted_text from above, we set up a prompt template, and then pass it the retrieved movie reviews along with the original query text to be answered.

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pprint import pprint

MODEL = ChatOpenAI(model="gpt-4o", temperature=0)

VECTOR_ANSWER_PROMPT = PromptTemplate.from_template("""

A list of Movie Reviews appears below. Please answer the Initial Prompt text
(below) using only the listed Movie Reviews.

Please include all movies that might be helpful to someone looking for movie
recommendations.

Initial Prompt:
{initial_prompt}

Movie Reviews:
{movie_reviews}
""")

formatted_prompt = VECTOR_ANSWER_PROMPT.format(
    initial_prompt=INITIAL_PROMPT_TEXT,
    movie_reviews=formatted_text,
)

result = MODEL.invoke(formatted_prompt)

# print(formatted_prompt)
print(result.content)
