langchain_graph_retriever.transformers¶
Package containing useful Document Transformers.
Many of these add metadata that could be useful for linking content, such as extracting named entities or keywords from the page content.
Also includes a transformer for shredding metadata, for use with stores that do not support querying on elements of lists.
ParentTransformer ¶
ParentTransformer(
*,
path_metadata_key: str = "path",
parent_metadata_key: str = "parent",
path_delimiter: str = "\\",
)
Bases: BaseDocumentTransformer
Adds the hierarchical parent path to the document metadata.

| Parameter | Type | Description |
| --- | --- | --- |
| `path_metadata_key` | `str` | Metadata key containing the path. This may correspond to paths in a file system, hierarchy in a document, etc. |
| `parent_metadata_key` | `str` | Metadata key for the added parent path. |
| `path_delimiter` | `str` | Delimiter of items in the path. |
Example
An example of how to use this transformer is available in the guide.
Notes
Expects each document to contain its path in its metadata.
Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/parent.py
ShreddingTransformer ¶
ShreddingTransformer(
*,
keys: set[str] = set(),
path_delimiter: str = DEFAULT_PATH_DELIMITER,
static_value: Any = DEFAULT_STATIC_VALUE,
)
Bases: BaseDocumentTransformer
Shreds sequence-based metadata fields.
Certain vector stores do not support storing or searching on metadata fields with sequence-based values. This transformer converts sequence-based fields into simple metadata values.
Example
An example of how to use this transformer is available in the guide.
| Parameter | Type | Description |
| --- | --- | --- |
| `keys` | `set[str]` | A set of metadata keys to shred. If empty, all sequence-based fields will be shredded. |
| `path_delimiter` | `str` | The path delimiter to use when building shredded keys. |
| `static_value` | `Any` | The value to set on each shredded key. |
Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/shredding.py
restore_documents ¶
Restore documents transformed by the ShreddingTransformer back to their original state before shredding. Note that any non-string values inside lists will be converted to strings after restoring.

| Parameter | Type | Description |
| --- | --- | --- |
| `documents` | `Sequence[Document]` | A sequence of Documents to be transformed. |

| Returns | Description |
| --- | --- |
| `Sequence[Document]` | A sequence of transformed Documents. |
Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/shredding.py
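The restore step can be sketched as the inverse of shredding. Again an illustration only, not the library's implementation, with hypothetical `DELIMITER` and `STATIC` values that would need to match those used when shredding. Note how restored items come back as strings, matching the note above about non-string list values.

```python
DELIMITER = "."  # hypothetical; must match the delimiter used when shredding
STATIC = "§"     # hypothetical; must match the static value used when shredding

def restore(metadata: dict) -> dict:
    """Rebuild list-valued fields from flattened keys marked with the static value."""
    restored: dict = {}
    for key, value in metadata.items():
        if value == STATIC and DELIMITER in key:
            base, _, item = key.partition(DELIMITER)
            restored.setdefault(base, []).append(item)  # items restored as str
        else:
            restored[key] = value
    return restored

print(restore({"keywords.graph": "§", "keywords.rag": "§", "title": "intro"}))
# {'keywords': ['graph', 'rag'], 'title': 'intro'}
```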
shredded_key ¶
gliner ¶
GLiNERTransformer ¶
GLiNERTransformer(
labels: list[str],
*,
batch_size: int = 8,
metadata_key_prefix: str = "",
model: str | GLiNER = "urchade/gliner_mediumv2.1",
)
Bases: BaseDocumentTransformer
Add metadata to documents about named entities using GLiNER.
Extracts structured entity labels from text, identifying key attributes and categories to enrich document metadata with semantic information.
GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like).
Prerequisites
This transformer requires the `gliner` extra to be installed.
Example
An example of how to use this transformer is available in the guide.
| Parameter | Type | Description |
| --- | --- | --- |
| `labels` | `list[str]` | List of entity kinds to extract. |
| `batch_size` | `int` | The number of documents to process in each batch. |
| `metadata_key_prefix` | `str` | A prefix to add to metadata keys output by the extractor. This will be prepended to the label, with the value (or values) holding the generated keywords for that entity kind. |
| `model` | `str \| GLiNER` | The GLiNER model to use. Pass the name of a model to load or pass an instantiated GLiNER model instance. |
Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/gliner.py
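To make the `metadata_key_prefix` behavior concrete, here is a sketch of the metadata shape implied by the parameter descriptions above, using hypothetical extraction results for `labels=["person", "place"]` and `metadata_key_prefix="ner_"`. Running the real model requires the `gliner` extra; this only illustrates how the prefix is prepended to each label.

```python
prefix = "ner_"
# Hypothetical entities a GLiNER model might extract for these labels.
extracted = {"person": ["Ada Lovelace"], "place": ["London"]}

# The prefix is prepended to each label to form the metadata key.
metadata = {f"{prefix}{label}": values for label, values in extracted.items()}
print(metadata)  # {'ner_person': ['Ada Lovelace'], 'ner_place': ['London']}
```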
html ¶
HyperlinkTransformer ¶
HyperlinkTransformer(
*,
url_metadata_key: str = "url",
metadata_key: str = "hyperlink",
drop_fragments: bool = True,
)
Bases: BaseDocumentTransformer
Extracts hyperlinks from HTML content and stores them in document metadata.
Prerequisites
This transformer requires the `html` extra to be installed.
Example
An example of how to use this transformer is available in the guide.
| Parameter | Type | Description |
| --- | --- | --- |
| `url_metadata_key` | `str` | The metadata field containing the URL of the document. Must be set before transforming; needed to resolve relative paths. |
| `metadata_key` | `str` | The metadata field to populate with documents linked from this content. |
| `drop_fragments` | `bool` | Whether fragments in URLs and links should be dropped. |
Notes
Expects each document to contain its URL in its metadata.
Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/html.py
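The core mechanics can be sketched with the standard library alone. This is an illustration only, not the library's implementation: collect `href` values from anchor tags, resolve them against the document's URL (which is why `url_metadata_key` must be set), and drop fragments as `drop_fragments=True` implies.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkCollector(HTMLParser):
    """Collect absolute, fragment-free hyperlinks from HTML content."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL, drop #fragments.
                    url, _fragment = urldefrag(urljoin(self.base_url, value))
                    self.links.append(url)

collector = LinkCollector("https://example.com/docs/")
collector.feed('<a href="intro.html#top">Intro</a> <a href="/faq">FAQ</a>')
print(collector.links)
# ['https://example.com/docs/intro.html', 'https://example.com/faq']
```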
keybert ¶
KeyBERTTransformer ¶
KeyBERTTransformer(
*,
batch_size: int = 8,
metadata_key: str = "keywords",
model: str | KeyBERT = "all-MiniLM-L6-v2",
)
Bases: BaseDocumentTransformer
Add metadata to documents about keywords using KeyBERT.
Extracts key topics and concepts from text, generating metadata that highlights the most relevant terms to describe the content.
KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.
Prerequisites
This transformer requires the `keybert` extra to be installed.
Example
An example of how to use this transformer is available in the guide.
| Parameter | Type | Description |
| --- | --- | --- |
| `batch_size` | `int` | The number of documents to process in each batch. |
| `metadata_key` | `str` | The name of the key used in the metadata output. |
| `model` | `str \| KeyBERT` | The KeyBERT model to use. Pass the name of a model to load or pass an instantiated KeyBERT model instance. |
Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/keybert.py
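A sketch of the batching implied by `batch_size`, as an illustration only (running KeyBERT itself requires the `keybert` extra): documents are processed in fixed-size chunks rather than all at once.

```python
def batches(items: list, batch_size: int = 8):
    """Yield fixed-size chunks of items; the final chunk may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sizes = [len(batch) for batch in batches(list(range(20)), batch_size=8)]
print(sizes)  # [8, 8, 4]
```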
parent ¶
shredding ¶
Shredding Transformer for sequence-based metadata fields.
spacy ¶
SpacyNERTransformer ¶
SpacyNERTransformer(
*,
include_labels: set[str] = set(),
exclude_labels: set[str] = set(),
limit: int | None = None,
metadata_key: str = "entities",
model: str | Language = "en_core_web_sm",
)
Bases: BaseDocumentTransformer
Add metadata to documents about named entities using spaCy.
Identifies and labels named entities in text, extracting structured metadata such as organizations, locations, dates, and other key entity types.
spaCy is a library for Natural Language Processing in Python. Here it is used for Named Entity Recognition (NER) to extract values from document content.
Prerequisites
This transformer requires the `spacy` extra to be installed.
Example
An example of how to use this transformer is available in the guide.
| Parameter | Type | Description |
| --- | --- | --- |
| `include_labels` | `set[str]` | Set of entity labels to include. Will include all labels if empty. |
| `exclude_labels` | `set[str]` | Set of entity labels to exclude. Will not exclude anything if empty. |
| `metadata_key` | `str` | The metadata key to store the extracted entities in. |
| `model` | `str \| Language` | The spaCy model to use. Pass the name of a model to load or pass an instantiated spaCy model instance. |
Notes
See spaCy docs for the selected model to determine what NER labels will be used. The default model en_core_web_sm produces: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART.