
langchain_graph_retriever.transformers

Package containing useful Document Transformers.

Many of these add metadata that could be useful for linking content, such as extracting named entities or keywords from the page content.

Also includes a transformer for shredding metadata, for use with stores that do not support querying on elements of lists.

ParentTransformer

ParentTransformer(
    *,
    path_metadata_key: str = "path",
    parent_metadata_key: str = "parent",
    path_delimiter: str = "\\",
)

Bases: BaseDocumentTransformer

Adds the hierarchical parent path to the document metadata.

PARAMETER DESCRIPTION
path_metadata_key

Metadata key containing the path. This may correspond to paths in a file system, hierarchy in a document, etc.

TYPE: str DEFAULT: 'path'

parent_metadata_key

Metadata key for the added parent path.

TYPE: str DEFAULT: 'parent'

path_delimiter

Delimiter of items in the path.

TYPE: str DEFAULT: '\\'

Example

An example of how to use this transformer is available in the guide.

Notes

Expects each document to contain its path in its metadata.

Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/parent.py
def __init__(
    self,
    *,
    path_metadata_key: str = "path",
    parent_metadata_key: str = "parent",
    path_delimiter: str = "\\",
):
    self._path_metadata_key = path_metadata_key
    self._parent_metadata_key = parent_metadata_key
    self._path_delimiter = path_delimiter

ShreddingTransformer

ShreddingTransformer(
    *,
    keys: set[str] = set(),
    path_delimiter: str = DEFAULT_PATH_DELIMITER,
    static_value: Any = DEFAULT_STATIC_VALUE,
)

Bases: BaseDocumentTransformer

Shreds sequence-based metadata fields.

Certain vector stores do not support storing or searching on metadata fields with sequence-based values. This transformer converts sequence-based fields into simple metadata values.

Example

An example of how to use this transformer is available in the guide.

PARAMETER DESCRIPTION
keys

A set of metadata keys to shred. If empty, all sequence-based fields will be shredded.

TYPE: set[str] DEFAULT: set()

path_delimiter

The path delimiter to use when building shredded keys.

TYPE: str DEFAULT: DEFAULT_PATH_DELIMITER

static_value

The value to set on each shredded key.

TYPE: Any DEFAULT: DEFAULT_STATIC_VALUE

Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/shredding.py
def __init__(
    self,
    *,
    keys: set[str] = set(),
    path_delimiter: str = DEFAULT_PATH_DELIMITER,
    static_value: Any = DEFAULT_STATIC_VALUE,
):
    self.keys = keys
    self.path_delimiter = path_delimiter
    self.static_value = static_value
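The shredding idea can be sketched without the library. The delimiter and static value below are placeholders (the actual `DEFAULT_PATH_DELIMITER` and `DEFAULT_STATIC_VALUE` may differ); the point is that each element of a sequence-valued field becomes its own simple key, so stores that only support equality filters on scalar values can still be queried.

```python
import json

# Illustration of shredding (placeholder delimiter and static value):
# each list element becomes a standalone key mapped to a constant.
PATH_DELIMITER = "→"
STATIC_VALUE = "$"

def shred(metadata: dict) -> dict:
    shredded = {}
    for key, value in metadata.items():
        if isinstance(value, (list, tuple, set)):
            for item in value:
                # Querying for "graph" in "keywords" becomes an equality
                # check on the single shredded key below.
                shredded[f"{key}{PATH_DELIMITER}{json.dumps(item)}"] = STATIC_VALUE
        else:
            shredded[key] = value
    return shredded

print(shred({"keywords": ["graph", "rag"], "source": "web"}))
```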

restore_documents

restore_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Restore documents transformed by the ShreddingTransformer.

Restores documents to their original state before shredding.

Note that any non-string values inside lists will be converted to strings after restoring.

PARAMETER DESCRIPTION
documents

A sequence of Documents to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Documents.

Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/shredding.py
def restore_documents(
    self, documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]:
    """
    Restore documents transformed by the ShreddingTransformer.

    Restore documents transformed by the ShreddingTransformer back to
    their original state before shredding.

    Note that any non-string values inside lists will be converted to strings
    after restoring.

    Args:
        documents: A sequence of Documents to be transformed.

    Returns
    -------
    Sequence[Document]
        A sequence of transformed Documents.
    """
    restored_docs = []
    for document in documents:
        new_doc = Document(id=document.id, page_content=document.page_content)
        shredded_keys = set(
            json.loads(document.metadata.pop(SHREDDED_KEYS_KEY, "[]"))
        )

        for key, value in document.metadata.items():
            # Check if the key belongs to a shredded group
            split_key = key.split(self.path_delimiter, 1)
            if (
                len(split_key) == 2
                and split_key[0] in shredded_keys
                and value == self.static_value
            ):
                original_key, original_value = split_key
                value = json.loads(original_value)
                if original_key not in new_doc.metadata:
                    new_doc.metadata[original_key] = []
                new_doc.metadata[original_key].append(value)
            else:
                # Retain non-shredded metadata as is
                new_doc.metadata[key] = value

        restored_docs.append(new_doc)

    return restored_docs
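The restore step is the inverse parse, mirrored here on plain dicts with the same placeholder delimiter and static value as before (the library's defaults may differ): split each key once on the delimiter, and if the prefix was a shredded key and the value is the static marker, JSON-decode the suffix back into a list element.

```python
import json

# Round-trip illustration of restoring shredded metadata, mirroring the
# logic of restore_documents above (placeholder delimiter/static value).
PATH_DELIMITER = "→"
STATIC_VALUE = "$"
shredded_keys = {"keywords"}  # normally recovered from the stored key list

shredded = {'keywords→"graph"': "$", 'keywords→"rag"': "$", "source": "web"}
restored: dict = {}
for key, value in shredded.items():
    head, sep, tail = key.partition(PATH_DELIMITER)
    if sep and head in shredded_keys and value == STATIC_VALUE:
        restored.setdefault(head, []).append(json.loads(tail))
    else:
        restored[key] = value

print(restored)  # {'keywords': ['graph', 'rag'], 'source': 'web'}
```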

shredded_key

shredded_key(key: str, value: Any) -> str

Get the shredded key for a key/value pair.

PARAMETER DESCRIPTION
key

The metadata key to shred

TYPE: str

value

The metadata value to shred

TYPE: Any

RETURNS DESCRIPTION
str

The shredded key.

Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/shredding.py
def shredded_key(self, key: str, value: Any) -> str:
    """
    Get the shredded key for a key/value pair.

    Parameters
    ----------
    key :
        The metadata key to shred
    value :
        The metadata value to shred

    Returns
    -------
    str
        the shredded key
    """
    return f"{key}{self.path_delimiter}{json.dumps(value)}"

shredded_value

shredded_value() -> str

Get the shredded value for a key/value pair.

RETURNS DESCRIPTION
str

The shredded value.

Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/shredding.py
def shredded_value(self) -> str:
    """
    Get the shredded value for a key/value pair.

    Returns
    -------
    str
        the shredded value
    """
    return self.static_value

gliner

GLiNERTransformer

GLiNERTransformer(
    labels: list[str],
    *,
    batch_size: int = 8,
    metadata_key_prefix: str = "",
    model: str | GLiNER = "urchade/gliner_mediumv2.1",
)

Bases: BaseDocumentTransformer

Add metadata to documents about named entities using GLiNER.

Extracts structured entity labels from text, identifying key attributes and categories to enrich document metadata with semantic information.

GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like).

Prerequisites

This transformer requires the gliner extra to be installed.

pip install -qU langchain_graph_retriever[gliner]

Example

An example of how to use this transformer is available in the guide.

PARAMETER DESCRIPTION
labels

List of entity kinds to extract.

TYPE: list[str]

batch_size

The number of documents to process in each batch.

TYPE: int DEFAULT: 8

metadata_key_prefix

A prefix to add to metadata keys output by the extractor. This will be prepended to the label, with the value (or values) holding the generated keywords for that entity kind.

TYPE: str DEFAULT: ''

model

The GLiNER model to use. Pass the name of a model to load or pass an instantiated GLiNER model instance.

TYPE: str | GLiNER DEFAULT: 'urchade/gliner_mediumv2.1'

Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/gliner.py
def __init__(
    self,
    labels: list[str],
    *,
    batch_size: int = 8,
    metadata_key_prefix: str = "",
    model: str | GLiNER = "urchade/gliner_mediumv2.1",
):
    if isinstance(model, GLiNER):
        self._model = model
    elif isinstance(model, str):
        self._model = GLiNER.from_pretrained(model)
    else:
        raise ValueError(f"Invalid model: {model}")

    self._batch_size = batch_size
    self._labels = labels
    self.metadata_key_prefix = metadata_key_prefix
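How `metadata_key_prefix` shapes the output can be shown with hypothetical extraction results (the entity tuples below are made up; they stand in for what a GLiNER model would return). Each label becomes a metadata key, prefixed, with the matched strings collected as its values.

```python
# Illustration (hypothetical extraction results): mapping entity labels
# to prefixed metadata keys, as a metadata_key_prefix of "entity_" would.
prefix = "entity_"
extracted = [("Paris", "location"), ("NASA", "organization"), ("Berlin", "location")]

metadata: dict[str, list[str]] = {}
for text, label in extracted:
    metadata.setdefault(f"{prefix}{label}", []).append(text)

print(metadata)
# {'entity_location': ['Paris', 'Berlin'], 'entity_organization': ['NASA']}
```

Note that the resulting values are lists, so stores without list support may need the ShreddingTransformer applied afterwards.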

html

HyperlinkTransformer

HyperlinkTransformer(
    *,
    url_metadata_key: str = "url",
    metadata_key: str = "hyperlink",
    drop_fragments: bool = True,
)

Bases: BaseDocumentTransformer

Extracts hyperlinks from HTML content and stores them in document metadata.

Prerequisites

This transformer requires the html extra to be installed.

pip install -qU langchain_graph_retriever[html]

Example

An example of how to use this transformer is available in the guide.

PARAMETER DESCRIPTION
url_metadata_key

The metadata field containing the URL of the document. Must be set before transforming. Needed to resolve relative paths.

TYPE: str DEFAULT: 'url'

metadata_key

The metadata field to populate with documents linked from this content.

TYPE: str DEFAULT: 'hyperlink'

drop_fragments

Whether fragments in URLs and links should be dropped.

TYPE: bool DEFAULT: True

Notes

Expects each document to contain its URL in its metadata.

Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/html.py
def __init__(
    self,
    *,
    url_metadata_key: str = "url",
    metadata_key: str = "hyperlink",
    drop_fragments: bool = True,
):
    self._url_metadata_key = url_metadata_key
    self._metadata_key = metadata_key
    self._drop_fragments = drop_fragments
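The extraction it performs can be approximated with only the standard library (the transformer itself relies on the `html` extra, so this is an illustration of the behavior, not its implementation): resolve each `href` against the document's URL so relative links work, and optionally drop `#fragment` suffixes so links to sections of the same page collapse to one target.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Illustration: collect hyperlinks from HTML, resolving relative hrefs
# against the page URL and dropping fragments, as the transformer does.
class LinkCollector(HTMLParser):
    def __init__(self, base_url: str, drop_fragments: bool = True):
        super().__init__()
        self.base_url = base_url
        self.drop_fragments = drop_fragments
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                url = urljoin(self.base_url, href)
                if self.drop_fragments:
                    url = urldefrag(url).url
                self.links.append(url)

collector = LinkCollector("https://example.com/docs/")
collector.feed('<a href="guide#intro">Guide</a> <a href="/api">API</a>')
print(collector.links)
# ['https://example.com/docs/guide', 'https://example.com/api']
```

This is why the document's own URL must be present in its metadata before transforming: without it, relative links such as `/api` cannot be resolved.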

keybert

KeyBERTTransformer

KeyBERTTransformer(
    *,
    batch_size: int = 8,
    metadata_key: str = "keywords",
    model: str | KeyBERT = "all-MiniLM-L6-v2",
)

Bases: BaseDocumentTransformer

Add metadata to documents about keywords using KeyBERT.

Extracts key topics and concepts from text, generating metadata that highlights the most relevant terms to describe the content.

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

Prerequisites

This transformer requires the keybert extra to be installed.

pip install -qU langchain_graph_retriever[keybert]

Example

An example of how to use this transformer is available in the guide.

PARAMETER DESCRIPTION
batch_size

The number of documents to process in each batch.

TYPE: int DEFAULT: 8

metadata_key

The name of the key used in the metadata output.

TYPE: str DEFAULT: 'keywords'

model

The KeyBERT model to use. Pass the name of a model to load or pass an instantiated KeyBERT model instance.

TYPE: str | KeyBERT DEFAULT: 'all-MiniLM-L6-v2'

Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/keybert.py
def __init__(
    self,
    *,
    batch_size: int = 8,
    metadata_key: str = "keywords",
    model: str | KeyBERT = "all-MiniLM-L6-v2",
):
    if isinstance(model, KeyBERT):
        self._kw_model = model
    elif isinstance(model, str):
        self._kw_model = KeyBERT(model=model)
    else:
        raise ValueError(f"Invalid model: {model}")
    self._batch_size = batch_size
    self._metadata_key = metadata_key
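The effect of `batch_size` can be shown without loading a model: documents are grouped before being handed to the embedding model, which matters for throughput. The helper below is a generic sketch of that grouping, not the transformer's internal code.

```python
# Illustration of batch_size: documents are processed in fixed-size
# groups rather than one at a time.
def batches(items, batch_size=8):
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

docs = [f"doc-{n}" for n in range(20)]
print([len(b) for b in batches(docs, batch_size=8)])  # [8, 8, 4]
```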

parent

Module containing the ParentTransformer documented above.


shredding

Shredding Transformer for sequence-based metadata fields.


spacy

SpacyNERTransformer

SpacyNERTransformer(
    *,
    include_labels: set[str] = set(),
    exclude_labels: set[str] = set(),
    limit: int | None = None,
    metadata_key: str = "entities",
    model: str | Language = "en_core_web_sm",
)

Bases: BaseDocumentTransformer

Add metadata to documents about named entities using spaCy.

Identifies and labels named entities in text, extracting structured metadata such as organizations, locations, dates, and other key entity types.

spaCy is a library for Natural Language Processing in Python. Here it is used for Named Entity Recognition (NER) to extract values from document content.

Prerequisites

This transformer requires the spacy extra to be installed.

pip install -qU langchain_graph_retriever[spacy]

Example

An example of how to use this transformer is available in the guide.

PARAMETER DESCRIPTION
include_labels

Set of entity labels to include. Will include all labels if empty.

TYPE: set[str] DEFAULT: set()

exclude_labels

Set of entity labels to exclude. Will not exclude anything if empty.

TYPE: set[str] DEFAULT: set()

limit

The maximum number of entities to keep per document. All entities are kept if None.

TYPE: int | None DEFAULT: None

metadata_key

The metadata key to store the extracted entities in.

TYPE: str DEFAULT: 'entities'

model

The spaCy model to use. Pass the name of a model to load or pass an instantiated spaCy model instance.

TYPE: str | Language DEFAULT: 'en_core_web_sm'

Notes

See spaCy docs for the selected model to determine what NER labels will be used. The default model en_core_web_sm produces: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART.

Source code in packages/langchain-graph-retriever/src/langchain_graph_retriever/transformers/spacy.py
def __init__(
    self,
    *,
    include_labels: set[str] = set(),
    exclude_labels: set[str] = set(),
    limit: int | None = None,
    metadata_key: str = "entities",
    model: str | Language = "en_core_web_sm",
):
    self.include_labels = include_labels
    self.exclude_labels = exclude_labels
    self.limit = limit
    self.metadata_key = metadata_key

    if isinstance(model, str):
        if not spacy.util.is_package(model):
            spacy.cli.download(model)  # type: ignore
        self.model = spacy.load(model)
    elif isinstance(model, Language):
        self.model = model
    else:
        raise ValueError(f"Invalid model: {model}")
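The interaction of `include_labels`, `exclude_labels`, and `limit` can be sketched with hypothetical NER results (the entity tuples below are made up; a real run would come from the spaCy model). An empty include set means "keep everything", exclusions are always applied, and `limit` truncates the final list.

```python
# Illustration (hypothetical entities): filtering NER results by label
# before they are written to metadata.
include_labels: set[str] = {"ORG", "GPE"}
exclude_labels: set[str] = set()
limit = 2

entities = [("Acme Corp", "ORG"), ("Paris", "GPE"), ("Tuesday", "DATE"), ("NASA", "ORG")]
kept = [
    text
    for text, label in entities
    if (not include_labels or label in include_labels)
    and label not in exclude_labels
][:limit]
print(kept)  # ['Acme Corp', 'Paris']
```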