langchain_community.document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer

class langchain_community.document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer

Transform HTML content by extracting specific tags and removing unwanted ones.

Example

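from langchain_community.document_transformers import BeautifulSoupTransformer
from langchain_core.documents import Document

# Any Document whose page_content holds HTML will do.
docs = [Document(page_content="<p>Hello world</p>")]

bs4_transformer = BeautifulSoupTransformer()
docs_transformed = bs4_transformer.transform_documents(docs)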

Initialize the transformer.

This checks if the BeautifulSoup4 package is installed. If not, it raises an ImportError.

Methods

__init__()

Initialize the transformer.

atransform_documents(documents, **kwargs)

Asynchronously transform a list of documents.

extract_tags(html_content, tags)

Extract specific tags from a given HTML content.

remove_unnecessary_lines(content)

Clean up the content by removing unnecessary lines.

remove_unwanted_tags(html_content, unwanted_tags)

Remove unwanted tags from a given HTML content.

transform_documents(documents[, ...])

Transform a list of Document objects by cleaning their HTML content.

__init__() → None

Initialize the transformer.

This checks if the BeautifulSoup4 package is installed. If not, it raises an ImportError.
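
The kind of check described above can be sketched with the standard library alone. This is a minimal illustration of an optional-dependency guard, not the transformer's own code: the helper name require_package and the wording of the error message are hypothetical.

```python
import importlib.util


def require_package(module_name: str, pip_name: str) -> None:
    """Raise ImportError with an install hint if `module_name` cannot be imported.

    Hypothetical helper for illustration; the real class performs its own check.
    """
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{pip_name} is required. Install it with `pip install {pip_name}`."
        )


# For BeautifulSoup4 the import name is `bs4` and the PyPI name is `beautifulsoup4`:
# require_package("bs4", "beautifulsoup4")
```

Failing at construction time, rather than on the first call to a transform method, surfaces a missing dependency before any documents are processed.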

async atransform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document]

Asynchronously transform a list of documents.

Parameters

documents – A sequence of Documents to be transformed.

Returns

A sequence of transformed Documents.

static extract_tags(html_content: str, tags: List[str]) → str

Extract specific tags from a given HTML content.

Parameters
  • html_content – The original HTML content string.

  • tags – A list of tags to be extracted from the HTML.

Returns

A string combining the content of the extracted tags.
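
To make the idea of tag extraction concrete, here is a stand-in that uses only Python's standard html.parser module. It is a sketch of the technique, not the library's implementation: the names _TagTextCollector and extract_tags_sketch are hypothetical, and joining the pieces with single spaces is an assumption.

```python
from html.parser import HTMLParser
from typing import List


class _TagTextCollector(HTMLParser):
    """Collect text that appears inside any of the requested tags."""

    def __init__(self, tags: List[str]) -> None:
        super().__init__()
        self._wanted = set(tags)
        self._depth = 0  # number of requested tags currently open
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._wanted:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self._wanted and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only while at least one requested tag is open.
        if self._depth > 0 and text:
            self.parts.append(text)


def extract_tags_sketch(html_content: str, tags: List[str]) -> str:
    """Return the text found inside `tags`, joined by single spaces (assumed)."""
    collector = _TagTextCollector(tags)
    collector.feed(html_content)
    collector.close()
    return " ".join(collector.parts)


html = "<h1>Title</h1><p>Hello <b>world</b></p><ul><li>one</li></ul>"
print(extract_tags_sketch(html, ["p", "li"]))  # prints "Hello world one"
```

Text nested inside a requested tag (the <b> element here) is kept because its enclosing <p> is still open; text outside every requested tag (the <h1>) is dropped.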

static remove_unnecessary_lines(content: str) → str

Clean up the content by removing unnecessary lines.

Parameters

content – A string, which may contain unnecessary lines or spaces.

Returns

A cleaned string with unnecessary lines removed.
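
One plausible reading of "unnecessary lines" is blank lines and leading or trailing whitespace. The sketch below implements that reading in plain Python; the function name is hypothetical, and joining the surviving lines with a single space (rather than a newline) is an assumption.

```python
def remove_unnecessary_lines_sketch(content: str) -> str:
    """Strip each line, drop the empty ones, and join the rest with spaces."""
    stripped = (line.strip() for line in content.splitlines())
    return " ".join(line for line in stripped if line)


print(remove_unnecessary_lines_sketch("  Hello  \n\n\n  world \n"))  # prints "Hello world"
```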

static remove_unwanted_tags(html_content: str, unwanted_tags: List[str]) → str

Remove unwanted tags from a given HTML content.

Parameters
  • html_content – The original HTML content string.

  • unwanted_tags – A list of tags to be removed from the HTML.

Returns

A cleaned HTML string with unwanted tags removed.
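
The effect of removing unwanted tags can be illustrated without BeautifulSoup. The following standard-library sketch drops each unwanted element together with everything inside it and re-emits the rest of the markup. The names _TagStripper and remove_unwanted_tags_sketch are hypothetical, and the sketch is deliberately limited: comments, doctype declarations, and character entities are not preserved.

```python
from html.parser import HTMLParser
from typing import List


class _TagStripper(HTMLParser):
    """Re-emit HTML while skipping unwanted elements and their contents."""

    def __init__(self, unwanted: List[str]) -> None:
        super().__init__()
        self._unwanted = set(unwanted)
        self._skip_depth = 0  # number of unwanted tags currently open
        self.out: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._unwanted:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            # get_starttag_text() returns the start tag exactly as written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag to emit.
        if tag not in self._unwanted and self._skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self._unwanted:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.out.append(data)


def remove_unwanted_tags_sketch(html_content: str, unwanted_tags: List[str]) -> str:
    stripper = _TagStripper(unwanted_tags)
    stripper.feed(html_content)
    stripper.close()
    return "".join(stripper.out)


html = '<div><script>var x = 1;</script><p class="a">Hi</p><style>p{}</style></div>'
print(remove_unwanted_tags_sketch(html, ["script", "style"]))
# prints '<div><p class="a">Hi</p></div>'
```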

transform_documents(documents: Sequence[Document], unwanted_tags: List[str] = ['script', 'style'], tags_to_extract: List[str] = ['p', 'li', 'div', 'a'], remove_lines: bool = True, **kwargs: Any) → Sequence[Document]

Transform a list of Document objects by cleaning their HTML content.

Parameters
  • documents – A sequence of Document objects containing HTML content.

  • unwanted_tags – A list of tags to be removed from the HTML.

  • tags_to_extract – A list of tags whose content will be extracted.

  • remove_lines – If set to True, unnecessary lines will be removed from the HTML content.

Returns

A sequence of Document objects with transformed content.
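
How the three parameters might combine can be sketched end to end with the standard library. Everything here is a stand-in: Doc is a hypothetical substitute for Document, the function and class names are invented for illustration, and the order of the steps (drop unwanted tags, keep text from the tags to extract, then clean up lines) is an assumption suggested by the parameter list rather than something this page states.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class _Cleaner(HTMLParser):
    """Skip text inside unwanted tags; keep text inside tags to extract."""

    def __init__(self, unwanted: Sequence[str], wanted: Sequence[str]) -> None:
        super().__init__()
        self._unwanted, self._wanted = set(unwanted), set(wanted)
        self._skip = 0  # unwanted tags currently open
        self._keep = 0  # wanted tags currently open
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._unwanted:
            self._skip += 1
        if tag in self._wanted:
            self._keep += 1

    def handle_endtag(self, tag):
        if tag in self._unwanted:
            self._skip = max(0, self._skip - 1)
        if tag in self._wanted:
            self._keep = max(0, self._keep - 1)

    def handle_data(self, data):
        if self._skip == 0 and self._keep > 0:
            self.parts.append(data)


def transform_documents_sketch(
    documents: Sequence[Doc],
    unwanted_tags: Sequence[str] = ("script", "style"),
    tags_to_extract: Sequence[str] = ("p", "li", "div", "a"),
    remove_lines: bool = True,
) -> List[Doc]:
    result = []
    for doc in documents:
        parser = _Cleaner(unwanted_tags, tags_to_extract)
        parser.feed(doc.page_content)
        parser.close()
        text = " ".join(parser.parts)
        if remove_lines:
            # Strip every line and drop the ones that end up empty.
            lines = (line.strip() for line in text.splitlines())
            text = " ".join(line for line in lines if line)
        # The sketch returns new objects and copies metadata unchanged.
        result.append(Doc(page_content=text, metadata=dict(doc.metadata)))
    return result


docs = [Doc("<div>\n  <script>x()</script>\n  <p>Hello</p>\n  <p>  world </p>\n</div>",
            {"source": "a"})]
print(transform_documents_sketch(docs)[0].page_content)  # prints "Hello world"
```

The sketch uses tuples for its default tag lists so that the defaults cannot be mutated between calls; that is a choice made for the sketch only.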
