langchain.text_splitter.HTMLHeaderTextSplitter

class langchain.text_splitter.HTMLHeaderTextSplitter(headers_to_split_on: List[Tuple[str, str]], return_each_element: bool = False)[source]

Splits HTML files based on specified headers. Requires the lxml package.

Create a new HTMLHeaderTextSplitter.

Parameters
  • headers_to_split_on – List of tuples pairing each header tag to track with an (arbitrary) metadata key. Allowed header tags: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].

  • return_each_element – Return each element with its associated headers.
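
The parameters above can be exercised with a minimal usage sketch, assuming langchain and lxml are installed. The sample HTML and the helper name split_sample are made up for illustration; only the class, its two constructor parameters, and split_text come from this page.

```python
# Sample HTML with two header levels (made up for this example).
SAMPLE_HTML = """
<html>
  <body>
    <h1>Guide</h1>
    <p>Introductory paragraph.</p>
    <h2>Setup</h2>
    <p>Installation notes.</p>
  </body>
</html>
"""

# Each tuple maps a header tag to the metadata key its text is stored under.
HEADERS_TO_SPLIT_ON = [("h1", "Header 1"), ("h2", "Header 2")]


def split_sample(return_each_element: bool = False):
    """Split SAMPLE_HTML and return the resulting list of Document objects.

    Hypothetical helper, not part of the library.
    """
    # Imported lazily so this module still loads where langchain is missing.
    from langchain.text_splitter import HTMLHeaderTextSplitter

    splitter = HTMLHeaderTextSplitter(
        headers_to_split_on=HEADERS_TO_SPLIT_ON,
        return_each_element=return_each_element,
    )
    return splitter.split_text(SAMPLE_HTML)
```

Calling split_sample() returns a list of Document objects. Each one carries its text in page_content and the text of the enclosing headers in metadata, under the keys chosen in HEADERS_TO_SPLIT_ON (here "Header 1" and "Header 2").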

Methods

__init__(headers_to_split_on[, ...])

Create a new HTMLHeaderTextSplitter.

aggregate_elements_to_chunks(elements)

Combine elements with common metadata into chunks

split_text(text)

Split HTML text string

split_text_from_file(file)

Split HTML file

split_text_from_url(url)

Split HTML from web URL

__init__(headers_to_split_on: List[Tuple[str, str]], return_each_element: bool = False)[source]

Create a new HTMLHeaderTextSplitter.

Parameters
  • headers_to_split_on – List of tuples pairing each header tag to track with an (arbitrary) metadata key. Allowed header tags: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].

  • return_each_element – Return each element with its associated headers.

aggregate_elements_to_chunks(elements: List[ElementType]) → List[Document][source]

Combine elements with common metadata into chunks

Parameters

elements – HTML element content with associated identifying info and metadata
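
The idea of combining elements with common metadata can be illustrated with a simplified, standard-library-only sketch. It is not the library's implementation: the helper name merge_adjacent_by_metadata, the plain-dict element shape ("content" and "metadata" keys), the newline separator, and the choice to merge only consecutive elements are all assumptions made for illustration. The real method returns Document objects rather than dicts.

```python
from typing import Any, Dict, List


def merge_adjacent_by_metadata(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge consecutive elements whose metadata dicts are equal.

    Hypothetical helper; each element is assumed to be a dict with a
    "content" string and a "metadata" dict.
    """
    chunks: List[Dict[str, Any]] = []
    for element in elements:
        if chunks and chunks[-1]["metadata"] == element["metadata"]:
            # Same headers as the previous chunk: append the text to it.
            chunks[-1]["content"] += "\n" + element["content"]
        else:
            # Headers changed: start a new chunk, copying so the input is untouched.
            chunks.append(
                {"content": element["content"], "metadata": dict(element["metadata"])}
            )
    return chunks


# Made-up elements: two under the same header, one under a new sub-header.
elements = [
    {"content": "Intro text.", "metadata": {"Header 1": "Guide"}},
    {"content": "More intro.", "metadata": {"Header 1": "Guide"}},
    {"content": "Install steps.", "metadata": {"Header 1": "Guide", "Header 2": "Setup"}},
]
chunks = merge_adjacent_by_metadata(elements)
```

Here the first two elements share the same metadata and end up in one chunk, while the third starts a new chunk because its metadata gains a "Header 2" entry.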

split_text(text: str) → List[Document][source]

Split HTML text string

Parameters

text – HTML text

split_text_from_file(file: Any) → List[Document][source]

Split HTML file

Parameters

file – HTML file

split_text_from_url(url: str) → List[Document][source]

Split HTML from web URL

Parameters

url – web URL