langchain.text_splitter.HTMLHeaderTextSplitter
- class langchain.text_splitter.HTMLHeaderTextSplitter(headers_to_split_on: List[Tuple[str, str]], return_each_element: bool = False)
Splits HTML files based on specified headers. Requires the lxml package.
Create a new HTMLHeaderTextSplitter.
- Parameters
headers_to_split_on – List of tuples pairing each header to track with an (arbitrary) metadata key. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
return_each_element – Return each element with its associated headers.
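A minimal usage sketch of the constructor and split_text follows. The sample HTML is made up for illustration, and the example assumes that the returned documents expose metadata and page_content attributes. The import is guarded because the splitter needs langchain and lxml to be installed.

```python
# Sample HTML; the content is made up for illustration.
html = """
<html><body>
  <h1>Guide</h1>
  <p>Intro paragraph.</p>
  <h2>Setup</h2>
  <p>Install the package.</p>
</body></html>
"""

# (header tag, metadata key) pairs, as described under Parameters.
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

docs = []
try:
    from langchain.text_splitter import HTMLHeaderTextSplitter

    splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    docs = splitter.split_text(html)
except ImportError:
    # langchain or its HTML parsing dependency (lxml) is not installed.
    print("Install langchain and lxml to run this example.")

for doc in docs:
    # Assumption: each result carries the tracked headers in its metadata.
    print(doc.metadata, doc.page_content)
```

The metadata keys ("Header 1", "Header 2") are arbitrary labels; any strings can be used as long as each is paired with an allowed header tag.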
Methods
__init__(headers_to_split_on[, ...]) – Create a new HTMLHeaderTextSplitter.
aggregate_elements_to_chunks(elements) – Combine elements with common metadata into chunks.
split_text(text) – Split an HTML text string.
split_text_from_file(file) – Split an HTML file.
split_text_from_url(url) – Split HTML from a web URL.
- __init__(headers_to_split_on: List[Tuple[str, str]], return_each_element: bool = False)
Create a new HTMLHeaderTextSplitter.
- Parameters
headers_to_split_on – List of tuples pairing each header to track with an (arbitrary) metadata key. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
return_each_element – Return each element with its associated headers.
- aggregate_elements_to_chunks(elements: List[ElementType]) → List[Document]
Combine elements with common metadata into chunks.
- Parameters
elements – HTML element content with associated identifying info and metadata
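To illustrate the idea of combining elements with common metadata, here is a standalone sketch that does not use the library. It assumes that "common metadata" means consecutive elements whose metadata dicts are equal. The Element alias, the aggregate_elements helper, and the newline separator are hypothetical; the library's own ElementType, separator, and Document return type may differ.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for the library's ElementType: only the two fields
# this sketch needs ("content" and "metadata") are modelled.
Element = Dict[str, Any]


def aggregate_elements(elements: List[Element]) -> List[Element]:
    """Merge consecutive elements whose metadata dicts are equal."""
    chunks: List[Element] = []
    for element in elements:
        if chunks and chunks[-1]["metadata"] == element["metadata"]:
            # Same header context as the previous chunk: append the text.
            chunks[-1]["content"] += "\n" + element["content"]
        else:
            # New header context: start a new chunk (copied, so the
            # input elements are not mutated).
            chunks.append(
                {
                    "content": element["content"],
                    "metadata": dict(element["metadata"]),
                }
            )
    return chunks


elements = [
    {"content": "Intro paragraph.", "metadata": {"Header 1": "Guide"}},
    {"content": "More intro.", "metadata": {"Header 1": "Guide"}},
    {
        "content": "Install the package.",
        "metadata": {"Header 1": "Guide", "Header 2": "Setup"},
    },
]
chunks = aggregate_elements(elements)
for chunk in chunks:
    print(chunk["metadata"], repr(chunk["content"]))
```

In this sketch the first two elements share the same metadata and end up in one chunk, while the third starts a new chunk because it sits under an additional h2 header.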