langchain_core.utils.html.extract_sub_links

langchain_core.utils.html.extract_sub_links(raw_html: str, url: str, *, base_url: Optional[str] = None, pattern: Optional[Union[str, Pattern]] = None, prevent_outside: bool = True, exclude_prefixes: Sequence[str] = (), continue_on_failure: bool = False) → List[str]

Extract all links from a raw HTML string and convert them into absolute URLs.

Parameters
  • raw_html (str) – The original HTML.

  • url (str) – The URL of the page the HTML came from.

  • base_url (Optional[str]) – The base URL against which links are checked to decide whether they are outside links.

  • pattern (Optional[Union[str, Pattern]]) – Regex used to extract links from the raw HTML.

  • prevent_outside (bool) – If True, ignore external links that are not children of the base URL.

  • exclude_prefixes (Sequence[str]) – Exclude any URLs that start with one of these prefixes.

  • continue_on_failure (bool) – If True, continue if parsing a specific link raises an exception. Otherwise, raise the exception.

Returns

The sub-links found in the HTML, as absolute URLs.

Return type

List[str]
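
Example

The parameter semantics described above can be illustrated with a small standard-library sketch. This is not the langchain_core implementation: sketch_extract_sub_links, DEFAULT_HREF_PATTERN, the example URLs, the fallback of base_url to url, and the sorted return order are all assumptions made for illustration. It only mirrors the documented signature so that the effect of prevent_outside and exclude_prefixes is easy to see.

```python
import re
from typing import List, Optional, Pattern, Sequence, Union
from urllib.parse import urljoin

# Hypothetical default pattern: capture double-quoted href values,
# stopping before any "#fragment". The real default may differ.
DEFAULT_HREF_PATTERN = r'href="([^"#]+)'


def sketch_extract_sub_links(
    raw_html: str,
    url: str,
    *,
    base_url: Optional[str] = None,
    pattern: Optional[Union[str, Pattern]] = None,
    prevent_outside: bool = True,
    exclude_prefixes: Sequence[str] = (),
    continue_on_failure: bool = False,
) -> List[str]:
    """Illustrative stand-in that follows the documented parameters."""
    # Assumption: when base_url is omitted, the page URL serves as the base.
    base = base_url if base_url is not None else url
    regex = re.compile(pattern if pattern is not None else DEFAULT_HREF_PATTERN)

    found = set()
    for link in regex.findall(raw_html):
        try:
            # Resolve relative links against the URL of the page.
            absolute = urljoin(url, link)
        except Exception:
            if continue_on_failure:
                continue  # skip the link that failed to parse
            raise
        # Drop links that start with any excluded prefix.
        if any(absolute.startswith(prefix) for prefix in exclude_prefixes):
            continue
        # Drop links that are not children of the base URL.
        if prevent_outside and not absolute.startswith(base):
            continue
        found.add(absolute)
    # Sorted only to make this sketch deterministic.
    return sorted(found)


html = (
    '<a href="/docs/intro">Intro</a>'
    '<a href="guide/setup">Setup</a>'
    '<a href="https://other.example.org/page">Elsewhere</a>'
    '<a href="/docs/private/keys">Private</a>'
)
links = sketch_extract_sub_links(
    html,
    "https://example.com/docs/",
    exclude_prefixes=("https://example.com/docs/private",),
)
print(links)
```

In this sketch the two relative links are resolved against the page URL, the link to the other host is dropped because prevent_outside defaults to True, and the private link is dropped by exclude_prefixes. Passing prevent_outside=False would keep the external link.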

© 2023, LangChain, Inc. Last updated on Mar 15, 2024.