langchain_core.utils.html.extract_sub_links
- langchain_core.utils.html.extract_sub_links(raw_html: str, url: str, *, base_url: Optional[str] = None, pattern: Optional[Union[str, Pattern]] = None, prevent_outside: bool = True, exclude_prefixes: Sequence[str] = (), continue_on_failure: bool = False) → List[str]
Extract all links from a raw HTML string and convert them into absolute URLs.
- Parameters
raw_html (str) – The original HTML.
url (str) – The URL of the HTML.
base_url (Optional[str]) – The base URL to check for outside links against.
pattern (Optional[Union[str, Pattern]]) – Regex to use for extracting links from the raw HTML.
prevent_outside (bool) – If True, ignore external links which are not children of the base URL.
exclude_prefixes (Sequence[str]) – Exclude any URLs that start with one of these prefixes.
continue_on_failure (bool) – If True, continue if parsing a specific link raises an exception. Otherwise, raise the exception.
- Returns
The sub-links found in the HTML, as absolute URLs.
- Return type
List[str]
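
- Example
The following is a minimal standard-library sketch of the behaviour the parameters above describe: find href values, resolve them against url, then filter by exclude_prefixes and prevent_outside. It is not the library's implementation. The name sketch_extract_sub_links, the HREF_PATTERN regex, the de-duplication step, and the example.com URLs are all assumptions made for illustration, and the pattern and continue_on_failure parameters are left out for brevity.

```python
import re
from typing import List, Optional, Sequence
from urllib.parse import urljoin, urlparse

# Hypothetical pattern for this sketch: capture the value of href="..." or href='...'.
HREF_PATTERN = re.compile(r"""href=["']([^"'#]+)["']""")


def sketch_extract_sub_links(
    raw_html: str,
    url: str,
    *,
    base_url: Optional[str] = None,
    prevent_outside: bool = True,
    exclude_prefixes: Sequence[str] = (),
) -> List[str]:
    """Illustrative stand-in, not langchain_core's extract_sub_links."""
    # Assumption: when no base_url is given, the page URL acts as the base.
    base = base_url if base_url is not None else url
    seen = set()
    results: List[str] = []
    for href in HREF_PATTERN.findall(raw_html):
        # Resolve relative links against the URL of the page.
        absolute = urljoin(url, href)
        # Skip non-web schemes such as mailto:.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        # Drop anything that starts with an excluded prefix.
        if any(absolute.startswith(prefix) for prefix in exclude_prefixes):
            continue
        # Drop links that are not children of the base URL.
        if prevent_outside and not absolute.startswith(base):
            continue
        # Keep first occurrences only, preserving document order.
        if absolute not in seen:
            seen.add(absolute)
            results.append(absolute)
    return results


html = (
    '<a href="/docs/a">A</a> <a href="b.html">B</a> '
    '<a href="https://other.example/x">X</a> '
    '<a href="/docs/private/c">C</a> '
    '<a href="mailto:me@example.com">M</a>'
)
links = sketch_extract_sub_links(
    html,
    "https://example.com/docs/index.html",
    base_url="https://example.com/docs/",
    exclude_prefixes=("https://example.com/docs/private",),
)
print(links)
# ['https://example.com/docs/a', 'https://example.com/docs/b.html']
```

In this sketch the external link is dropped because prevent_outside is True, and the /docs/private link is dropped by exclude_prefixes. Passing prevent_outside=False would keep the external link as well.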