langchain_core.utils.html.extract_sub_links
- langchain_core.utils.html.extract_sub_links(raw_html: str, url: str, *, base_url: Optional[str] = None, pattern: Optional[Union[str, Pattern]] = None, prevent_outside: bool = True, exclude_prefixes: Sequence[str] = ()) → List[str]
Extract all links from a raw HTML string and convert them into absolute URLs.
- Parameters
raw_html – the original HTML.
url – the URL of the page the HTML came from.
base_url – the base URL against which links are checked to decide whether they are outside links.
pattern – regex to use for extracting links from the raw HTML.
prevent_outside – if True, ignore external links that are not children of the base URL.
exclude_prefixes – exclude any URLs that start with one of these prefixes.
- Returns
The extracted sub-links, as absolute URLs.
- Return type
List[str]
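To make the parameter semantics above concrete, here is a minimal standard-library sketch of the behaviour they describe. It is not the library's implementation: the function name `extract_sub_links_sketch`, the `HREF_PATTERN` regex, the example URLs, and the decision to fall back to `url` when `base_url` is None are all assumptions made for illustration, and the `pattern` parameter is omitted for brevity.

```python
import re
from typing import List, Optional, Sequence
from urllib.parse import urljoin, urlparse

# Hypothetical pattern for this sketch: captures the value of href="..." or href='...'.
HREF_PATTERN = re.compile(r"""href=["']([^"']+)["']""")


def extract_sub_links_sketch(
    raw_html: str,
    url: str,
    *,
    base_url: Optional[str] = None,
    prevent_outside: bool = True,
    exclude_prefixes: Sequence[str] = (),
) -> List[str]:
    """Simplified stand-in for illustration only, not the library's code."""
    # Assumption: when no base URL is given, compare against the page URL itself.
    base = base_url if base_url is not None else url
    seen = set()
    results: List[str] = []
    for href in HREF_PATTERN.findall(raw_html):
        # Resolve relative links against the URL of the page.
        absolute = urljoin(url, href)
        # Skip non-web schemes such as mailto: or javascript:.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        # Drop anything that starts with an excluded prefix.
        if any(absolute.startswith(prefix) for prefix in exclude_prefixes):
            continue
        # Optionally drop links that are not children of the base URL.
        if prevent_outside and not absolute.startswith(base):
            continue
        # De-duplicate while keeping first-seen order (a choice of this sketch).
        if absolute not in seen:
            seen.add(absolute)
            results.append(absolute)
    return results


html = (
    '<a href="/docs/intro">Intro</a> '
    '<a href="guide.html">Guide</a> '
    '<a href="https://other.example/x">Other</a> '
    '<a href="/docs/private/key">Key</a>'
)

links = extract_sub_links_sketch(
    html,
    "https://example.com/docs/index.html",
    base_url="https://example.com/docs/",
    exclude_prefixes=("https://example.com/docs/private",),
)
print(links)
```

In this sketch the two relative links are resolved against the page URL and kept, the link to `other.example` is dropped because `prevent_outside` is True, and the `/docs/private/key` link is dropped by `exclude_prefixes`. Passing `prevent_outside=False` would keep the external link as well.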