langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader
- class langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader(url: str, max_depth: Optional[int] = 2, use_async: Optional[bool] = None, extractor: Optional[Callable[[str], str]] = None, metadata_extractor: Optional[Callable[[str, str], str]] = None, exclude_dirs: Optional[Sequence[str]] = (), timeout: Optional[int] = 10, prevent_outside: bool = True, link_regex: Optional[Union[str, Pattern]] = None, headers: Optional[dict] = None, check_response_status: bool = False)
Load all child links from a URL page.
- Security Note: This loader is a crawler that will start crawling
at a given URL and then expand to crawl child links recursively.
Web crawlers should generally NOT be deployed with network access to any internal servers.
Control who can submit crawling requests and what network access the crawler has.
While crawling, the crawler may encounter malicious URLs that would lead to a server-side request forgery (SSRF) attack.
To mitigate risks, the crawler by default only loads URLs from the same domain as the start URL (controlled via the prevent_outside named argument).
This will mitigate the risk of SSRF attacks, but will not eliminate it.
For example, if crawling a host which hosts several sites:
https://some_host/alice_site/
https://some_host/bob_site/
A malicious URL on Alice’s site could cause the crawler to make a malicious GET request to an endpoint on Bob’s site. Both sites are hosted on the same host, so such a request would not be prevented by default.
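The scenario above can be illustrated with a small stand-alone sketch. It is not the loader's own check: same_host and is_child_path are hypothetical helpers written only for this example, and the /admin endpoint is made up. The sketch shows why a host-level comparison cannot tell Alice's site from Bob's, while a stricter path-prefix comparison can.

```python
from urllib.parse import urlparse


def same_host(start_url: str, candidate_url: str) -> bool:
    """Return True when both URLs share a scheme and network location."""
    start, candidate = urlparse(start_url), urlparse(candidate_url)
    return (start.scheme, start.netloc) == (candidate.scheme, candidate.netloc)


def is_child_path(start_url: str, candidate_url: str) -> bool:
    """Stricter hypothetical check: same host AND path under the start path."""
    if not same_host(start_url, candidate_url):
        return False
    start_path = urlparse(start_url).path.rstrip("/") + "/"
    return urlparse(candidate_url).path.startswith(start_path)


alice = "https://some_host/alice_site/"
bob_endpoint = "https://some_host/bob_site/admin"  # made-up endpoint

# A host-only comparison accepts Bob's endpoint when starting from Alice's site.
host_check = same_host(alice, bob_endpoint)

# A path-prefix comparison rejects it.
path_check = is_child_path(alice, bob_endpoint)
```

Neither check replaces network-level isolation: the crawler should still run without access to internal servers.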
Initialize with URL to crawl and any subdirectories to exclude.
- Parameters
url – The URL to crawl.
max_depth – The max depth of the recursive loading.
use_async – Whether to use asynchronous loading. If True, lazy_load() still returns the expected documents, but they are not loaded lazily.
extractor – A function to extract document contents from raw HTML. When the extractor returns an empty string, the document is ignored.
metadata_extractor – A function to extract metadata from raw html and the source url (args in that order). Default extractor will attempt to use BeautifulSoup4 to extract the title, description and language of the page.
exclude_dirs – A list of subdirectories to exclude.
timeout – The timeout for the requests, in seconds. If None, the connection will not time out.
prevent_outside – If True, prevent loading from URLs which are not children of the root URL.
link_regex – Regex for extracting sub-links from the raw html of a web page.
check_response_status – If True, check HTTP response status and skip URLs with error responses (400-599).
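As a minimal sketch of how these parameters fit together, the block below defines a custom extractor using only the standard library and passes it to the loader. text_extractor, _TextCollector and build_loader are hypothetical names introduced for this example, and the argument values are arbitrary choices, not recommendations. The loader import sits inside build_loader so the extractor can be tried without the package installed.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def text_extractor(raw_html: str) -> str:
    """Matches the extractor shape Callable[[str], str]: raw HTML in, text out."""
    collector = _TextCollector()
    collector.feed(raw_html)
    collector.close()
    # An empty result means the loader ignores the page (see extractor above).
    return " ".join(collector.parts)


def build_loader(start_url: str):
    """Construct a loader using only parameters documented above."""
    # Imported lazily so text_extractor can be used without the package.
    from langchain_community.document_loaders.recursive_url_loader import (
        RecursiveUrlLoader,
    )

    return RecursiveUrlLoader(
        url=start_url,
        max_depth=2,
        extractor=text_extractor,
        timeout=10,
        prevent_outside=True,
        check_response_status=True,
    )
```

Calling build_loader("https://some_host/alice_site/").load() would then crawl from that start URL; the URL is the placeholder from the security note, not a real site.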
Methods
__init__(url[, max_depth, use_async, ...]) – Initialize with URL to crawl and any subdirectories to exclude.
lazy_load() – Lazy load web pages.
load() – Load web pages.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(url: str, max_depth: Optional[int] = 2, use_async: Optional[bool] = None, extractor: Optional[Callable[[str], str]] = None, metadata_extractor: Optional[Callable[[str, str], str]] = None, exclude_dirs: Optional[Sequence[str]] = (), timeout: Optional[int] = 10, prevent_outside: bool = True, link_regex: Optional[Union[str, Pattern]] = None, headers: Optional[dict] = None, check_response_status: bool = False) → None
Initialize with URL to crawl and any subdirectories to exclude.
- lazy_load() → Iterator[Document]
Lazy load web pages. When use_async is True, this method still returns the expected documents, but they are not loaded lazily.
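To show what lazy consumption means for the caller, the sketch below takes only the first few items from an iterator returned by lazy_load(). FakeLoader is a hypothetical stand-in that performs no network I/O and yields strings rather than Documents; first_n is likewise a helper invented for this example. With a real loader, the same pattern would apply when use_async is not True.

```python
from itertools import islice
from typing import Iterator, List


class FakeLoader:
    """Hypothetical stand-in exposing lazy_load(); it performs no network I/O."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.fetched = 0  # counts how many pages were actually produced

    def lazy_load(self) -> Iterator[str]:
        for page in self.pages:
            self.fetched += 1
            yield page


def first_n(loader, n: int) -> list:
    """Consume at most n items from loader.lazy_load()."""
    return list(islice(loader.lazy_load(), n))


fake = FakeLoader(["page-1", "page-2", "page-3", "page-4"])
head = first_n(fake, 2)

# Only the requested items were produced; the rest were never generated.
print(head, fake.fetched)
```

Because the iterator is consumed on demand, stopping early avoids producing the remaining items; a non-lazy call such as load() would produce all of them first.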
- load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
- Parameters
text_splitter – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns
List of Documents.