langchain_community.document_loaders.sitemap.SitemapLoader
- class langchain_community.document_loaders.sitemap.SitemapLoader(web_path: str, filter_urls: Optional[List[str]] = None, parsing_function: Optional[Callable] = None, blocksize: Optional[int] = None, blocknum: int = 0, meta_function: Optional[Callable] = None, is_local: bool = False, continue_on_failure: bool = False, restrict_to_same_domain: bool = True, **kwargs: Any)
Load a sitemap and its URLs.
- Security Note: This loader can be used to load all URLs specified in a sitemap.
If a malicious actor gains access to the sitemap, they could modify it to force the server to load URLs from other domains. This could lead to server-side request forgery (SSRF) attacks, for example by forcing the server to load URLs from internal service endpoints that are not publicly accessible. Even if the attacker does not immediately gain access to this data, it could leak into downstream systems (e.g., when the loader feeds data into an index).
This loader is a crawler, and web crawlers should generally NOT be deployed with network access to any internal servers.
Control who can submit crawling requests and what network access the crawler has.
By default, the loader only loads URLs from the same domain as the sitemap, provided the sitemap is not a local file. This can be disabled by setting restrict_to_same_domain to False (not recommended).
If the sitemap is a local file, no such risk mitigation is applied by default.
Use the filter_urls argument to limit which URLs can be loaded.
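To make the same-domain restriction concrete, here is a minimal sketch of the kind of check it implies. It is not the loader's own code: the helper name same_origin and the choice to compare scheme and network location are assumptions made for illustration, and the URLs are placeholders.

```python
from urllib.parse import urlparse


def same_origin(sitemap_url: str, candidate_url: str) -> bool:
    """Hypothetical helper: True when scheme and host[:port] both match."""
    a, b = urlparse(sitemap_url), urlparse(candidate_url)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)


sitemap = "https://example.com/sitemap.xml"
entries = [
    "https://example.com/docs/intro",
    "http://169.254.169.254/latest/meta-data/",  # internal endpoint an attacker might inject
    "https://other.example.net/page",  # a different domain
]

# Only entries that share the sitemap's origin survive the check.
allowed = [url for url in entries if same_origin(sitemap, url)]
print(allowed)
```

A check like this only limits which hosts are contacted; it does not replace network-level isolation of the crawler.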
Initialize with webpage path and optional filter URLs.
- Parameters
web_path (str) – URL of the sitemap. Can also be a local path.
filter_urls (Optional[List[str]]) – a list of regexes. If specified, only URLs that match one of the regexes will be loaded. WARNING: The filter URLs are interpreted as regular expressions. Escape special characters if you do not want them interpreted as regular expression syntax. For example, . appears frequently in URLs and should be escaped if you want to match a literal . rather than any character. restrict_to_same_domain takes precedence over filter_urls when restrict_to_same_domain is True and the sitemap is not a local file.
parsing_function (Optional[Callable]) – function to parse the BeautifulSoup (bs4) output
blocksize (Optional[int]) – number of sitemap locations per block
blocknum (int) – index of the block to load (zero-indexed). Default: 0
meta_function (Optional[Callable]) – function to parse the BeautifulSoup (bs4) output for metadata. When setting this function, remember to also copy metadata["loc"] to metadata["source"] if you are using this field.
is_local (bool) – whether the sitemap is a local file. Default: False
continue_on_failure (bool) – whether to continue loading the sitemap if an error occurs while loading a URL, emitting a warning instead of raising an exception. Setting this to True makes the loader more robust, but may also result in missing data. Default: False
restrict_to_same_domain (bool) – whether to restrict loading to URLs on the same domain as the sitemap. Attention: This is only applied if the sitemap is not a local file! Default: True
kwargs (Any) –
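The following sketch illustrates three of the parameters above using only the standard library: why filter_urls patterns should be escaped, how blocksize and blocknum pick out one slice of the sitemap, and what a meta_function that copies "loc" to "source" might look like. The helpers matching, select_block, meta_with_source and make_loader are hypothetical names. That patterns are matched from the start of the URL, and that meta_function receives the sitemap entry and the page content as two arguments, are assumptions rather than documented behavior. The URLs are placeholders.

```python
import re

urls = [
    "https://example.com/docs/a",
    "https://example.com/docs/b",
    "https://exampleXcom/docs/c",  # differs from example.com by one character
    "https://example.com/blog/d",
]

# Unescaped: "." matches any character, so "exampleXcom" also passes.
unescaped = "https://example.com/docs/"
# Escaped: re.escape turns "." into "\." so only a literal dot matches.
escaped = re.escape("https://example.com/docs/")


def matching(candidates, patterns):
    """Keep URLs matching at least one pattern (assumed: matched from the start)."""
    return [u for u in candidates if any(re.match(p, u) for p in patterns)]


def select_block(items, blocksize, blocknum):
    """Return the zero-indexed block `blocknum` containing up to `blocksize` items."""
    start = blocknum * blocksize
    return items[start:start + blocksize]


def meta_with_source(meta: dict, _content) -> dict:
    """Hypothetical meta_function: keep sitemap fields and copy "loc" to "source"."""
    return {**meta, "source": meta["loc"]}


def make_loader():
    """Build a loader with the values above (requires langchain_community)."""
    from langchain_community.document_loaders.sitemap import SitemapLoader

    return SitemapLoader(
        "https://example.com/sitemap.xml",
        filter_urls=[escaped],
        blocksize=2,
        blocknum=0,
        meta_function=meta_with_source,
    )


loose = matching(urls, [unescaped])
strict = matching(urls, [escaped])
first_block = select_block(strict, blocksize=2, blocknum=0)
print(loose)
print(strict)
print(first_block)
```

make_loader is defined but not called here, so the sketch runs without the library installed and without network access.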
Attributes
web_path
Methods
__init__(web_path[, filter_urls, ...]) – Initialize with webpage path and optional filter URLs.
alazy_load() – A lazy loader for Documents.
aload() – Load text from the URLs in web_path asynchronously into Documents.
fetch_all(urls) – Fetch all URLs concurrently with rate limiting.
lazy_load() – Load sitemap.
load() – Load data into Document objects.
load_and_split([text_splitter]) – Load Documents and split into chunks.
parse_sitemap(soup) – Parse sitemap XML and load into a list of dicts.
scrape([parser]) – Scrape data from webpage and return it in BeautifulSoup format.
scrape_all(urls[, parser]) – Fetch all URLs, then return soups for all results.
- __init__(web_path: str, filter_urls: Optional[List[str]] = None, parsing_function: Optional[Callable] = None, blocksize: Optional[int] = None, blocknum: int = 0, meta_function: Optional[Callable] = None, is_local: bool = False, continue_on_failure: bool = False, restrict_to_same_domain: bool = True, **kwargs: Any)
Initialize with webpage path and optional filter URLs.
- async alazy_load() → AsyncIterator[Document]
A lazy loader for Documents.
- Return type
AsyncIterator[Document]
- aload() → List[Document]
Load text from the URLs in web_path asynchronously into Documents.
- Return type
List[Document]
- async fetch_all(urls: List[str]) → Any
Fetch all URLs concurrently with rate limiting.
- Parameters
urls (List[str]) –
- Return type
Any
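As a rough illustration of what fetching URLs concurrently with a limit on parallelism can look like, the sketch below uses an asyncio.Semaphore. It is not the loader's implementation: fake_fetch stands in for a real HTTP request, and fetch_all_limited and max_concurrent are hypothetical names chosen for this example.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request; sleeps briefly and returns dummy HTML."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_all_limited(urls, max_concurrent=2):
    """Fetch all URLs concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        # Each task waits for a free slot before starting its fetch.
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


pages = asyncio.run(
    fetch_all_limited(
        [
            "https://example.com/a",
            "https://example.com/b",
            "https://example.com/c",
        ]
    )
)
print(len(pages))
```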
- load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered deprecated.
- Parameters
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns
List of Documents.
- Return type
List[Document]
- parse_sitemap(soup: Any) → List[dict]
Parse sitemap XML and load into a list of dicts.
- Parameters
soup (Any) – BeautifulSoup object.
- Returns
List of dicts.
- Return type
List[dict]
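To show the shape of data that parsing a sitemap into a list of dicts produces, here is a stand-in sketch using xml.etree.ElementTree from the standard library instead of BeautifulSoup. The function parse_sitemap_xml is hypothetical and takes raw XML text rather than a soup object. The sample sitemap is made up, and the exact keys returned by the real method are not confirmed here.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/a</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/docs/b</loc>
  </url>
</urlset>
"""


def parse_sitemap_xml(xml_text: str) -> list:
    """Turn each <url> element into a dict of its child tags and text."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for child in url:
            # Strip the "{namespace}" prefix that ElementTree adds to tag names.
            tag = child.tag.split("}", 1)[-1]
            if child.text:
                entry[tag] = child.text.strip()
        entries.append(entry)
    return entries


entries = parse_sitemap_xml(SITEMAP_XML)
print(entries)
```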
- scrape(parser: Optional[str] = None) → Any
Scrape data from webpage and return it in BeautifulSoup format.
- Parameters
parser (Optional[str]) –
- Return type
Any
- scrape_all(urls: List[str], parser: Optional[str] = None) → List[Any]
Fetch all URLs, then return soups for all results.
- Parameters
urls (List[str]) –
parser (Optional[str]) –
- Return type
List[Any]