langchain_community.document_loaders.url_playwright.PlaywrightURLLoader
- class langchain_community.document_loaders.url_playwright.PlaywrightURLLoader(urls: List[str], continue_on_failure: bool = True, headless: bool = True, remove_selectors: Optional[List[str]] = None, evaluator: Optional[PlaywrightEvaluator] = None, proxy: Optional[Dict[str, str]] = None)
Load HTML pages with Playwright and parse with Unstructured.
This is useful for loading pages that require JavaScript to render.
- Parameters
urls (List[str]) –
continue_on_failure (bool) –
headless (bool) –
remove_selectors (Optional[List[str]]) –
evaluator (Optional[PlaywrightEvaluator]) –
proxy (Optional[Dict[str, str]]) –
- urls
List of URLs to load.
- Type
List[str]
- continue_on_failure
If True, continue loading other URLs on failure.
- Type
bool
- headless
If True, the browser will run in headless mode.
- Type
bool
- proxy
If set, the browser will access URLs through the specified proxy.
- Type
Optional[Dict[str, str]]
Example
    from langchain_community.document_loaders import PlaywrightURLLoader

    urls = ["https://api.ipify.org/?format=json",]
    proxy = {
        "server": "https://xx.xx.xx:15818",  # https://<host>:<port>
        "username": "username",
        "password": "password",
    }
    loader = PlaywrightURLLoader(urls, proxy=proxy)
    data = loader.load()
Load a list of URLs using Playwright.
Methods
__init__(urls[, continue_on_failure, ...]) – Load a list of URLs using Playwright.
alazy_load() – Load the specified URLs with Playwright and create Documents asynchronously.
aload() – Load the specified URLs with Playwright and create Documents asynchronously.
lazy_load() – Load the specified URLs using Playwright and create Document instances.
load() – Load data into Document objects.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(urls: List[str], continue_on_failure: bool = True, headless: bool = True, remove_selectors: Optional[List[str]] = None, evaluator: Optional[PlaywrightEvaluator] = None, proxy: Optional[Dict[str, str]] = None)
Load a list of URLs using Playwright.
- Parameters
urls (List[str]) –
continue_on_failure (bool) –
headless (bool) –
remove_selectors (Optional[List[str]]) –
evaluator (Optional[PlaywrightEvaluator]) –
proxy (Optional[Dict[str, str]]) –
- async alazy_load() → AsyncIterator[Document]
Load the specified URLs with Playwright and create Documents asynchronously. Use this function in a Jupyter notebook environment.
- Returns
An async iterator over Document instances with loaded content.
- Return type
AsyncIterator[Document]
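A minimal sketch of consuming `alazy_load()` one Document at a time, so each page can be handled as soon as it is available. `make_loader` and `process_stream` are hypothetical helper names, not part of the library. The loader import sits inside `make_loader` so the rest of the block does not need Playwright installed.

```python
import asyncio
from typing import Any, Callable, List


def make_loader(urls: List[str]) -> Any:
    """Build a PlaywrightURLLoader (needs langchain_community and playwright)."""
    from langchain_community.document_loaders import PlaywrightURLLoader

    return PlaywrightURLLoader(urls, headless=True)


async def process_stream(loader: Any, handle: Callable[[Any], None]) -> int:
    """Pass each Document from loader.alazy_load() to `handle`; return the count."""
    count = 0
    async for doc in loader.alazy_load():
        handle(doc)  # runs as soon as this Document is yielded
        count += 1
    return count
```

In a notebook cell this could be called as `await process_stream(make_loader(urls), print)`. In a plain script, wrap the call in `asyncio.run(...)`.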
- async aload() → List[Document]
Load the specified URLs with Playwright and create Documents asynchronously. Use this function in a Jupyter notebook environment.
- Returns
A list of Document instances with loaded content.
- Return type
List[Document]
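Because `continue_on_failure` defaults to True, a URL that fails is presumably skipped, so the returned list can be shorter than `urls`. The sketch below awaits `aload()` and reports which URLs produced no Document. `load_with_report` is a hypothetical helper, and it assumes each Document stores its origin URL under `metadata["source"]`, which this page does not state.

```python
import asyncio
from typing import Any, Dict, List


async def load_with_report(loader: Any, urls: List[str]) -> Dict[str, List[Any]]:
    """Await loader.aload() and list the URLs that yielded no Document."""
    docs = await loader.aload()
    # Assumption: each Document records its origin URL under metadata["source"].
    loaded = {doc.metadata.get("source") for doc in docs}
    missing = [url for url in urls if url not in loaded]
    return {"documents": docs, "missing": missing}
```

In a notebook: `report = await load_with_report(loader, urls)`. In a script: `report = asyncio.run(load_with_report(loader, urls))`.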
- lazy_load() → Iterator[Document]
Load the specified URLs using Playwright and create Document instances.
- Returns
An iterator over Document instances with loaded content.
- Return type
Iterator[Document]
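As a rough illustration of the iterator form, the sketch below walks `lazy_load()` and yields a small summary per page instead of holding every Document in memory. `iter_page_sizes` is a hypothetical helper, and the `metadata["source"]` key is an assumption about what the loader records.

```python
from typing import Any, Iterator, Tuple


def iter_page_sizes(loader: Any) -> Iterator[Tuple[str, int]]:
    """Yield (source, character count) for each Document as it is produced."""
    for doc in loader.lazy_load():
        # Assumption: metadata["source"] holds the page URL; fall back if absent.
        source = doc.metadata.get("source", "<unknown>")
        yield source, len(doc.page_content)
```

A typical call would be `for source, size in iter_page_sizes(loader): print(source, size)`, where `loader` is a `PlaywrightURLLoader` instance.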
- load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method; it should be considered deprecated.
- Parameters
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns
List of Documents.
- Return type
List[Document]
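Since this method is flagged as deprecated, one possible alternative is to load first and then call the splitter directly. This is a sketch, not the library's prescribed replacement. `load_then_split` and `default_splitter` are hypothetical helpers, and the `langchain_text_splitters` import path is an assumption that depends on the installed version.

```python
from typing import Any, List


def load_then_split(loader: Any, splitter: Any) -> List[Any]:
    """Load all Documents, then split them into chunk Documents."""
    documents = loader.load()
    return splitter.split_documents(documents)


def default_splitter(chunk_size: int = 1000, chunk_overlap: int = 100) -> Any:
    """Build a RecursiveCharacterTextSplitter (needs langchain_text_splitters)."""
    # Assumption: the splitter lives in the langchain_text_splitters package.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
```

Usage would look like `chunks = load_then_split(loader, default_splitter())`. The chunk sizes shown are placeholder values, not recommendations.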