langchain_community.document_loaders.mediawikidump.MWDumpLoader
- class langchain_community.document_loaders.mediawikidump.MWDumpLoader(file_path: Union[str, Path], encoding: Optional[str] = 'utf8', namespaces: Optional[Sequence[int]] = None, skip_redirects: Optional[bool] = False, stop_on_error: Optional[bool] = True)
Load MediaWiki dump from an XML file.
Example
    from langchain_community.document_loaders import MWDumpLoader

    loader = MWDumpLoader(
        file_path="myWiki.xml",
        encoding="utf8"
    )
    docs = loader.load()

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=0
    )
    texts = text_splitter.split_documents(docs)
- Parameters
file_path (str or Path) – Path to the local XML dump file
encoding (str, optional) – Charset encoding, defaults to “utf8”
namespaces (Sequence[int], optional) – The namespaces of the pages you want to parse. See https://www.mediawiki.org/wiki/Help:Namespaces#Localisation for a list of all common namespaces
skip_redirects (bool, optional) – True to skip pages that redirect to other pages, False to keep them. False by default
stop_on_error (bool, optional) – False to skip over pages that cause parsing errors, True to stop. True by default
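As a rough illustration of how the parameters above fit together, the sketch below writes a tiny hand-made dump to disk and then builds a loader restricted to namespace 0 (the main article namespace in MediaWiki), skipping redirects and continuing past pages that fail to parse. The helper names write_sample_dump and load_articles, the file name, and the sample XML are all hypothetical; the XML is a minimal export-style document that is assumed, but not verified here, to be acceptable to the loader's parser. Running load_articles also assumes the loader's parsing dependencies (mwxml and mwparserfromhell) are installed.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical, hand-written sample: one article, one talk page, one redirect.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Example</sitename>
    <dbname>example</dbname>
    <namespaces>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>Main Page</title>
    <ns>0</ns>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2024-01-01T00:00:00Z</timestamp>
      <text xml:space="preserve">Welcome to the example wiki.</text>
    </revision>
  </page>
  <page>
    <title>Talk:Main Page</title>
    <ns>1</ns>
    <id>2</id>
    <revision>
      <id>11</id>
      <timestamp>2024-01-01T00:00:00Z</timestamp>
      <text xml:space="preserve">Discussion about the main page.</text>
    </revision>
  </page>
  <page>
    <title>Home</title>
    <ns>0</ns>
    <id>3</id>
    <redirect title="Main Page" />
    <revision>
      <id>12</id>
      <timestamp>2024-01-01T00:00:00Z</timestamp>
      <text xml:space="preserve">#REDIRECT [[Main Page]]</text>
    </revision>
  </page>
</mediawiki>
"""


def write_sample_dump(path):
    """Write the hypothetical sample dump to `path` and return it as a Path."""
    path = Path(path)
    path.write_text(SAMPLE_DUMP, encoding="utf8")
    return path


def load_articles(path):
    """Load only main-namespace pages, skipping redirects and bad pages."""
    # Imported lazily so the sample-writing helper works without langchain.
    from langchain_community.document_loaders import MWDumpLoader

    loader = MWDumpLoader(
        file_path=path,
        encoding="utf8",
        namespaces=[0],        # 0 = main article namespace
        skip_redirects=True,   # drop pages that redirect elsewhere
        stop_on_error=False,   # skip pages that fail to parse
    )
    return loader.load()
```

Calling load_articles(write_sample_dump("sample_wiki.xml")) would then be one way to try these options on a small input before pointing the loader at a full dump.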
Methods
__init__(file_path[, encoding, namespaces, ...]) – Lazy load from a file path.
load() – Load from a file path.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(file_path: Union[str, Path], encoding: Optional[str] = 'utf8', namespaces: Optional[Sequence[int]] = None, skip_redirects: Optional[bool] = False, stop_on_error: Optional[bool] = True)
- load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
- Parameters
text_splitter – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns
List of Documents.