langchain_community.document_loaders.pdf.MathpixPDFLoader

class langchain_community.document_loaders.pdf.MathpixPDFLoader(file_path: str, processed_file_format: str = 'md', max_wait_time_seconds: int = 500, should_clean_pdf: bool = False, extra_request_data: Optional[Dict[str, Any]] = None, **kwargs: Any)

Load PDF files using the Mathpix service.

Initialize with a file path.

Parameters
  • file_path – Path to the PDF file to load.

  • processed_file_format – Format of the processed file. Default is “md”.

  • max_wait_time_seconds – Maximum time, in seconds, to wait for a response from the server. Default is 500.

  • should_clean_pdf – Whether to clean the PDF file. Default is False.

  • extra_request_data – Additional request data.

  • **kwargs – Additional keyword arguments.

Attributes

data

source

url

Methods

__init__(file_path[, processed_file_format, ...])

Initialize with a file path.

clean_pdf(contents)

Clean the PDF file.

get_processed_pdf(pdf_id)

lazy_load()

A lazy loader for Documents.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.

send_pdf()

wait_for_processing(pdf_id)

Wait for processing to complete.

__init__(file_path: str, processed_file_format: str = 'md', max_wait_time_seconds: int = 500, should_clean_pdf: bool = False, extra_request_data: Optional[Dict[str, Any]] = None, **kwargs: Any) -> None

Initialize with a file path.

Parameters
  • file_path – Path to the PDF file to load.

  • processed_file_format – Format of the processed file. Default is “md”.

  • max_wait_time_seconds – Maximum time, in seconds, to wait for a response from the server. Default is 500.

  • should_clean_pdf – Whether to clean the PDF file. Default is False.

  • extra_request_data – Additional request data.

  • **kwargs – Additional keyword arguments.
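A minimal usage sketch of the constructor parameters listed above. It assumes the langchain-community package is installed and that Mathpix credentials are supplied through the MATHPIX_API_ID and MATHPIX_API_KEY environment variables; this page does not say how credentials are passed, so treat that as an assumption. The helper name load_pdf_with_mathpix and the path example.pdf are hypothetical.

```python
import os


def load_pdf_with_mathpix(pdf_path: str):
    """Load one PDF through Mathpix and return the resulting Documents."""
    # Assumption: credentials are read from these environment variables.
    required = ("MATHPIX_API_ID", "MATHPIX_API_KEY")
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing Mathpix credentials: {', '.join(missing)}")

    # Imported lazily so the helper can be defined without the package installed.
    from langchain_community.document_loaders import MathpixPDFLoader

    loader = MathpixPDFLoader(
        pdf_path,
        processed_file_format="md",   # default output format
        max_wait_time_seconds=500,    # default wait budget
        should_clean_pdf=True,        # default is False
    )
    return loader.load()


# Example call (requires network access and valid credentials):
# docs = load_pdf_with_mathpix("example.pdf")
# print(docs[0].page_content[:200])
```

Failing early on missing credentials keeps the error message clearer than a failed request later in the upload.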

clean_pdf(contents: str) -> str

Clean the PDF file.

Parameters

contents – the contents of the PDF file.

Returns

The cleaned contents.
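This page does not specify what cleaning involves. As a rough illustration only, the sketch below shows one kind of cleanup that could be applied to Markdown output: dropping image-link lines and unescaping a few characters. The function clean_mathpix_markdown is hypothetical and is not presented as the library's implementation.

```python
def clean_mathpix_markdown(contents: str) -> str:
    """Hypothetical cleanup of Markdown text; not the library's own code."""
    # Drop lines that start with an image link such as "![](https://...)".
    kept = [line for line in contents.split("\n") if not line.startswith("![]")]
    cleaned = "\n".join(kept)
    # Unescape a few characters that may arrive backslash-escaped.
    for escaped, plain in (("\\$", "$"), ("\\%", "%"), ("\\(", "("), ("\\)", ")")):
        cleaned = cleaned.replace(escaped, plain)
    return cleaned
```

For example, the input "# Title", an image-link line, and "Cost: 5\% of \$10" would come back as "# Title" followed by "Cost: 5% of $10".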

get_processed_pdf(pdf_id: str) -> str

lazy_load() -> Iterator[Document]

A lazy loader for Documents.

load() -> List[Document]

Load data into Document objects.

load_and_split(text_splitter: Optional[TextSplitter] = None) -> List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Parameters

text_splitter – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns

List of Documents.

send_pdf() -> str

wait_for_processing(pdf_id: str) -> None

Wait for processing to complete.

Parameters

pdf_id – the ID of the PDF to wait for.

Returns: None
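Waiting for processing with an upper bound such as max_wait_time_seconds is typically done by polling. The sketch below shows that general pattern, not the loader's actual code. The function wait_until_completed, the fetch_status callable, the 5-second interval, and the "completed" and "error" status strings are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_completed(
    fetch_status: Callable[[], str],
    max_wait_time_seconds: int = 500,
    poll_interval_seconds: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll fetch_status until it reports completion or the budget runs out."""
    # One poll per interval, so the total wait is bounded by the budget.
    for _ in range(0, max_wait_time_seconds, poll_interval_seconds):
        status = fetch_status()
        if status == "completed":  # assumed status value
            return
        if status == "error":  # assumed status value
            raise ValueError("Processing failed")
        sleep(poll_interval_seconds)
    raise TimeoutError(f"Not completed within {max_wait_time_seconds} seconds")
```

Passing sleep as a parameter lets the loop be exercised in tests without real delays. In real use, fetch_status would wrap whatever request reports the processing state for a given pdf_id.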