langchain_community.document_loaders.parsers.docai.DocAIParser
- class langchain_community.document_loaders.parsers.docai.DocAIParser(*, client: Optional[DocumentProcessorServiceClient] = None, location: Optional[str] = None, gcs_output_path: Optional[str] = None, processor_name: Optional[str] = None)
- Google Cloud Document AI parser.
- For a detailed explanation of Document AI, refer to the product documentation: https://cloud.google.com/document-ai/docs/overview
- Initializes the parser.
- Parameters
- client – a DocumentProcessorServiceClient to use 
- location – a Google Cloud location where a Document AI processor is located 
- gcs_output_path – a path on Google Cloud Storage to store parsing results 
- processor_name – full resource name of a Document AI processor or processor version 
 
 - You should provide either a client or a location (in which case a client is instantiated for you).
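As a rough illustration of the parameters above, the sketch below assembles constructor arguments from a project, a location, and a processor ID. The helper names `docai_parser_kwargs` and `build_parser` and every example value are hypothetical. The `projects/.../locations/.../processors/...` layout follows Document AI's resource naming convention, and the `gs://` output path is an assumed example. `build_parser` imports `DocAIParser` lazily, so the first helper can be exercised without Google Cloud libraries or credentials.

```python
from typing import Dict


def docai_parser_kwargs(
    project_id: str, location: str, processor_id: str, gcs_output_path: str
) -> Dict[str, str]:
    """Assemble keyword arguments for DocAIParser (hypothetical helper)."""
    return {
        # A location is passed here instead of a ready-made client.
        "location": location,
        "gcs_output_path": gcs_output_path,
        # Full resource name of the processor (assumed naming layout).
        "processor_name": (
            f"projects/{project_id}/locations/{location}/processors/{processor_id}"
        ),
    }


def build_parser(**kwargs: str):
    """Create the parser; requires langchain-community and the Document AI client."""
    from langchain_community.document_loaders.parsers.docai import DocAIParser

    return DocAIParser(**kwargs)


# Placeholder values, not real resources.
kwargs = docai_parser_kwargs("123456", "us", "abc123", "gs://my-bucket/docai-out")
```

Calling `build_parser(**kwargs)` in an environment with credentials configured would then construct the parser from these values.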
 - Methods
 - __init__(*[, client, location, ...]) – Initializes the parser.
 - batch_parse(blobs[, gcs_output_path, ...]) – Parses a list of blobs lazily.
 - docai_parse(blobs, *[, gcs_output_path, ...]) – Runs Google Document AI PDF Batch Processing on a list of blobs.
 - get_results(operations)
 - is_running(operations)
 - lazy_parse(blob) – Parses a blob lazily.
 - online_process(blob[, ...]) – Parses a blob lazily using online processing.
 - operations_from_names(operation_names) – Initializes Long-Running Operations from their names.
 - parse(blob) – Eagerly parse the blob into a document or documents.
 - parse_from_results(results)

 - __init__(*, client: Optional[DocumentProcessorServiceClient] = None, location: Optional[str] = None, gcs_output_path: Optional[str] = None, processor_name: Optional[str] = None)
- Initializes the parser.
- Parameters
- client – a DocumentProcessorServiceClient to use 
- location – a Google Cloud location where a Document AI processor is located 
- gcs_output_path – a path on Google Cloud Storage to store parsing results 
- processor_name – full resource name of a Document AI processor or processor version 
 
 - You should provide either a client or a location (in which case a client is instantiated for you).
 
 - batch_parse(blobs: Sequence[Blob], gcs_output_path: Optional[str] = None, timeout_sec: int = 3600, check_in_interval_sec: int = 60) → Iterator[Document]
- Parses a list of blobs lazily.
- Parameters
- blobs – a list of blobs to parse. 
- gcs_output_path – a path on Google Cloud Storage to store parsing results. 
- timeout_sec – a timeout to wait for Document AI to complete, in seconds. 
- check_in_interval_sec – the interval between checks for whether the parsing operations have completed, in seconds. 
 
 - This is a long-running operation. The recommended approach is to decouple parsing from creating LangChain Documents:
 >>> operations = parser.docai_parse(blobs, gcs_output_path=gcs_path)
 >>> parser.is_running(operations)
 You can get the operation names and save them:
 >>> names = [op.operation.name for op in operations]
 When all operations have finished, you can use their results:
 >>> operations = parser.operations_from_names(names)
 >>> results = parser.get_results(operations)
 >>> docs = parser.parse_from_results(results)
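The roles of timeout_sec and check_in_interval_sec can be pictured with a small polling loop. This is a minimal sketch under stated assumptions, not the library's implementation of batch_parse: `wait_until_done` is a hypothetical helper, and the `is_running` argument is any zero-argument callable, such as a wrapper around `parser.is_running(operations)`.

```python
import time
from typing import Callable


def wait_until_done(
    is_running: Callable[[], bool],
    timeout_sec: int = 3600,
    check_in_interval_sec: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until is_running() returns False and return the seconds waited.

    Raises TimeoutError if work is still running once timeout_sec has elapsed.
    """
    waited = 0
    while is_running():
        if waited >= timeout_sec:
            raise TimeoutError(f"operations still running after {waited} seconds")
        sleep(check_in_interval_sec)
        waited += check_in_interval_sec
    return waited


# Stand-in for parser.is_running(operations): "running" twice, then finished.
states = iter([True, True, False])
# A no-op sleep keeps the demo instantaneous.
waited = wait_until_done(lambda: next(states), sleep=lambda _: None)
```

With a real parser, the call would look like `wait_until_done(lambda: parser.is_running(operations))`, after which the results can be fetched and converted to documents.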
 
 - docai_parse(blobs: Sequence[Blob], *, gcs_output_path: Optional[str] = None, processor_name: Optional[str] = None, batch_size: int = 1000, enable_native_pdf_parsing: bool = True, field_mask: Optional[str] = None) → List[Operation]
- Runs Google Document AI PDF Batch Processing on a list of blobs.
- Parameters
- blobs – a list of blobs to be parsed 
- gcs_output_path – a path (folder) on GCS to store results 
- processor_name – name of a Document AI processor. 
- batch_size – number of documents per batch 
- enable_native_pdf_parsing – whether to enable extraction of text embedded in the PDF 
- field_mask – a comma-separated list of fields to include in the Document AI response. Suggested value: “text,pages.pageNumber,pages.layout” 
 
 - Document AI has a limit of 1000 files per batch, so larger sets of blobs need to be split into multiple requests. Batch processing is an asynchronous long-running operation, and results are stored in an output GCS bucket. 
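The splitting described above amounts to slicing the input into consecutive chunks of at most batch_size items. The following is a minimal sketch of that idea, not the library's own code; `split_into_batches` is a hypothetical helper name.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], batch_size: int = 1000) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# 2500 placeholder items end up in three batches.
sizes = [len(batch) for batch in split_into_batches(list(range(2500)))]
```

Each chunk would then correspond to one batch request and one long-running operation.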
 - get_results(operations: List[Operation]) → List[DocAIParsingResults]
 - lazy_parse(blob: Blob) → Iterator[Document]
- Parses a blob lazily.
- Parameters
- blob – a Blob to parse 
 - This is a long-running operation. The recommended approach is to batch documents together and use the batch_parse() method. 
 
 - online_process(blob: Blob, enable_native_pdf_parsing: bool = True, field_mask: Optional[str] = None, page_range: Optional[List[int]] = None) → Iterator[Document]
- Parses a blob lazily using online processing.
- Parameters
- blob – a blob to parse. 
- enable_native_pdf_parsing – whether to enable extraction of text embedded in the PDF 
- field_mask – a comma-separated list of fields to include in the Document AI response. Suggested value: “text,pages.pageNumber,pages.layout” 
- page_range – list of page numbers to parse. If None, the entire document is parsed. 
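To show how field_mask and page_range fit together in a call, here is a small sketch that works with any object exposing an online_process method with the signature above. `build_field_mask`, `parse_pages`, and `RecordingParser` are hypothetical names. The stand-in parser exists only so that the example runs without Google Cloud access, and the page numbers are arbitrary example values.

```python
from typing import Any, Iterable, List, Optional

# The field list suggested in the parameter description above.
SUGGESTED_FIELDS = ("text", "pages.pageNumber", "pages.layout")


def build_field_mask(fields: Iterable[str] = SUGGESTED_FIELDS) -> str:
    """Join field names into the comma-separated form field_mask expects."""
    return ",".join(fields)


def parse_pages(parser: Any, blob: Any, page_range: Optional[List[int]] = None) -> list:
    """Run online_process on the given pages and collect the yielded documents."""
    return list(
        parser.online_process(
            blob, field_mask=build_field_mask(), page_range=page_range
        )
    )


class RecordingParser:
    """Stand-in for this demo only; it is not part of LangChain."""

    def __init__(self) -> None:
        self.calls: List[dict] = []

    def online_process(
        self, blob, enable_native_pdf_parsing=True, field_mask=None, page_range=None
    ):
        # Record the arguments, then yield one fake "document".
        self.calls.append({"field_mask": field_mask, "page_range": page_range})
        yield f"parsed:{blob}"


fake = RecordingParser()
docs = parse_pages(fake, "report.pdf", page_range=[1, 2])
```

With a real DocAIParser and Blob in place of the stand-ins, `parse_pages` would return the parsed LangChain documents for the requested pages.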
 
 
 - operations_from_names(operation_names: List[str]) → List[Operation]
- Initializes Long-Running Operations from their names. 
 - parse(blob: Blob) → List[Document]
- Eagerly parse the blob into a document or documents.
- This is a convenience method for interactive development environments.
- Production applications should favor the lazy_parse method instead.
- Subclasses should generally not override this parse method.
- Parameters
- blob – Blob instance 
- Returns
- List of documents 
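The difference between eager and lazy parsing can be seen with a toy example. `ToyParser` below is a hypothetical stand-in that splits text into lines; it only mirrors the relationship between the two methods, on the assumption that eager parsing is equivalent to collecting the lazy iterator into a list.

```python
from typing import Iterator, List


class ToyParser:
    """Toy stand-in (not a LangChain class) that treats each line as a document."""

    def __init__(self) -> None:
        self.lines_produced = 0

    def lazy_parse(self, text: str) -> Iterator[str]:
        # Work happens one item at a time, only when the caller asks for it.
        for line in text.splitlines():
            self.lines_produced += 1
            yield line

    def parse(self, text: str) -> List[str]:
        # Eager variant: consume the whole iterator up front.
        return list(self.lazy_parse(text))


parser = ToyParser()
first = next(parser.lazy_parse("a\nb\nc"))  # pulls a single item
produced_lazily = parser.lines_produced
everything = parser.parse("a\nb\nc")  # materializes all items
```

The lazy form lets a caller stop early or stream results, which is why it is the better fit for production pipelines.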
 
 - parse_from_results(results: List[DocAIParsingResults]) → Iterator[Document]