`langchain_community.document_loaders.parsers.docai`.DocAIParser¶

class langchain_community.document_loaders.parsers.docai.DocAIParser(*, client: Optional[DocumentProcessorServiceClient] = None, location: Optional[str] = None, gcs_output_path: Optional[str] = None, processor_name: Optional[str] = None)[source]¶

Google Cloud Document AI parser.

For a detailed explanation of Document AI, refer to the product documentation: https://cloud.google.com/document-ai/docs/overview

Initializes the parser.

Parameters

client – a DocumentProcessorServiceClient to use
location – a Google Cloud location where a Document AI processor is located
gcs_output_path – a path on Google Cloud Storage to store parsing results
processor_name – full resource name of a Document AI processor or processor version

You should provide either a client or a location (in which case a client is instantiated for that location).
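As a minimal sketch of the location-based setup described above: the helper below assembles a full processor resource name in the standard Document AI form and passes it to the constructor. The helper names `processor_resource_name` and `make_parser`, and the example project and processor IDs, are hypothetical and not part of the library; only the `DocAIParser` keyword arguments come from the signature above.

```python
from typing import Optional


def processor_resource_name(
    project: str, location: str, processor_id: str, version: Optional[str] = None
) -> str:
    """Build a full Document AI processor (or processor version) resource name."""
    name = f"projects/{project}/locations/{location}/processors/{processor_id}"
    if version is not None:
        # A processor version is addressed as a child of its processor.
        name += f"/processorVersions/{version}"
    return name


def make_parser(project: str, location: str, processor_id: str, gcs_output_path: str):
    """Hypothetical helper: construct a DocAIParser from a location (no explicit client)."""
    # Imported lazily so this module loads without the optional dependencies.
    from langchain_community.document_loaders.parsers.docai import DocAIParser

    return DocAIParser(
        location=location,
        processor_name=processor_resource_name(project, location, processor_id),
        gcs_output_path=gcs_output_path,
    )
```

Calling `make_parser("my-project", "us", "abc123", "gs://my-bucket/docai-output")` would require Google Cloud credentials and the Document AI client library; all four argument values here are placeholders.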

Methods

`__init__`(*[, client, location, ...])	Initializes the parser.
`batch_parse`(blobs[, gcs_output_path, ...])	Parses a list of blobs lazily.
`docai_parse`(blobs, *[, gcs_output_path, ...])	Runs Google Document AI PDF Batch Processing on a list of blobs.
`get_results`(operations)
`is_running`(operations)
`lazy_parse`(blob)	Parses a blob lazily.
`online_process`(blob[, ...])	Parses a blob lazily using online processing.
`operations_from_names`(operation_names)	Initializes Long-Running Operations from their names.
`parse`(blob)	Eagerly parse the blob into a document or documents.
`parse_from_results`(results)

__init__(*, client: Optional[DocumentProcessorServiceClient] = None, location: Optional[str] = None, gcs_output_path: Optional[str] = None, processor_name: Optional[str] = None)[source]¶

Initializes the parser.

Parameters

client – a DocumentProcessorServiceClient to use
location – a Google Cloud location where a Document AI processor is located
gcs_output_path – a path on Google Cloud Storage to store parsing results
processor_name – full resource name of a Document AI processor or processor version

You should provide either a client or a location (in which case a client is instantiated for that location).

batch_parse(blobs: Sequence[Blob], gcs_output_path: Optional[str] = None, timeout_sec: int = 3600, check_in_interval_sec: int = 60) → Iterator[Document][source]¶

Parses a list of blobs lazily.

Parameters

blobs – a list of blobs to parse.
gcs_output_path – a path on Google Cloud Storage to store parsing results.
timeout_sec – a timeout to wait for Document AI to complete, in seconds.
check_in_interval_sec – an interval to wait until next check whether parsing operations have been completed, in seconds

This is a long-running operation. A recommended way is to decouple parsing from creating LangChain Documents:

>>> operations = parser.docai_parse(blobs, gcs_output_path=gcs_path)
>>> parser.is_running(operations)

You can get the operation names and save them:

>>> operation_names = [op.operation.name for op in operations]

And when all operations are finished, you can use their results:

>>> operations = parser.operations_from_names(operation_names)
>>> results = parser.get_results(operations)
>>> docs = parser.parse_from_results(results)
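Saving the operation names lets a later process pick the work up again. The sketch below shows one possible way to persist them as JSON; the function names and the file format are assumptions for illustration, and the only thing taken from the text above is the `op.operation.name` attribute access.

```python
import json
from pathlib import Path
from typing import Any, Iterable, List


def save_operation_names(operations: Iterable[Any], path: str) -> List[str]:
    """Write the names of long-running operations to a JSON file and return them."""
    names = [op.operation.name for op in operations]
    Path(path).write_text(json.dumps(names))
    return names


def load_operation_names(path: str) -> List[str]:
    """Read back the operation names saved by save_operation_names."""
    return json.loads(Path(path).read_text())
```

The loaded list can then be handed to `parser.operations_from_names(...)` in a separate process once the batch jobs have had time to finish.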

docai_parse(blobs: Sequence[Blob], *, gcs_output_path: Optional[str] = None, processor_name: Optional[str] = None, batch_size: int = 1000, enable_native_pdf_parsing: bool = True, field_mask: Optional[str] = None) → List[Operation][source]¶

Runs Google Document AI PDF Batch Processing on a list of blobs.

Parameters

blobs – a list of blobs to be parsed
gcs_output_path – a path (folder) on GCS to store results
processor_name – name of a Document AI processor.
batch_size – number of documents per batch
enable_native_pdf_parsing – a config option for the parser
field_mask – a comma-separated list of which fields to include in the Document AI response. suggested: “text,pages.pageNumber,pages.layout”

Document AI has a limit of 1000 files per batch, so larger sets of blobs need to be split into multiple requests. Batch processing is an asynchronous long-running operation, and results are stored in an output GCS bucket.
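The splitting described above can be illustrated with a generic chunking helper. This is a sketch of the idea only, not the library's internal code; `split_into_batches` is a hypothetical name.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], batch_size: int = 1000) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])
```

With 2500 input blobs and the default `batch_size`, this yields three batches, which would correspond to three separate batch requests.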

get_results(operations: List[Operation]) → List[DocAIParsingResults][source]¶

is_running(operations: List[Operation]) → bool[source]¶
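One way to use `is_running` is inside a polling loop with a timeout, in the spirit of the `timeout_sec` and `check_in_interval_sec` parameters of `batch_parse`. The sketch below is an assumption about how such a loop might look, not a copy of the library's behavior; `wait_for_operations` and its injectable `sleep` argument are hypothetical.

```python
import time
from typing import Any, Callable, Sequence


def wait_for_operations(
    is_running: Callable[[Sequence[Any]], bool],
    operations: Sequence[Any],
    timeout_sec: int = 3600,
    check_in_interval_sec: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until is_running(operations) returns False, or raise TimeoutError."""
    elapsed = 0
    while is_running(operations):
        if elapsed >= timeout_sec:
            raise TimeoutError(f"operations still running after {elapsed} seconds")
        # Wait before checking again, and track how long we have waited in total.
        sleep(check_in_interval_sec)
        elapsed += check_in_interval_sec
```

With a real parser this would be called as `wait_for_operations(parser.is_running, operations)`; passing a custom `sleep` makes the loop easy to test without waiting.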

lazy_parse(blob: Blob) → Iterator[Document][source]¶

Parses a blob lazily.

Parameters: blob – a Blob to parse

This is a long-running operation. A recommended way is to batch documents together and use the batch_parse() method.

online_process(blob: Blob, enable_native_pdf_parsing: bool = True, field_mask: Optional[str] = None, page_range: Optional[List[int]] = None) → Iterator[Document][source]¶

Parses a blob lazily using online processing.

Parameters

blob – a blob to parse.
enable_native_pdf_parsing – enable extraction of text embedded in the PDF
field_mask – a comma-separated list of which fields to include in the Document AI response. suggested: “text,pages.pageNumber,pages.layout”
page_range – list of page numbers to parse. If None, entire document will be parsed.
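The `field_mask` argument is a single comma-separated string, so it can be convenient to assemble it from a list of field names. The sketch below does that and then forwards it to `online_process`; both helper names are hypothetical, and `parser` and `blob` are assumed to be an existing `DocAIParser` and `Blob`.

```python
from typing import Any, Iterable, List, Optional


def build_field_mask(fields: Iterable[str]) -> str:
    """Join field names into the comma-separated form expected by field_mask."""
    cleaned = [field.strip() for field in fields if field.strip()]
    return ",".join(cleaned)


def parse_pages(parser: Any, blob: Any, pages: Optional[List[int]] = None) -> list:
    """Hypothetical helper: run online processing with the suggested field mask."""
    mask = build_field_mask(["text", "pages.pageNumber", "pages.layout"])
    # page_range=None means the entire document is parsed.
    return list(parser.online_process(blob, field_mask=mask, page_range=pages))
```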

operations_from_names(operation_names: List[str]) → List[Operation][source]¶

Initializes Long-Running Operations from their names.

parse(blob: Blob) → List[Document]¶

Eagerly parse the blob into a document or documents.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

Parameters: blob – Blob instance
Returns: List of documents
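The difference between the eager and lazy methods can be shown with a toy stand-in. `ToyParser` below is purely illustrative and has nothing to do with Document AI; it assumes only that `parse` materializes everything `lazy_parse` yields.

```python
from typing import Iterator, List


class ToyParser:
    """Illustrative parser that splits text on form feeds, one 'document' per page."""

    def __init__(self) -> None:
        self.pages_parsed = 0

    def lazy_parse(self, blob: str) -> Iterator[str]:
        # Work happens only as the caller iterates, one page at a time.
        for page in blob.split("\f"):
            self.pages_parsed += 1
            yield page

    def parse(self, blob: str) -> List[str]:
        # Eager variant: consume the whole iterator and hold every page in memory.
        return list(self.lazy_parse(blob))
```

Creating the iterator with `lazy_parse` does no work until the first item is requested, which is why the lazy form is preferable when documents are large or numerous.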

parse_from_results(results: List[DocAIParsingResults]) → Iterator[Document][source]¶

Examples using DocAIParser¶

docai.md