`langchain_community.document_loaders.pdf`.AmazonTextractPDFLoader

class langchain_community.document_loaders.pdf.AmazonTextractPDFLoader(file_path: str, textract_features: Optional[Sequence[str]] = None, client: Optional[Any] = None, credentials_profile_name: Optional[str] = None, region_name: Optional[str] = None, endpoint_url: Optional[str] = None, headers: Optional[Dict] = None, *, linearization_config: Optional[TextLinearizationConfig] = None)

Load PDF files from a local file system, HTTP or S3.

To authenticate, the AWS client uses the following methods to automatically load credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html

To use a specific credential profile, pass its name from the ~/.aws/credentials file as credentials_profile_name.

Make sure the credentials / roles used have the required policies to access the Amazon Textract service.
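
The two authentication routes described above (a named profile, or a client you build yourself) can be sketched as follows. This is an illustration, not the library's own code: `auth_kwargs` and `make_loader` are hypothetical helpers, and the sketch assumes langchain-community and its Textract dependencies are installed when `make_loader` is called.

```python
from typing import Any, Dict, Optional


def auth_kwargs(profile_name: Optional[str] = None,
                region_name: Optional[str] = None,
                client: Optional[Any] = None) -> Dict[str, Any]:
    """Collect only the authentication arguments that were actually given."""
    candidates = {
        "client": client,
        "credentials_profile_name": profile_name,
        "region_name": region_name,
    }
    # Dropping None values lets the loader fall back to its own defaults.
    return {key: value for key, value in candidates.items() if value is not None}


def make_loader(file_path: str, **auth: Any) -> Any:
    """Hypothetical helper: build the loader with the chosen credentials."""
    # Imported lazily so this sketch can be read and imported without the
    # optional dependencies being installed.
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    return AmazonTextractPDFLoader(file_path, **auth_kwargs(**auth))
```

For example, `make_loader("s3://my-bucket/my-file.pdf", profile_name="dev", region_name="us-east-1")` would use the `dev` profile; the bucket, key, and profile name are placeholders. A client created with `boto3.Session(profile_name="dev").client("textract")` could be passed as `client=` instead.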

Example
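
A minimal usage sketch, assuming langchain-community, boto3, and amazon-textract-caller are installed and AWS credentials are configured. `load_pdf_pages` and `preview` are hypothetical helper names introduced only for this illustration.

```python
from typing import Any, Iterable, List


def load_pdf_pages(file_path: str) -> List[Any]:
    """Run a PDF through Amazon Textract and return the resulting Documents."""
    # Imported lazily so defining this helper does not require the
    # optional dependencies to be installed.
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    loader = AmazonTextractPDFLoader(file_path)
    return loader.load()


def preview(documents: Iterable[Any], width: int = 80) -> List[str]:
    """Return the first `width` characters of each Document's text."""
    return [document.page_content[:width] for document in documents]
```

Calling `preview(load_pdf_pages("s3://my-bucket/my-file.pdf"))` would give a short excerpt of each returned Document; the S3 URI is a placeholder for a PDF your credentials can read.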

Initialize the loader.

Parameters

file_path (str) – Local file path, URL, or S3 path of the input file.
textract_features (Optional[Sequence[str]]) – Features to use for extraction. Pass each feature as a str that matches a member name of the Textract_Features enum; see the amazon-textract-caller package.
client (Optional[Any]) – boto3 Textract client.
credentials_profile_name (Optional[str]) – AWS profile name, if not the default profile.
region_name (Optional[str]) – AWS region, e.g. us-east-1.
endpoint_url (Optional[str]) – Endpoint URL for the Textract service.
linearization_config (Optional[TextLinearizationConfig]) – Configuration used to linearize the output. Must be an instance of TextLinearizationConfig from the textractor package.
headers (Optional[Dict]) – HTTP headers to send with the request when the file is downloaded from a URL.
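
The textract_features and linearization_config parameters can be combined as in the sketch below. Several things here are assumptions to verify against your installed packages: the set of feature names (taken to be the members of Textract_Features), the `hide_header_layout` and `hide_footer_layout` fields of TextLinearizationConfig, and its import path. `normalize_features` and `make_layout_loader` are hypothetical helpers.

```python
from typing import Any, List, Sequence

# Assumed member names of the Textract_Features enum; check the installed
# amazon-textract-caller version for the authoritative list.
ASSUMED_FEATURE_NAMES = {"FORMS", "TABLES", "QUERIES", "SIGNATURES", "LAYOUT"}


def normalize_features(features: Sequence[str]) -> List[str]:
    """Upper-case feature names and reject any that are not recognized."""
    normalized = [name.strip().upper() for name in features]
    unknown = [name for name in normalized if name not in ASSUMED_FEATURE_NAMES]
    if unknown:
        raise ValueError(f"Unknown Textract feature(s): {unknown}")
    return normalized


def make_layout_loader(file_path: str, features: Sequence[str]) -> Any:
    """Hypothetical helper: build a loader that hides page headers and footers."""
    # Lazy imports keep this sketch importable without the optional packages.
    from langchain_community.document_loaders import AmazonTextractPDFLoader
    from textractor.data.text_linearization_config import TextLinearizationConfig

    # Assumed config fields; they control which layout elements are dropped
    # when the Textract response is flattened into text.
    config = TextLinearizationConfig(
        hide_header_layout=True,
        hide_footer_layout=True,
    )
    return AmazonTextractPDFLoader(
        file_path,
        textract_features=normalize_features(features),
        linearization_config=config,
    )
```

Validating the names up front gives a clear error before any request is sent to Textract.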

Attributes

source

Methods

`__init__`(file_path[, textract_features, ...])	Initialize the loader.
`alazy_load`()	A lazy loader for Documents.
`lazy_load`()	Lazy load documents.
`load`()	Load given path as pages.
`load_and_split`([text_splitter])	Load Documents and split into chunks.

__init__(file_path: str, textract_features: Optional[Sequence[str]] = None, client: Optional[Any] = None, credentials_profile_name: Optional[str] = None, region_name: Optional[str] = None, endpoint_url: Optional[str] = None, headers: Optional[Dict] = None, *, linearization_config: Optional[TextLinearizationConfig] = None) → None

Initialize the loader.


Return type

None

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type: AsyncIterator[Document]
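
Because alazy_load returns an async iterator, it is consumed with `async for`. The sketch below assumes only that the loader object exposes `alazy_load()` as documented above; `collect_async` and `collect` are hypothetical helper names.

```python
import asyncio
from typing import Any, List


async def collect_async(loader: Any) -> List[Any]:
    """Gather every Document yielded by loader.alazy_load()."""
    documents = []
    async for document in loader.alazy_load():
        documents.append(document)
    return documents


def collect(loader: Any) -> List[Any]:
    """Synchronous wrapper for code that is not already inside an event loop."""
    return asyncio.run(collect_async(loader))
```

Inside an existing coroutine, use `await collect_async(loader)` directly rather than the synchronous wrapper.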

lazy_load() → Iterator[Document]

Lazy load documents.

Return type: Iterator[Document]
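
Since lazy_load returns an iterator, a caller can stop after the first few Documents instead of building the full list. A minimal sketch, assuming only the `lazy_load()` method documented above; `first_pages` is a hypothetical helper.

```python
from itertools import islice
from typing import Any, List


def first_pages(loader: Any, limit: int) -> List[Any]:
    """Return at most `limit` Documents from loader.lazy_load()."""
    # islice stops pulling from the iterator once `limit` items are taken.
    return list(islice(loader.lazy_load(), limit))
```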

load() → List[Document]

Load given path as pages.

Return type: List[Document]

load_and_split(text_splitter: Optional[TextSplitter] = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated.

Parameters: text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
Returns: List of Documents.
Return type: List[Document]
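
Given that this method should be considered deprecated, one alternative is to call load() and pass the result to a text splitter yourself. This is a sketch rather than the library's recommended recipe: it assumes the splitter exposes `split_documents`, as LangChain TextSplitter instances do, and `load_then_split` is a hypothetical helper.

```python
from typing import Any, List


def load_then_split(loader: Any, text_splitter: Any) -> List[Any]:
    """Load all Documents, then split them into chunks with the given splitter."""
    documents = loader.load()
    # Each chunk is returned as its own Document.
    return text_splitter.split_documents(documents)
```

With LangChain installed, `text_splitter` could be, for instance, `RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)`; the sizes are illustrative values, not recommendations.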

Examples using AmazonTextractPDFLoader

Amazon Textract