PDF

The PDF step helps you extract content from a PDF document.

Note: The PDF extract feature is not supported on CentOS/Red Hat Enterprise Linux 7.x operating systems.

The Recorder View shows a single page of the PDF document tree and the extracted text. The robot can navigate through the document using the Next Page, Previous Page, and Goto Page actions on the Application Action menu. To open the menu, right-click the application tab in the Recorder View.

Text extraction results depend on the internal data and structure of the PDF document. The text is split according to the formatting of the document and its underlying accessibility data, so the result might include text that lies outside the page boundaries or is hidden by overlapping elements. If the required accessibility data is missing, which is more common in older PDF documents, you might need to use the Extract Text From Image step to extract the text using OCR.
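
The decision to fall back to OCR can be sketched as a simple check on the extracted text. The following Python snippet is only an illustration and is not part of the product: the function name needs_ocr_fallback and the min_chars threshold are assumptions chosen for this example, and a real robot would make the equivalent decision with its own steps and conditions.

```python
def needs_ocr_fallback(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True when the extracted text looks too sparse to be useful.

    Hypothetical heuristic: count the non-whitespace characters and compare
    the count with an assumed threshold (min_chars).
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars


# An empty extraction suggests that the page has no usable text layer.
print(needs_ocr_fallback(""))  # True

# A page with ordinary text passes the check.
print(needs_ocr_fallback("Invoice 2024-001, total due: 1,250.00 EUR"))  # False
```

The threshold is arbitrary. A page that legitimately contains very little text would also trigger the fallback, so tune the value to the documents you process.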

The Extract text application action and the Extract text component action can be used to extract structured text from a specific area of the page.

Properties

Action
Select an action to perform using the PDF.
Document Source
  • Local File: Specify the path to the file in the local file system in the File path field.

  • Robot File System: Specify the path to the file in the robot file system in the File path field.

  • Binary: Specify a variable or expression containing a PDF document in binary form.
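
When the Binary option is used, it can help to confirm that the value really holds a PDF before passing it to the step. The sketch below is a hedged Python illustration, not product functionality: looks_like_pdf is a hypothetical helper, and it relies only on the fact that a PDF file starts with the "%PDF-" header.

```python
PDF_MAGIC = b"%PDF-"  # header bytes that begin a PDF file


def looks_like_pdf(data: bytes) -> bool:
    """Return True when the binary data starts with the PDF header.

    This is a strict check: data with leading bytes before the header
    is rejected, even if some PDF viewers would still open it.
    """
    return data.startswith(PDF_MAGIC)


# Illustrative sample bytes, not a complete PDF document.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
print(looks_like_pdf(sample))     # True
print(looks_like_pdf(b"<html>"))  # False
```

The check only inspects the header. It does not validate the rest of the document, so a truncated or corrupted file can still pass.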

Page number
Optionally specify the physical page to show after opening the document. If this property is not specified, the first page is shown.

Component actions

Action          Description
Extract text    Extracts text from the selected element of the PDF document.