Extract from PDF

This action extracts text and images from a PDF document contained as binary data in a selected binary variable.

Typically, the PDF document has been downloaded into the variable using an Extract Target step. The output from the "Extract from PDF" action is an HTML page containing the text and images extracted from the PDF document. In subsequent steps, the desired information can then be extracted from the page, in the same way as for other HTML pages.

Note the following:

  • PDF documents do not contain structure information such as tables or paragraphs, only the positions of text and graphics, which might or might not be arranged to look like tables or paragraphs. This can make it difficult to extract the desired information from PDF documents. However, the Extract from PDF step applies some heuristics to group the text into HTML paragraphs based on the available position information.
  • The Extract from PDF step cannot extract data entered in forms. To make the form data available for extraction, you need to flatten the document using a third-party tool.

Properties

The "Extract from PDF" action can be configured using the following properties:

PDF Variable

The binary variable containing the PDF document as binary data.

Include Images

Specifies whether embedded images should be extracted. Note that not all images and graphics can be extracted from PDF documents; it depends on how they were originally embedded in the document.

Include Form XObjects

This option enables extraction of Form XObjects from the PDF. A Form XObject groups objects within a PDF file. The objects may include text, images, vector elements, and so on. Form XObjects are usually used to store objects that are referenced multiple times within a document.

Include Positioning

Specifies whether the positions of the texts should be extracted. The positions may be useful to derive the structure of the document.
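
As a rough illustration of how position information can help recover structure, the following Python sketch groups text fragments into rows by their vertical position. The fragment format (dictionaries with `text`, `x`, and `y` keys), the function name, and the tolerance value are assumptions made for this example; they do not reflect the actual HTML produced by the step.

```python
def group_into_rows(fragments, tolerance=2.0):
    """Group text fragments into rows, reading each row left to right.

    `fragments` is a hypothetical list of dicts with "text", "x" and "y" keys.
    A fragment joins the current row if its y position lies within
    `tolerance` of the first fragment in that row.
    """
    rows = []
    # Visit fragments top to bottom, then left to right.
    for frag in sorted(fragments, key=lambda f: (f["y"], f["x"])):
        if rows and abs(frag["y"] - rows[-1][0]["y"]) <= tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Order each row by horizontal position and keep only the text.
    return [[f["text"] for f in sorted(row, key=lambda f: f["x"])] for row in rows]


fragments = [
    {"text": "Price", "x": 200, "y": 100.5},
    {"text": "Item", "x": 50, "y": 100.0},
    {"text": "Pen", "x": 50, "y": 120.0},
    {"text": "1.50", "x": 200, "y": 119.2},
]
rows = group_into_rows(fragments)
```

In this sketch, fragments that sit at nearly the same height end up in the same row, which is one simple way to approximate table rows from positions alone.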

Include Formatting

Specifies whether the formatting (font names, sizes etc.) of the texts should be extracted. Like the positions, the formatting may be useful to derive the structure of the document.
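
To illustrate how formatting can hint at structure, the sketch below treats any text set in a larger font than the most common size as a likely heading. The fragment format (dictionaries with `text` and `size` keys) and the function name are hypothetical and chosen only for this example; the rule itself is a simple assumption, not the behavior of the step.

```python
from collections import Counter


def find_headings(fragments):
    """Return the texts whose font size exceeds the most common size.

    `fragments` is a hypothetical list of dicts with "text" and "size" keys.
    The most frequent size is assumed to be the body text size.
    """
    if not fragments:
        return []
    size_counts = Counter(f["size"] for f in fragments)
    body_size = size_counts.most_common(1)[0][0]
    return [f["text"] for f in fragments if f["size"] > body_size]


fragments = [
    {"text": "Invoice", "size": 18},
    {"text": "Line one", "size": 10},
    {"text": "Line two", "size": 10},
    {"text": "Totals", "size": 14},
    {"text": "Line three", "size": 10},
]
headings = find_headings(fragments)
```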

Merge Text

By default, the converter that generates the HTML from the PDF merges text that is on the same line into one HTML element, even if it is represented as separate pieces of text in the PDF document. Though this is often desirable, in some cases it can cause text that was originally far apart to be merged together and appear right next to each other. A typical case where it is desirable to turn this feature off is a document that contains more than one column. With the feature turned off, the converter attempts to preserve the column structure.
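
The effect on a two-column document can be sketched in Python as follows. This is only an illustration of the idea, not the converter's actual algorithm: the fragment format (dictionaries with `text`, `x`, and `width` keys), the function name, and the `max_gap` parameter are all assumptions made for the example.

```python
def merge_line(fragments, max_gap=None):
    """Merge the text fragments of one line, left to right.

    `fragments` is a hypothetical list of dicts with "text", "x" and "width".
    With max_gap=None, everything on the line is merged into one string.
    With a numeric max_gap, fragments separated by a wider horizontal gap
    are kept apart, which keeps columns from running together.
    """
    groups = []
    prev_end = None
    for frag in sorted(fragments, key=lambda f: f["x"]):
        gap = None if prev_end is None else frag["x"] - prev_end
        if groups and (max_gap is None or gap <= max_gap):
            groups[-1].append(frag["text"])
        else:
            groups.append([frag["text"]])
        prev_end = frag["x"] + frag["width"]
    return [" ".join(group) for group in groups]


line = [
    {"text": "Left", "x": 50, "width": 30},
    {"text": "column", "x": 85, "width": 45},
    {"text": "Right", "x": 320, "width": 35},
    {"text": "column", "x": 360, "width": 45},
]
merged = merge_line(line)                # all text on the line in one element
separated = merge_line(line, max_gap=20) # wide gaps keep the columns apart
```

Here `merged` holds a single string in which the two columns run together, while `separated` holds one string per column.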