PDF files

Kofax Transformation Modules supports the PDF/A file format for document processing.

Emails often include electronic documents as attachments. These attachments carry additional information, both unstructured full text and structured text, that cannot be stored with images. Converting the email attachments into PDF/A documents keeps all of the information from these attachments.

PDF/A is a file format for long-term archiving of electronic documents. It is based on the PDF Reference Version 1.4 from Adobe Systems Inc. PDF/A is a subset of PDF, obtained by omitting PDF features that are not suited to long-term archiving. Because a PDF/A document embeds all of the fonts that it uses, a PDF/A file is often larger than an equivalent PDF file that does not have the fonts embedded.

Note PDF files should use a resolution of 300 DPI in order to be recognized correctly. Other resolutions may cause incorrect extraction and classification results.
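As a rough sanity check, the effective resolution of a scanned page can be estimated from the pixel width of the page image and the page width in PDF points (1 point = 1/72 inch). The following sketch is illustrative only and is not part of Kofax Transformation Modules. The function names and the 5 DPI tolerance are assumptions, and it assumes that the page image spans the full page width.

```python
POINTS_PER_INCH = 72  # default PDF user space: 1 point = 1/72 inch


def effective_dpi(image_width_px: int, page_width_pt: float) -> float:
    """Estimate the DPI of a page image stretched across the full page width."""
    if page_width_pt <= 0:
        raise ValueError("page width must be positive")
    return image_width_px * POINTS_PER_INCH / page_width_pt


def is_recommended_resolution(dpi: float, target: float = 300.0,
                              tolerance: float = 5.0) -> bool:
    """Return True when dpi lies within the (assumed) tolerance of the target."""
    return abs(dpi - target) <= tolerance


# US Letter is 612 points (8.5 inches) wide; a 2550-pixel-wide scan is 300 DPI.
print(effective_dpi(2550, 612))  # 300.0
```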

Unlike single text or image files, which need to be grouped into documents, each PDF file already represents one complete document. As a result, several batch editing operations are not available for PDF files.

Important Almost all editing functions that change a document are disabled for PDF documents. Therefore, you cannot delete pages from a PDF document, split a PDF document, or merge two PDF documents. However, you can rotate pages, and you can reject, move, or delete PDF documents or add sticky notes to them.

Batch restructuring via script is restricted in the same way: only operations that do not change the physical structure of a PDF document are allowed. As with batch editing, a script can still reject, move, or delete PDF documents, or add sticky notes to them.

Because PDF files cannot be split or merged, they cannot be used to test separation or to generate separation benchmark statistics. During separation benchmark testing, the documents in the selected document set are broken into individual pages and then merged into classified documents using the project separation settings, which is not possible for PDF files. If you attempt to generate separation benchmark statistics for a document set that includes one or more PDF files, you receive an error.
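To avoid this error, a document set can be checked for PDF files before the benchmark is started. The following is a minimal sketch, not a feature of Kofax Transformation Modules. It assumes that the document set is available as a list of file paths and that PDF files are recognizable by their .pdf extension. The function name and the file names are hypothetical.

```python
from pathlib import PurePath


def find_pdf_files(document_paths):
    """Return the paths with a .pdf extension (case-insensitive), in input order."""
    return [p for p in document_paths if PurePath(p).suffix.lower() == ".pdf"]


# Hypothetical document set selected for a separation benchmark.
document_set = ["invoice_001.tif", "order_17.PDF", "letter.txt", "contract.pdf"]

pdf_files = find_pdf_files(document_set)
if pdf_files:
    print("Remove before benchmarking:", pdf_files)
```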

Normally, a PDF file has its full text embedded in the file. Kofax Transformation Modules can use this embedded text, so there is no need to run recognition. The embedded text can then be used for extraction, content classification, and other purposes, just like regular recognition results.

Document security restrictions on a PDF are not a problem for Kofax Transformation Modules, and the document still extracts successfully. The following list shows examples of security restrictions that a PDF can have without affecting extraction.

  • printing

  • document assembly

  • content copying for accessibility

  • page extraction

  • commenting

  • filling of form fields

  • signing

  • creation of template pages
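
For background, a PDF that uses the standard security handler stores such restrictions as bit flags in the /P entry of its encryption dictionary, as described in the PDF Reference. The following sketch decodes these flags. It is illustrative only and does not describe how Kofax Transformation Modules evaluates them. The function and dictionary names are assumptions, the labels are simplified, and not every entry in the list above corresponds to a dedicated flag.

```python
# Bit positions as given in the PDF Reference (bit 1 is the lowest-order bit).
# The labels are simplified; for example, bit 6 also covers form filling.
PERMISSION_BITS = {
    "printing": 3,
    "modifying contents": 4,
    "content copying": 5,
    "commenting": 6,
    "filling of form fields": 9,
    "content copying for accessibility": 10,
    "document assembly": 11,
    "high-quality printing": 12,
}


def decode_permissions(p_value: int) -> dict:
    """Map each permission name to True (allowed) or False (restricted)."""
    flags = p_value & 0xFFFFFFFF  # /P is stored as a signed 32-bit integer
    return {name: bool(flags & (1 << (bit - 1)))
            for name, bit in PERMISSION_BITS.items()}


# Example /P value in which every flag listed above is cleared.
permissions = decode_permissions(-3904)
print([name for name, allowed in permissions.items() if not allowed])
```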