How pages are classified for Trainable Document Separation

When a multi-page document is classified using trainable document separation, each page is classified individually and assigned a confidence value.

Because of this, the classification settings on the Project Settings - Classification tab are important. At least one classification method must be enabled; do not select both "Do not use content classification" and "Do not use layout classification". This matters in particular if you do not have the necessary licensing to use content classification, because layout classification must then remain enabled.

When the first Transformation Server instance runs, each page is classified. Based on these page classification results, the documents are then separated and classified.

Pages can be classified as:

  • A start page of a document.

  • A middle page of a document.

  • A last page of a document.

Typically, single-page documents are classified as a start page with high confidence.

Each page is processed individually and given a set of possible page classification results.

These results are compared to those of the surrounding pages. The possible page classification results are evaluated to determine the most logical way of separating the pages into documents. For example, if a middle page is surrounded by a start page and a last page, the three pages most likely form a single three-page document.
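The grouping idea in the example above can be illustrated with a short Python sketch. This is a simplified illustration and not the product's actual separation logic: it assumes each page has already been given a single final page type, whereas the real evaluation weighs several possible results per page. The function name split_into_documents and the labels "start", "middle", and "last" are hypothetical.

```python
from typing import List


def split_into_documents(page_types: List[str]) -> List[List[int]]:
    """Group page indices into documents.

    Simplifying assumption: a page labeled "start" always opens a new
    document, and any other page is appended to the current document.
    """
    documents: List[List[int]] = []
    for index, page_type in enumerate(page_types):
        if page_type == "start" or not documents:
            # A start page (or the very first page) opens a new document.
            documents.append([index])
        else:
            # Middle and last pages belong to the document in progress.
            documents[-1].append(index)
    return documents


# A three-page document, a single-page document, and a two-page document.
pages = ["start", "middle", "last", "start", "start", "last"]
print(split_into_documents(pages))  # [[0, 1, 2], [3], [4, 5]]
```

In this sketch the single-page document is simply a start page that is immediately followed by another start page.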

The separation settings require a difference of at least 10% between the highest confidence and the next highest confidence.

If the difference is less than 10%, a page classification conflict occurs, and the document is displayed in Document Review so that a user can verify its classification and separation.
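The conflict rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: confidences are expressed in percent, the 10% difference is read as 10 percentage points, and the names MIN_CONFIDENCE_GAP and has_classification_conflict are hypothetical.

```python
from typing import Dict

# Assumption: the required difference is 10 percentage points.
MIN_CONFIDENCE_GAP = 10.0


def has_classification_conflict(
    confidences: Dict[str, float],
    min_gap: float = MIN_CONFIDENCE_GAP,
) -> bool:
    """Return True if the two best results are closer than min_gap.

    confidences maps a possible page classification result to its
    confidence in percent (0 to 100).
    """
    ranked = sorted(confidences.values(), reverse=True)
    if len(ranked) < 2:
        # With a single candidate there is nothing to conflict with.
        return False
    return (ranked[0] - ranked[1]) < min_gap


clear_page = {"start": 80.0, "middle": 15.0, "last": 5.0}
ambiguous_page = {"start": 48.0, "middle": 44.0, "last": 8.0}

print(has_classification_conflict(clear_page))      # False
print(has_classification_conflict(ambiguous_page))  # True
```

In the second case the two best results differ by only 4 points, so the page would be flagged for review.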