How pages are classified for Trainable Document Separation

Classifying a document with multiple pages using trainable document separation means that each individual page in the batch is classified and assigned a confidence value.

Because of this, the classification settings configured on the Project Settings - Classification tab are important. You must have at least one classification method enabled; do not select both "Do not use content classification" and "Do not use layout classification". This matters in particular if you do not have the necessary licensing to use content classification, because layout classification must then remain enabled.

When the first Server instance runs, each page in the batch is classified, and depending on these page classification results, the documents are separated and classified.

Pages can be classified as:

  • A start page of a document.

  • A middle page of a document.

  • A last page of a document.

Typically, single page documents are classified as a start page with a high confidence.

Each page in the batch is processed and given one or more possible page classification results. These results are compared to those of the surrounding pages and evaluated to determine the most logical way of separating the pages into documents. For example, if a middle page is preceded by a start page and followed by a last page, it is likely that these three pages form a three-page document.
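The grouping step can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name `separate_pages`, the tuple layout, and the rule that every start page opens a new document are assumptions made for illustration only.

```python
def separate_pages(pages):
    """Group classified pages into documents.

    pages: list of (doc_class, position) tuples, where position is
    "start", "middle", or "last" (hypothetical representation).
    Assumption: every start page opens a new document, and any other
    page is appended to the document that is currently open.
    """
    documents = []
    for doc_class, position in pages:
        if position == "start" or not documents:
            # A start page (or the very first page) opens a new document.
            documents.append({"class": doc_class, "page_count": 1})
        else:
            # Middle and last pages extend the current document.
            documents[-1]["page_count"] += 1
    return documents


batch = [
    ("Invoice", "start"),
    ("Invoice", "middle"),
    ("Invoice", "last"),
    ("Order", "start"),
]
result = separate_pages(batch)
print(result)
```

In this sketch the first three pages form one three-page Invoice and the final start page becomes a single-page Order, which matches the single-page behavior described earlier.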

The following example shows how the individual pages in a batch are classified. The page classification confidence levels help determine where to separate documents, and what the final classification result should be.

Figure 1. Classified Page Confidences before Separation
An image that shows the classification confidences for each page in a batch.

This example shows two possible classification results for the last two pages in the batch. The green path has a higher confidence value for both pages; this is the most logical way to separate the documents.
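Choosing between two candidate paths can be sketched as follows. The scoring rule (summing the page confidences along each path), the path names, and the confidence values are all hypothetical; the source does not state how the product scores a path.

```python
def best_path(paths):
    """Return the name of the path with the highest total confidence.

    paths: dict mapping a path name to a list of
    (page_classification, confidence_percent) tuples.
    Assumption: a path is scored by the sum of its page confidences.
    """
    return max(paths, key=lambda name: sum(conf for _, conf in paths[name]))


# Hypothetical candidate results for the last two pages of a batch.
candidate_paths = {
    "green": [("Invoice_start", 80), ("Invoice_last", 75)],
    "red": [("Order_start", 40), ("Order_last", 35)],
}
chosen = best_path(candidate_paths)
print(chosen)
```

Because the green path has the higher confidence on both pages, any scoring rule that increases with each page's confidence would select it; the sum is only one such choice.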

Figure 2. Final Document Separation and Classification Results
An image that shows the final document separation and classification results.

The separation settings indicate that there must be at least a 10% difference between the highest confidence level and the next highest confidence level. In this case, the fourth document can be either an Invoice_start or an Order_start. Because the confidence level of the Invoice_start page classification result is higher than that of the Order_start result, and the difference between the two values is more than 10%, the most likely document classification result is an Invoice.

If the difference were less than 10%, a page classification conflict would occur, and the document would be displayed in Document Review so that a user could verify the classification and separation.
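The threshold check can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's logic: the function name `resolve_classification` is hypothetical, the confidence values are invented, and the 10% difference is interpreted as an absolute gap of 10 percentage points.

```python
def resolve_classification(candidates, min_difference=10):
    """Pick the best candidate and flag a conflict if the lead is too small.

    candidates: dict mapping a page classification result to its
    confidence in whole percent (hypothetical representation).
    Assumption: the difference is measured in percentage points.
    Returns (best_result, conflict).
    """
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    best_result, best_conf = ranked[0]
    if len(ranked) == 1:
        # With a single candidate there is nothing to conflict with.
        return best_result, False
    second_conf = ranked[1][1]
    # A lead smaller than the configured minimum is treated as a conflict.
    conflict = (best_conf - second_conf) < min_difference
    return best_result, conflict


clear_case = resolve_classification({"Invoice_start": 80, "Order_start": 55})
close_case = resolve_classification({"Invoice_start": 62, "Order_start": 58})
print(clear_case, close_case)
```

In the first call the lead is 25 points, so the result is accepted. In the second call the lead is only 4 points, so the sketch flags a conflict, which corresponds to the case where the document is sent to Document Review.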