Conflict management

Document training in Project Builder usually includes additional testing and benchmarks. Document training during validation cannot use these additional steps, so additional precautions are implemented to minimize the risk of extraction errors caused by incorrectly trained documents. A document for a specific customer should be trained twice during online learning to allow a confident extraction. After the first document is trained, the fields of documents with the same layout are extracted with a lower confidence. These fields are invalid, and an error description is displayed in the Current Error area. The best practice is to confirm the field and mark the document for extraction online learning a second time.

After a document is trained two or more times in the same way, other documents with the same layout are extracted with a high confidence, and the field status in validation is no longer invalid.

Table 1. Normally Trained Documents without Errors
Number of correctly trained documents   Result    Confidence
1                                       Correct   50 % (not confident)
2                                       Correct   85 % (confident)
3                                       Correct   90 % (confident)
>3                                      Correct   100 % (confident)

A document with incorrect training data can cause problems in subsequent batches: a document with the same layout is extracted incorrectly based on the faulty training data. Because of the lower confidence after the first training, the Validation user can correct the error and train the correct values for this document by marking it for extraction online learning. The specific training algorithm then recognizes a so-called conflict.

Note: Conflicts in fields that are relevant for training but not printed on the document cannot be resolved within the Resolve Conflicts window. For example, the vendor ID that the Vendor Locator returns as a result is not printed on the document. To resolve such a conflict, the Edit Document window is displayed.

The algorithm counts the number of documents trained for each version of the field position. For the next extraction, the field position is chosen from the trained samples. The final field confidence depends on the number of sample documents for the correctly trained version and for the incorrectly trained version.
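
The counting step can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name count_position_versions and the representation of a field position as a (left, top, width, height) tuple are assumptions made only for this example.

```python
from collections import Counter

def count_position_versions(trained_positions):
    """Count how many documents were trained for each field position version.

    trained_positions holds one (left, top, width, height) tuple per trained
    document of the same layout. This representation is hypothetical.
    """
    return Counter(trained_positions)

# Two documents were trained with one position, one document with another,
# so the two versions are in conflict.
samples = [(100, 40, 80, 12), (100, 40, 80, 12), (300, 40, 80, 12)]
versions = count_position_versions(samples)
has_conflict = len(versions) > 1
```

In this sketch, a field counts as conflicting as soon as more than one position version exists among the trained samples.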

Table 2. Training with an Initially Incorrect Field Value
Correctly trained documents   Incorrectly trained documents   Result      Confidence
0                             1                               Incorrect   50 % (not confident)
1                             1                               Incorrect   40 % (not confident)
2                             1                               Correct     60 % (not confident)
3                             1                               Correct     80 % (confident)
4                             1                               Correct     85 % (confident)
>4                            1                               Correct     90 % (confident)
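
As a reading aid, the rows of Table 2 can be written as a small Python lookup. This is only a sketch that restates the table. It does not show how the values are computed internally, and the names TABLE_2, MORE_THAN_FOUR, and lookup_table_2 are hypothetical.

```python
# Rows of Table 2, keyed by the number of correctly trained documents
# when exactly one document was trained incorrectly.
# Each value is (result, confidence in percent, confident?).
TABLE_2 = {
    0: ("Incorrect", 50, False),
    1: ("Incorrect", 40, False),
    2: ("Correct", 60, False),
    3: ("Correct", 80, True),
    4: ("Correct", 85, True),
}
MORE_THAN_FOUR = ("Correct", 90, True)  # the ">4" row

def lookup_table_2(correct_count):
    """Return the Table 2 row for the given number of correctly trained documents."""
    if correct_count < 0:
        raise ValueError("correct_count must not be negative")
    return TABLE_2.get(correct_count, MORE_THAN_FOUR)

result, confidence, confident = lookup_table_2(2)
```

The lookup makes the pattern in the table easy to see: the result becomes correct with the second correctly trained document, but the field becomes confident only with the third.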

As long as a field is invalid, a modified icon and an error description are displayed for the document in the training subset.

You can resolve a conflict in the Resolve Conflicts window by deleting an incorrectly trained field or document, or by correcting the field position. To improve the extraction results for those documents, you can also confirm the field value by using the Edit Document window.

Eliminating a conflict between two documents can indirectly resolve other conflicts. The Resolve Conflicts window skips the documents that are no longer in conflict, and the current status is displayed in the status bar.

Important: Conflicting fields can be found only for documents that have a layout similarity of 80% or greater. Two documents from the same vendor may look the same to the user but actually have a layout similarity of less than 80%. Such documents are handled separately internally. As a result, they cannot be compared and conflicts are not displayed, even if it seems that there are conflicting fields.
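
The effect of the 80% threshold can be sketched in Python as follows. The similarity values are invented for illustration, and the names LAYOUT_SIMILARITY_THRESHOLD and comparable_pairs are hypothetical. How layout similarity is actually calculated is not described here.

```python
LAYOUT_SIMILARITY_THRESHOLD = 0.80  # 80% or greater, as stated above

def comparable_pairs(pair_similarities):
    """Return the document pairs that can be checked for conflicting fields.

    pair_similarities maps a (doc_a, doc_b) pair to a layout similarity
    between 0.0 and 1.0. Both the mapping and its values are assumptions.
    """
    return [
        pair
        for pair, similarity in pair_similarities.items()
        if similarity >= LAYOUT_SIMILARITY_THRESHOLD
    ]

# Example values only: the second pair looks alike to a user but falls
# below the threshold, so no conflict would be displayed for it.
similarities = {
    ("invoice_1", "invoice_2"): 0.92,
    ("invoice_1", "invoice_3"): 0.78,
}
pairs = comparable_pairs(similarities)
```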

If a conflict is caused by contradicting table fields, you must delete the conflicting document, because only one document can be used for training the table layout. If, however, you want to use this document for training other extraction fields, you must open the Edit Document window and skip training for the table fields by clearing the check box beside the Table Definition button.