Thresholds, precision, and recall

The overall quality of the classification process can be expressed by precision and recall. The classification of a document, when compared with a reference set, leads to one of three results:

  • Correct classification

  • Incorrect classification (also known as a false positive or substitution)

  • No classification (also known as a rejection)

A threshold ensures that all classification results below a certain confidence level are suppressed. The confidence is the degree of concurrence between the document and a chosen class.

Two types of thresholds can be defined:

Absolute threshold

An absolute value (expressed as a percentage) that indicates the minimum concurrence a document must have with a class for the result to be accepted. For example, if a classification process returns a confidence of 73% as the best result, that result is accepted as the final result if the threshold is set to 73% or lower. Otherwise, the result is rejected and the document remains unclassified, unless there is a default class.

Relative distance

The minimum required difference between the confidences of the best result and the second best result for the best class to be accepted as the classification result. For example, a classification process might return confidences of 73% for the best class and 62% for the second best class. If the required relative distance is set to 11% or less, the result is accepted (provided that the absolute threshold criterion is also fulfilled); otherwise, it is rejected and the document is left unclassified, unless there is a default class.

If more than one class is defined, you can specify a minimum difference between the best result and the next best result to get a unique classification result. If you accept multiple results, you do not need a relative distance. However, Kofax Transformation Modules is designed to determine a unique class as the classification result.
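The interplay of the two thresholds can be sketched in a few lines of Python. This is only an illustration of the rules described above, not the product's implementation: the function name `accept_classification`, its parameters, and the class names are hypothetical, and confidences are given as percentages.

```python
from typing import Optional, Sequence, Tuple


def accept_classification(
    results: Sequence[Tuple[str, float]],
    absolute_threshold: float,
    relative_distance: float,
    default_class: Optional[str] = None,
) -> Optional[str]:
    """Return the accepted class, or default_class (None = unclassified) on rejection.

    results holds hypothetical (class_name, confidence_percent) pairs.
    """
    if not results:
        return default_class

    # Rank the candidate classes by confidence, best first.
    ranked = sorted(results, key=lambda item: item[1], reverse=True)
    best_class, best_confidence = ranked[0]

    # Absolute threshold: the best confidence must reach the minimum.
    if best_confidence < absolute_threshold:
        return default_class

    # Relative distance: the best result must lead the second best by enough.
    if len(ranked) > 1:
        second_confidence = ranked[1][1]
        if best_confidence - second_confidence < relative_distance:
            return default_class

    return best_class


# The example from the text: 73% for the best class, 62% for the second best.
candidates = [("Invoice", 73), ("Order", 62)]
print(accept_classification(candidates, absolute_threshold=73, relative_distance=11))
print(accept_classification(candidates, absolute_threshold=73, relative_distance=12))
```

With an absolute threshold of 73% and a relative distance of 11%, the first call accepts the best class. Raising the relative distance to 12% makes the second call return `None`, which corresponds to an unclassified document when no default class exists.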

Precision is the percentage of correctly classified documents among all classified documents. Recall is the percentage of correctly classified documents among all documents that should be classified.

The blue area in the image that follows depicts the set of all documents. The vertical reference line divides this set of documents into two groups: Class A or Not Class A. The classifier decides if a document belongs to Class A or Not Class A. This is depicted by the diagonal line. If the classifier and the reference set were perfect, the vertical line and the diagonal line would exactly match. Since this is not the case, three subsets are created by the intersection of these two lines:

  • The a group is the subset of correctly classified documents.

  • The b group is the subset of incorrectly classified documents.

  • The c group is the subset of documents that are not classified.

If there is more than one class, the weighted values of Precision (P) and Recall (R) are added over all classes to get an overall result. If no threshold is defined, P and R are equal, since every document that is incorrectly assigned to one class is missing from another class.
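The equality of P and R without a threshold can be checked with a small sketch. The weighting scheme is not spelled out here, so the code assumes each class is weighted by its share of documents, which reduces to dividing the summed counts (often called micro-averaging). The function name `overall_precision_recall` and the sample data are hypothetical, and a rejected document is counted as missing from its reference class.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def overall_precision_recall(
    pairs: Iterable[Tuple[str, Optional[str]]],
) -> Tuple[float, float]:
    """pairs holds (reference_class, assigned_class); assigned_class is None if rejected."""
    a, b, c = Counter(), Counter(), Counter()
    for reference, assigned in pairs:
        if assigned == reference:
            a[reference] += 1      # correctly classified
        else:
            c[reference] += 1      # missing from its reference class
            if assigned is not None:
                b[assigned] += 1   # incorrectly added to another class

    correct = sum(a.values())
    classified = correct + sum(b.values())  # documents that received a class
    expected = correct + sum(c.values())    # documents that should be classified

    # Assumption: an empty denominator is reported as 0.0.
    precision = correct / classified if classified else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


# No threshold: every document receives a class, so P equals R.
no_rejections = [("A", "A"), ("A", "A"), ("A", "B"),
                 ("B", "B"), ("B", "B"), ("B", "A")]
print(overall_precision_recall(no_rejections))

# One former misclassification is now rejected: P rises above R.
with_rejection = [("A", "A"), ("A", "A"), ("A", None),
                  ("B", "B"), ("B", "B"), ("B", "A")]
print(overall_precision_recall(with_rejection))
```

In the first data set, each misclassified document appears once as an incorrect result for one class and once as a missing result for another, so both denominators are identical. In the second data set, the rejected document leaves the set of classified documents but still counts toward the documents that should be classified.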

If a threshold is introduced, a third set of rejected documents is created that is not shown in the graph. A threshold increases precision by suppressing incorrectly classified documents, but it lowers recall because correctly classified documents with a low confidence are rejected as well.
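The trade-off can be illustrated by evaluating the same set of results at several absolute thresholds. The following sketch uses made-up confidences and a hypothetical helper named `precision_recall_at`; it only demonstrates the general effect and does not reproduce the interactive threshold setting tool.

```python
from typing import Sequence, Tuple


def precision_recall_at(
    results: Sequence[Tuple[float, bool]],
    threshold: float,
) -> Tuple[float, float]:
    """results holds (confidence_percent, is_correct) for the best class of each document."""
    # Results below the threshold are suppressed, that is, rejected.
    accepted = [is_correct for confidence, is_correct in results
                if confidence >= threshold]
    correct = sum(accepted)

    # Assumption: an empty denominator is reported as 0.0.
    precision = correct / len(accepted) if accepted else 0.0
    recall = correct / len(results) if results else 0.0
    return precision, recall


# Hypothetical reference set of eight documents.
SAMPLE = [
    (95, True), (90, True), (85, True), (80, False),
    (75, True), (60, False), (55, True), (40, False),
]

for threshold in (0, 70, 85):
    p, r = precision_recall_at(SAMPLE, threshold)
    print(f"threshold {threshold:>2}%: P = {p:.3f}, R = {r:.3f}")
```

For this sample, precision rises from 0.625 to 0.800 and then to 1.000 as the threshold moves from 0% to 70% to 85%, while recall falls from 0.625 to 0.500 and then to 0.375.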

Determine P and R for your classification scheme using the Result Matrix tool in Project Builder. Use the interactive threshold setting tool to set the system to the required precision for the reference set.

Figure 1. The Relationship between Precision and Recall
An image that shows the relationship between precision and recall.
Figure 2. Mathematical Formulas to Calculate Precision (P) and Recall (R)
An image that shows that precision (P) can be calculated using the formula P=(a/(a+b)), and recall (R) can be calculated using R=(a/(a+c)).
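
The formulas in Figure 2 can be evaluated directly from the sizes of the a, b, and c groups. The sketch below is a minimal illustration with made-up counts. The function name `precision_recall` is hypothetical, and returning 0.0 for an empty denominator is an assumption rather than documented behavior.

```python
from typing import Tuple


def precision_recall(a: int, b: int, c: int) -> Tuple[float, float]:
    """Compute P = a / (a + b) and R = a / (a + c).

    a: correctly classified documents
    b: incorrectly classified documents
    c: documents that are not classified
    """
    # Assumption: an empty denominator is reported as 0.0.
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    return precision, recall


# Made-up counts: 80 correct, 10 incorrect, 20 not classified.
p, r = precision_recall(a=80, b=10, c=20)
print(f"P = {p:.1%}, R = {r:.1%}")
```

With these counts, P is 80/90 (about 88.9%) and R is 80/100 (80.0%).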