Set up classification

In document capture, classification is the assignment of a document to a category or a class. This category is predefined based on your project class hierarchy. Without classification, successful extraction or archiving is impossible.

Manual classification typically follows a hierarchical scheme. First, the main category of a document is determined. The classification is further refined over several steps until the final document category is determined. Kofax Transformation Modules enables you to replicate your manual classification hierarchy scheme so that automatic classification achieves the same results.
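The step-by-step refinement described above can be pictured as a walk down a class tree. The following is a minimal conceptual sketch in Python, not product code: `ClassNode`, `classify`, the keyword predicates, and the class names are all hypothetical and are not part of Kofax Transformation Modules, which builds its hierarchy in the Project Tree rather than in code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ClassNode:
    """One class in a hypothetical class hierarchy (illustration only)."""
    name: str
    # Illustrative predicate that decides whether a document belongs to this class.
    matches: Callable[[str], bool]
    children: list["ClassNode"] = field(default_factory=list)


def classify(node: ClassNode, text: str) -> str:
    """Start at `node` and refine the result for as long as a child class matches."""
    for child in node.children:
        if child.matches(text):
            return classify(child, text)
    # No child matched, so this node is the most specific class found.
    return node.name


# Hypothetical hierarchy: main categories first, refined categories below them.
tree = ClassNode("Documents", lambda t: True, [
    ClassNode("Invoices", lambda t: "invoice" in t.lower(), [
        ClassNode("Credit Notes", lambda t: "credit note" in t.lower()),
    ]),
    ClassNode("Orders", lambda t: "purchase order" in t.lower()),
])

result = classify(tree, "Invoice 4711: credit note for returned goods")
```

Here the document is first recognized as an invoice and then refined to the hypothetical "Credit Notes" class. A document that matches no child class stays at the most specific class reached so far.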

To configure automatic classification, you need a class hierarchy. You create and maintain this hierarchy in the Project Tree. Each class added to the Project Tree can represent a possible classification result. The Project Tree, in combination with the project classification settings, determines the classification result of documents in production.

A document can be classified based on its physical layout or its content, and the order in which the classifiers are processed determines the final classification result. You can use a combination of documents trained before production and classification instructions, or you can use Classification Online Learning, which collects training documents for use in classification while a project is in production. Online learning ensures that new documents or classes are absorbed by the project easily, without extensive configuration.

To aid in classification, first perform Clustering on a set of documents, and then add the pre-classified documents to your Classification Training document set so that classification can learn by example. You assign sample documents to each class. When the project is trained, the sample documents are analyzed, and important features are extracted and used to define the class. Whether your documents are used for layout or content classification depends on how each class is configured.

Note You do not need training documents at runtime. The project contains all of the extracted information required for classification.

If a class in your project is used to classify a form that has a consistent layout, layout classification usually returns a confident result and content classification is not needed. If you configure this class so it does not train for content classification, then only layout classification is attempted. Similarly, if a class has an inconsistent layout, the best results are usually returned by using the content classifier only. Classifiers are always processed in a specific order.
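As a rough illustration of how per-class training settings limit which classifiers are attempted, consider the following Python sketch. It is not the product API: `ClassConfig`, `train_layout`, `train_content`, and `classifiers_to_try` are hypothetical names, and the layout-before-content order is an assumption made for this example only.

```python
from dataclasses import dataclass


@dataclass
class ClassConfig:
    """Hypothetical per-class training switches (not actual product settings)."""
    name: str
    train_layout: bool = True
    train_content: bool = True


# Assumption for this sketch only: layout is processed before content.
CLASSIFIER_ORDER = ("layout", "content")


def classifiers_to_try(cfg: ClassConfig) -> list[str]:
    """Return the classifiers attempted for a class, in the fixed processing order."""
    enabled = {"layout": cfg.train_layout, "content": cfg.train_content}
    return [name for name in CLASSIFIER_ORDER if enabled[name]]


# A form with a consistent layout: content training is switched off.
tax_form = ClassConfig("TaxForm", train_content=False)
# A class with an inconsistent layout: only the content classifier is used.
letter = ClassConfig("Correspondence", train_layout=False)

tax_form_classifiers = classifiers_to_try(tax_form)
letter_classifiers = classifiers_to_try(letter)
```

With these hypothetical settings, only layout classification is attempted for `TaxForm` and only content classification for `Correspondence`. A class that leaves both switches on is tried by both classifiers in the fixed order.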

Before testing your classification settings for a class or project, train your project. After training, the documents in your Classification Training document set serve as the basis of comparison for the documents you are processing. For a document to be classified successfully, its confidence must be greater than or equal to the configured classification thresholds.
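The threshold rule can be sketched as a simple comparison. This Python example is a simplified illustration under stated assumptions: `accept_classification` is a hypothetical helper, confidences are assumed to lie on a 0 to 1 scale, and a single minimum-confidence threshold stands in for the thresholds that the project actually configures.

```python
from typing import Optional


def accept_classification(confidences: dict[str, float],
                          threshold: float) -> Optional[str]:
    """Return the best-scoring class if it meets the threshold, otherwise None.

    `confidences` maps hypothetical class names to confidence values.
    """
    if not confidences:
        return None
    # Pick the class with the highest confidence.
    best = max(confidences, key=confidences.get)
    # "Greater than or equal to": a confidence exactly at the threshold passes.
    return best if confidences[best] >= threshold else None


accepted = accept_classification({"Invoice": 0.82, "Order": 0.40}, threshold=0.80)
borderline = accept_classification({"Invoice": 0.80}, threshold=0.80)
rejected = accept_classification({"Invoice": 0.79, "Order": 0.40}, threshold=0.80)
```

In this sketch `accepted` and `borderline` are both `"Invoice"`, while `rejected` is `None`, which corresponds to a document that is left unclassified.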

Important After changing the properties of your classifiers, or after adding documents to or deleting documents from your training set, you must retrain your project.

After classification is configured, run some preliminary classification tests. When you are satisfied with the results, you can run more detailed classification benchmarks.

Note If you define fields at the project level, the extraction results are used to classify a document. For example, you can classify a document by extracting a bar code.