Reference

Class

A class is a collection of documents that have common features based on certain criteria. The document classifier operates on the basis of text and/or layout, so only documents that share common words or expressions and/or are similar in graphic appearance can belong to the same class.

Classes must have a unique name.

The training set and test set contain representative documents of each class.

In the current version of Document Classifier Assistant, a project must have at least two classes defined and populated with documents before the training and testing process can be run.

Document

  • Scanned page or a page from a PDF file: in the case of multi-page documents or PDF files, the current version of Document Classifier Assistant classifies on a single-page basis.

  • Generic text: the program supports the following text encodings: Unicode (both UTF-16 and UTF-8, with or without Byte Order Mark) and non-Unicode text encoded with the Windows default codepage (as specified under Control Panel > Region and Language > Administrative tab > Change system locale).

Hidden class

The training process does not collect information on hidden classes.

Alien document

In classification we differentiate between two scenarios:

Contained or Closed project

All documents to be classified belong to a class that we defined.

Partial or Open project

There are documents not belonging to any class.

In a Partial project, documents not belonging to any class are called alien documents. Alien documents do not have any features in common, so they do not form a class.

Hidden document

The training process does not collect information on hidden documents.

Contained project

All input documents are guaranteed to fit one of the defined document classes. A contained project does not have alien documents. In a contained project, only the misclassified and rejected error types can occur.

Partial project

Input documents can fall outside every defined document class. A partial project can have alien documents. In a partial project, the misclassified, false negative and false positive error types can occur.

Classifying

While classifying, Document Classifier Assistant preprocesses the scanned documents and, if needed, performs OCR on them. It then determines the document features and compares them with the characteristic class features. If only layout classification is chosen, no OCR is performed. After this, it determines, for each class, the confidence that the document belongs to that class. This confidence vector is returned by the Document Classifier API (kRecClassifyText, kRecClassifyPage and kRecClassifyDocument). The class with the highest confidence is the predicted class. If the confidence reaches the preset confidence threshold, the classification is confident. Otherwise, the classification result is rejected (in a contained project) or alien (in a partial project).
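The decision step described above can be sketched in a few lines of Python. This is only an illustration: the function name `decide` and the dictionary-based confidence vector are assumptions made for this example and are not part of the Document Classifier API.

```python
from typing import Dict, Tuple


def decide(confidences: Dict[str, int], threshold: int = 50) -> Tuple[str, int, bool]:
    """Pick the predicted class from a confidence vector (hypothetical helper).

    confidences maps each class name to its confidence level.
    Returns (predicted class, its confidence, whether the result is confident).
    """
    # The class with the highest confidence is the predicted class.
    predicted = max(confidences, key=confidences.get)
    best = confidences[predicted]
    # The classification is confident only if the confidence reaches the threshold.
    return predicted, best, best >= threshold


print(decide({"invoice": 82, "letter": 11}))
print(decide({"invoice": 40, "letter": 35}))
```

In the second call the predicted class is still "invoice", but the result is not confident, so it would be treated as rejected in a contained project or as alien in a partial project.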

Training

Training is the process of Document Classifier Assistant analyzing the training set documents and determining characteristic class features. The training determines the characteristic features automatically, but certain textual features can be defined if required (see Phrases) or the metaword and stopword lists used for searching the textual features can be modified.

While training, the scanned documents are preprocessed (deskew and auto-rotation) and, if text-based classification was also requested, OCR-ed. Training always saves the project.

Training set

Document Classifier Assistant determines the characteristic features of classes on the basis of the training set, so the training set must contain representative documents of each class. When adding documents to the training set, the application does not copy the files, but stores a link to the original file. This is why you should never modify or delete the original files as long as the project exists. There is no generic rule as to exactly how many documents the training set should contain. If the documents of a class are very similar to each other (for example, forms whose words are much the same throughout), then fewer documents suffice for the training. If the documents of a class are diverse (for example, you want to separate texts on various topics), then a greater number of documents (possibly several hundred) might be required for the training. When training, Document Classifier Assistant gives a warning for those classes whose training set contains fewer than ten documents.

Training document

These are the files you collect so that the application can learn their features and later compare them against the features of the test documents.

Testing

The testing process classifies the test set documents and displays the classification results. In the project properties you can set testing to run together with training.

It is advisable to run testing after training to check classification accuracy. The result is displayed in a table; the confusion matrix, error statistics and a chart are also shown. These data can be displayed for any confidence threshold value, and the optimal confidence threshold can be calculated. If needed, the training and test sets can be modified and expanded and the training process re-run.

Test set

The test set is for checking the accuracy of classification after the training and for setting the optimal confidence threshold. When adding documents to the test set, Document Classifier Assistant does not copy the files, but only stores a link to the original files. This is why you should never modify or delete the original files as long as the project exists.

Test document

Test files serve as a check on how well the training phase worked. Check the Match, Confidence and Result columns in the Test set > classname section for the results.

Classifier method

You can choose from three classifier methods: Text, Layout and Combined. The default is Combined.

The Text method classifies on the basis of textual features; if the input is a picture file or a PDF, the document must be OCR-ed.

The Layout method classifies on the basis of graphic features; it does not work with text files.

Confidence level

The number (1-100) returned by the classifier that indicates how certain the classification is. The greater the number, the more certain the classification.

Confidence threshold

The threshold (0-100) the user can set. If the Confidence level is less than this threshold when classifying, then the classification is not confident. The default value is 50. If its value is increased, the number of misclassified and false positive errors decreases, but the number of false negative errors increases. While training and testing, the application determines the confidence threshold at which the weighted sum of the errors is the least (considering the weights that belong to the errors; see Error weights).

The application saves the specified threshold in the exported project file. The CSDK provides functions for querying and changing the confidence threshold.

False positive

In a Partial (open) project: the classifier takes an alien document as belonging to a class.

False negative

In a Partial (open) project: the classifier takes a class document as alien.

Rejected

In a Contained (closed) project: the classifier cannot determine the class.

Misclassified

The classifier takes a class document as belonging to a different class.
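The four error types (false positive, false negative, rejected, misclassified) can be told apart with a small decision function. The sketch below is illustrative only; the function `outcome` and its parameters are hypothetical names, and an alien document is represented by `None` as its true class.

```python
from typing import Optional


def outcome(true_class: Optional[str], predicted: str,
            confident: bool, partial: bool) -> str:
    """Label one test result (hypothetical helper, not the application's code).

    true_class is None for an alien document (possible only in a partial project).
    """
    if not confident:
        if true_class is None:
            return "correct"  # an alien document was correctly left unassigned
        # A class document was not recognized with enough confidence.
        return "false negative" if partial else "rejected"
    if true_class is None:
        return "false positive"  # an alien document was assigned to a class
    return "correct" if predicted == true_class else "misclassified"


print(outcome(None, "invoice", confident=True, partial=True))
print(outcome("letter", "invoice", confident=True, partial=False))
```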

Stopwords

Stopwords are language-dependent common terms that usually do not have significance in document classification (for example, "and" or "the"). The predefined list can be modified by deleting or adding such words. Stopwords are matched exactly, and the matching is case-insensitive.
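As a rough illustration of exact, case-insensitive stopword matching, the following sketch filters a word list. The function name and the sample list are assumptions made for this example; the application's own stopword handling is not shown in this reference.

```python
from typing import Iterable, List


def remove_stopwords(words: Iterable[str], stopwords: Iterable[str]) -> List[str]:
    """Drop words that exactly match a stopword, ignoring case (illustrative only)."""
    stop = {s.casefold() for s in stopwords}
    # Exact match on the whole word: "Andes" is kept even though it starts with "and".
    return [w for w in words if w.casefold() not in stop]


print(remove_stopwords(["The", "Invoice", "and", "Andes"], ["and", "the"]))
```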

Metawords

These are language-dependent regular expressions that map text to a generalized concept, such as dates and dollar amounts. The application contains predefined metawords, but you can edit them or add new ones, for example for social security numbers or bank accounts (where the actual value is not necessarily relevant but the existence of such an entity is an important feature).
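The idea of a metaword can be sketched with ordinary regular expressions: every match is replaced by a generic token, so the classifier sees "a date" or "an amount" instead of the concrete value. The patterns and the token names `META_DATE` and `META_AMOUNT` below are invented for this example and are not the application's predefined metawords.

```python
import re

# Hypothetical metaword definitions: token name -> regular expression.
METAWORDS = {
    "META_DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "META_AMOUNT": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def apply_metawords(text: str) -> str:
    """Replace each metaword match with its token name (illustrative only)."""
    for token, pattern in METAWORDS.items():
        text = pattern.sub(token, text)
    return text


print(apply_metawords("Due 12/31/2020: $1,250.00"))
```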

Phrases

Each class has a collection of phrases. The usage of phrases is optional, but they can help in certain situations to achieve better accuracy. When adding a new phrase to a class, it is also included in the phrase list of the other classes with Default weight. A phrase can only be deleted by removing it from the project level phrase list. This way it is also deleted from the phrase list of each class.

A Phrase has the following properties:

  • Id

  • Pseudo-word

  • Weight

  • Group

  • Position

Id

A string that is inserted into the word list of the document when a Pseudo-word is found in the processed text.

The same Id can belong to different Pseudo-words.

Pseudo-words

There are two types of Pseudo-words:

  • Phrase literals (n-grams): They consist of one or more consecutive literal words. They are searched for in the target document using fuzzy matching, which compensates for some OCR errors.

  • Meta-words
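How fuzzy searching for a phrase literal can tolerate OCR errors may be illustrated as follows. This sketch uses the similarity ratio from Python's standard `difflib` module as a stand-in; the matching algorithm actually used by the application is not described here, and the function name and the 0.75 cut-off are assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def fuzzy_find(phrase: str, words: List[str], min_ratio: float = 0.75) -> int:
    """Return the index of the first word window that resembles the phrase, or -1.

    Illustrative only: compares the phrase with every run of consecutive
    words of the same length, ignoring case.
    """
    target = " ".join(phrase.casefold().split())
    n = len(target.split())
    for i in range(len(words) - n + 1):
        window = " ".join(w.casefold() for w in words[i:i + n])
        if SequenceMatcher(None, target, window).ratio() >= min_ratio:
            return i
    return -1


# "lnvoice nurnber" imitates typical OCR confusions (i/l, m/rn).
ocr_words = "Please pay lnvoice nurnber 1234".split()
print(fuzzy_find("invoice number", ocr_words))
```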

Phrase weight

It can be selected from the following list:

  • Prohibited: If it exists in the target document then this class cannot be selected. Used to exclude some words that may mislead the classification (for example, an invoice may contain a bank account number that is more relevant to bank statements).

  • Default: It is treated normally, like any non-special word of the document.

  • Important: It is given a higher weight in the statistical calculations. Used to emphasize those words in the class that the user deems more important than usual (for example, the word 'Invoice' for an invoice-type class).

  • Very important: It is given an even higher weight in the statistical calculations

  • Mandatory: If it does not exist in the target document, then this class cannot be selected

Phrase Group

Mandatory phrases have a Group property. It defines the 'and'/'or' relationship among them when a class has more than one Mandatory phrase. Phrases with the same Group have an 'and' relationship, while those with different Groups have an 'or' relationship.
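Read literally, this rule means a class remains selectable if, for at least one Group, every Mandatory phrase of that Group is present. A minimal sketch of that reading follows; the function name and the data layout (pairs of phrase Id and Group) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Set, Tuple


def mandatory_satisfied(mandatory: Iterable[Tuple[str, Hashable]],
                        found: Set[str]) -> bool:
    """Check Mandatory phrases: 'and' within a Group, 'or' across Groups.

    mandatory: (phrase Id, Group) pairs; found: phrase Ids found in the document.
    Illustrative only, not the application's implementation.
    """
    groups = defaultdict(list)
    for phrase_id, group in mandatory:
        groups[group].append(phrase_id)
    if not groups:
        return True  # no Mandatory phrases, so nothing can exclude the class
    return any(all(p in found for p in ids) for ids in groups.values())


rules = [("INVOICE", 1), ("TOTAL", 1), ("RECEIPT", 2)]
print(mandatory_satisfied(rules, {"INVOICE", "TOTAL"}))  # Group 1 complete
print(mandatory_satisfied(rules, {"INVOICE"}))           # no Group complete
```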

Phrase position

Only Mandatory phrases can have a position, and even for them it is optional. Phrase position is the physical location (bounding box) of the phrase on the page of the training document. If it has a valid value, a target document can only be matched by this class if it contains the phrase in the vicinity of this location (the whole phrase must fit in a bounding box that is the defined one plus a 'halo' around it).
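The position check amounts to a box-containment test against the trained bounding box enlarged by the halo. The sketch below assumes pixel coordinates with the origin at the top left and a uniform halo width; the type `Box`, the function name and the halo value are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def within_halo(found: Box, trained: Box, halo: int) -> bool:
    """True if the found phrase fits entirely inside the trained box plus the halo."""
    return (found.left >= trained.left - halo
            and found.top >= trained.top - halo
            and found.right <= trained.right + halo
            and found.bottom <= trained.bottom + halo)


trained = Box(100, 50, 300, 80)
print(within_halo(Box(90, 45, 295, 85), trained, halo=20))    # slightly shifted
print(within_halo(Box(100, 400, 300, 430), trained, halo=20))  # far lower on the page
```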

Error weights

The error weights are used for setting the optimal confidence threshold. When the confidence threshold changes, the number of classification errors also changes. In a contained project, the fewest errors always occur at a confidence threshold of zero. In a partial project, the number of misclassified and false positive errors decreases as the confidence threshold increases, but the number of false negative errors increases. The optimal value depends on the cost of these three error types; the error weights represent these three costs. After testing, a chart can show how the weighted sum of the errors changes with the confidence threshold, and the optimal confidence threshold can be queried.
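The search for the optimal threshold in a partial project can be sketched as a scan over all threshold values, keeping the one with the smallest weighted error sum. This is an assumption-based illustration: the tuple layout, the function names and the tie-breaking rule (lowest threshold wins) are not taken from the application.

```python
from typing import List, Optional, Tuple

# (true class or None for an alien document, predicted class, confidence level)
Result = Tuple[Optional[str], str, int]


def weighted_errors(results: List[Result], threshold: int,
                    w_mis: float = 1.0, w_fp: float = 1.0, w_fn: float = 1.0) -> float:
    """Weighted sum of errors at one confidence threshold (partial project)."""
    total = 0.0
    for true_class, predicted, confidence in results:
        confident = confidence >= threshold
        if true_class is None:
            if confident:
                total += w_fp   # alien document accepted into a class
        elif not confident:
            total += w_fn       # class document treated as alien
        elif predicted != true_class:
            total += w_mis      # class document assigned to the wrong class
    return total


def optimal_threshold(results: List[Result], **weights: float) -> int:
    """Lowest threshold in 0..100 with the smallest weighted error sum."""
    return min(range(0, 101), key=lambda t: weighted_errors(results, t, **weights))


test_results = [("invoice", "invoice", 90), ("invoice", "letter", 40),
                ("letter", "letter", 70), (None, "invoice", 55)]
print(optimal_threshold(test_results))            # equal weights
print(optimal_threshold(test_results, w_fn=5.0))  # false negatives are costly
```

With equal weights the scan settles just above the alien document's confidence; making false negatives five times as costly pushes the optimum down to zero, which matches the trade-off described above.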