General tab - Properties of Classification Locator window

The General tab enables you to select a classification project, configure the classification mode, and specify the minimum confidence. The following settings are available:

Referenced project file

Browse to the folder that contains the referenced project and select the project file.

Automatic update from project file

Select this setting to ensure that the local copy of the referenced project stays up to date with the original referenced project file. For the best results, select this setting only if the referenced project is updated regularly. This setting is cleared by default.

Classification Mode

This group enables you to select one of the following classification modes:

  • Complete document (text only). This mode means that the entire document and its text are used for classification.

    This mode does not consider any hierarchical classification rules, such as subtree classification or default classification results. This is the default value for this setting.

    The text used here for classification can be restricted to specific regions or pages.

  • Line by line (text only). This mode means that each text line is classified individually and returned as an alternative, if the confidence is high enough. The results are then sorted by confidence.

    The coordinates of the line are included with the returned alternatives and each alternative is highlighted on the document. This enables the calling project to access these coordinates as needed, for example, to find the highest line on a page that was classified as a specific value.

  • By Paragraph (text only). This mode uses paragraphs from a document for classification, honoring chapter or section numbers as well as numbered or bullet lists.

  • Complete document (hierarchical). This mode means that both layout and text classification can be used in the referenced classification project. For the actual classification process, the various settings in the classification project are used.

    The regions definition is used to determine how many pages need OCR.

    A final classification result can have a very low confidence if certain classification rules were applied. Its confidence can also be lower than the confidence of other classes that are not the final classification result. In that case, use the Set classification result to 100% setting.

    If you choose this value, you cannot define a default result for the locator, so the Default result setting in the Result Mode group is disabled. If no result is found for this locator, the default classification result that is defined within the referenced project file is assigned as the locator result.

None of the above classification modes execute scripts, including classification scripts.
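The coordinate example given for the Line by line (text only) mode can be sketched as follows. This is only an illustration in Python, not the product's scripting API: the LineAlternative record, its field names, and the assumption that the top coordinate is measured downward from the top edge of the page are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LineAlternative:
    """Hypothetical stand-in for one alternative returned in line-by-line mode."""
    class_name: str
    confidence: float  # assumed to be a percentage, 0-100
    page: int
    top: int  # assumed: distance from the top edge of the page


def highest_line(alternatives: List[LineAlternative],
                 class_name: str, page: int) -> Optional[LineAlternative]:
    """Return the topmost line on `page` classified as `class_name`, or None."""
    matches = [a for a in alternatives
               if a.class_name == class_name and a.page == page]
    # Under the assumed coordinate system, the smallest `top` is the highest line.
    return min(matches, key=lambda a: a.top, default=None)


alternatives = [
    LineAlternative("Address", 91.0, page=1, top=640),
    LineAlternative("Address", 78.0, page=1, top=210),
    LineAlternative("Total", 88.0, page=1, top=120),
    LineAlternative("Address", 95.0, page=2, top=50),
]
print(highest_line(alternatives, "Address", page=1))
```

Note that the highest line is not necessarily the alternative with the highest confidence, which is why the coordinates are needed in addition to the confidence ordering.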

Classification Settings

This group has the following settings:

Min. confidence

Only classification results with a confidence higher than or equal to this value are returned as alternatives. The value for this setting is set to 70 by default.
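The effect of the threshold can be illustrated with a short Python sketch. The function name and the (class name, confidence) tuples are hypothetical, and confidences are assumed to be percentages. The point of the example is that the comparison is inclusive, so a result at exactly the threshold is kept.

```python
from typing import List, Tuple


def meets_min_confidence(results: List[Tuple[str, float]],
                         min_confidence: float = 70.0) -> List[Tuple[str, float]]:
    """Keep results whose confidence is higher than or equal to the threshold."""
    return [(name, conf) for name, conf in results if conf >= min_confidence]


# Hypothetical classification results as (class name, confidence) pairs.
results = [("Invoice", 83.0), ("Order", 70.0), ("Letter", 69.9)]
print(meets_min_confidence(results))
```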

Set classification result to 100%

This setting is enabled only when the Complete document (hierarchical) value is selected for the Classification Mode setting. When this setting is selected, the confidence of the alternative that is the final classification result is always 100%. This is important because the confidence of the final classification result might be very low as a result of subtree classification or using the default classification result. If that were to happen, it would not be possible to distinguish between the final classification result and other possible alternatives. This setting is selected by default.
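The following Python sketch illustrates why this matters. It is not the product's implementation: the function, the class names, and the confidence values are hypothetical. Here the final result ("Unknown") has the lowest confidence, so without the adjustment it cannot be told apart from the other alternatives by confidence alone.

```python
from typing import Dict


def apply_final_result(confidences: Dict[str, float], final_class: str,
                       set_to_100: bool = True) -> Dict[str, float]:
    """Return a copy of the confidences with the final result raised to 100."""
    adjusted = dict(confidences)  # leave the caller's data unchanged
    if set_to_100:
        adjusted[final_class] = 100.0
    return adjusted


# Hypothetical case: the final result came from a default classification
# result and therefore has a lower confidence than the other classes.
confidences = {"Invoice": 55.0, "Order": 40.0, "Unknown": 12.0}
adjusted = apply_final_result(confidences, final_class="Unknown")
print(max(adjusted, key=adjusted.get))
```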

Min. words in a paragraph

When the Classification Mode is set to By Paragraph (text only), this setting is available so that you can configure the minimum size of a paragraph. Paragraphs with fewer words than the specified value are not classified. The value for this setting is set to 20 by default.
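As a rough illustration in Python, the filter below skips paragraphs that are too short. How the product counts words is not stated here, so splitting on whitespace is an assumption, as is the function name.

```python
from typing import List


def paragraphs_to_classify(paragraphs: List[str], min_words: int = 20) -> List[str]:
    """Drop paragraphs that have fewer than `min_words` words."""
    # Assumption: a word is any whitespace-separated token.
    return [p for p in paragraphs if len(p.split()) >= min_words]


short = "Dear Sir or Madam,"            # 4 words, skipped
exact = " ".join(["word"] * 20)         # exactly 20 words, kept
almost = " ".join(["word"] * 19)        # 19 words, skipped
print(len(paragraphs_to_classify([short, exact, almost])))
```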

Result Mode

This group enables you to select one of the following result modes and to configure the related values:

  • Single topic. If this value is selected, only one class from the referenced project is used as a result in the alternative. This is the default value for this setting.

  • Multi topic. If this value is selected, a semicolon-delimited list of the best class results is used as the alternative value.

Max. number of results (0 = all)

This setting limits the number of returned classification results to the specified number. A value of 0 means that all alternatives that meet the confidence requirements are returned. The value for this setting is set to 5 by default.
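The result limit and the Multi topic value can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions rather than documented behavior: that results are ranked by descending confidence before the limit is applied, and that the list uses a bare semicolon with no surrounding spaces.

```python
from typing import List, Tuple

Result = Tuple[str, float]  # (class name, confidence)


def limit_results(results: List[Result], max_results: int = 5) -> List[Result]:
    """Rank by confidence and keep at most `max_results`; 0 keeps all."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    return ranked if max_results == 0 else ranked[:max_results]


def multi_topic_value(results: List[Result]) -> str:
    """Join the class names into one semicolon-delimited value."""
    return ";".join(name for name, _ in results)


results = [("Order", 75.0), ("Invoice", 92.0), ("Letter", 71.0)]
print(multi_topic_value(limit_results(results, max_results=2)))
print(multi_topic_value(limit_results(results, max_results=0)))
```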

Default result

This setting is disabled unless the Complete document (text only) value is selected for the Classification Mode setting. If no classification result is found, this default result is assigned as the final result. If no default result is defined, the locator returns no value. The value for this setting is set to <none> by default.

The default result can be a text string such as "Nothing", "Unclassified", or something similar.
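The fallback behavior described above can be summarized in a small Python sketch. The function is hypothetical, and None is used here to stand for both the <none> setting and a locator that returns no value.

```python
from typing import List, Optional


def final_result(class_names: List[str],
                 default_result: Optional[str] = None) -> Optional[str]:
    """Return the best class, else the default result, else no value."""
    if class_names:
        return class_names[0]  # assumed: best result comes first
    return default_result      # None models the <none> default


print(final_result([], default_result="Unclassified"))
print(final_result([]))
```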

Definitions for the buttons at the bottom of this window can be found in Common Transformation Designer Buttons.