Advanced High Performance Zonal Settings Window - Mark and Dictionary Tab

Use this window to specify settings for marking (flagging) uncertain characters and for enhancing recognition engine accuracy by using dictionaries.

Mark Settings

These settings affect how characters that the recognition engine cannot recognize with a minimum level of confidence are displayed.

Error flag

Enter an error flag. You can specify a single character (^ is commonly used). The recognition engines use the error flag to mark characters that cannot be identified with the level of confidence you specify by setting the mark level. The default error flag is the null character, which appears as a blank in the window.

Mark Levels

You can specify the minimum level of confidence to accept for character recognition. Characters that do not meet this minimum level are marked with the error flag.

General

If you select the General option, the adjacent drop-down list gives you a choice of three levels of confidence. The default level is Medium. The other choices are Low and High.

A setting of Low means that you accept a minimal level of recognition confidence, resulting in fewer error flags and possibly a greater number of incorrect characters (false positives). This setting is suitable for data that is not very critical.

A setting of Medium means that you accept a moderate level of recognition confidence, resulting in more error flags than the Low setting and fewer errors in the results. This setting is suitable for data that is moderately critical.

A setting of High means that you require a great degree of recognition confidence, resulting in many error flags that require attention but better accuracy. This setting is suitable for data that is very critical.

Specific

If you select the Specific option, you can specify precise levels of confidence for machine-printed and hand-printed characters.

Machine print

Select or enter a value from 0 to 100. This value represents the marking level for machine-printed characters. Characters whose recognition confidence falls below this value are marked with the error flag. If you set this value to 0, almost no characters are marked, since it is very rare to have zero confidence for any given character. If you set this value to 100, almost every character is marked, since it is very rare to have absolute confidence for any given character. The default is 40.

Handprint

Select or enter a value from 0 to 100. This value represents the marking level for hand-printed characters. Characters whose recognition confidence falls below this value are marked with the error flag. If you set this value to 0, almost no characters are marked, since it is very rare to have zero confidence for any given character. If you set this value to 100, almost every character is marked, since it is very rare to have absolute confidence for any given character. The default is 40.
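
The marking rule can be illustrated with a short Python sketch. This is not Kofax Capture code. The function name mark_uncertain, the (character, confidence) input format, and the sample confidence values are assumptions made for illustration only.

```python
def mark_uncertain(chars, mark_level, error_flag="^"):
    """Replace each character whose confidence is below mark_level.

    chars is a list of (character, confidence) pairs with confidence
    values from 0 to 100. This input format is hypothetical.
    """
    if not 0 <= mark_level <= 100:
        raise ValueError("mark_level must be between 0 and 100")
    return "".join(
        ch if confidence >= mark_level else error_flag
        for ch, confidence in chars
    )


# Hypothetical engine output for the word "Total".
recognized = [("T", 95), ("o", 38), ("t", 72), ("a", 41), ("l", 12)]

print(mark_uncertain(recognized, 40))   # T^ta^
print(mark_uncertain(recognized, 0))    # Total
print(mark_uncertain(recognized, 100))  # ^^^^^
```

With the default level of 40, only the two characters with confidence values below 40 are flagged. A level of 0 flags nothing in this sample, and a level of 100 flags every character.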

Dictionary Settings

These settings affect the way Kofax Capture uses the Zonal dictionary. Keep in mind that the Mark and Dictionary tab cannot be used to specify which Zonal dictionary to use. The Zonal dictionary is set in the Field Type Properties or the Create Field Type window.

Word type

You can specify the word type to use when comparing recognized text against the dictionary. As used by Kofax Capture, "word type" refers to the method by which words are separated in the recognized text. The most common method used for separating English words is with a space, but other methods are possible. For example, some words are separated by punctuation marks or tabs instead of spaces.

To understand how words are extracted from the recognized text, you must first know a little about how the recognition engine treats blank spaces in the text. The recognition engine returns not only the size of a blank space, but also the number of consecutive blanks it represents. This number is calculated based on the character pitch setting for the line. If the character pitch is set to a variable type, the character spacing is calculated using the average width of all characters in the line. If this is not possible (for example, because the font changes within the line), the calculation is performed word by word: the average character width of the current word is used to calculate the number of blank spaces preceding that word.
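
As a rough illustration of the variable-pitch case, the following Python sketch estimates the number of blanks from a gap width and the average character width. The function name count_blanks, the use of pixel widths, and rounding to the nearest whole number are all assumptions. The recognition engine's actual calculation is not documented here.

```python
def count_blanks(gap_width, char_widths):
    """Estimate how many blank characters fit into a gap.

    gap_width and char_widths use the same unit (for example, pixels).
    Dividing by the average character width and rounding to the nearest
    integer is an assumed simplification, not the engine's algorithm.
    """
    if not char_widths:
        raise ValueError("char_widths must not be empty")
    average_width = sum(char_widths) / len(char_widths)
    return round(gap_width / average_width)


# Hypothetical widths for the characters in one line (average is 10).
line_widths = [10, 12, 8, 10]

print(count_blanks(21, line_widths))  # 2
print(count_blanks(4, line_widths))   # 0
```

For the word-by-word case, the same function could be called with the character widths of the current word only.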

Depending on the print type setting (such as handprint or machine print), blank spaces influence the separation of the results into words. For hand-printed text, each blank defines the boundary of a word. If the spaces between characters are sufficiently large, each character is interpreted as a word. On the other hand, because of the regularity of machine-printed characters, the recognition engine can frequently distinguish intentional spaces from unintentional ones.

As an example, consider the character string ABC E F G HIJ.

If the original is hand-printed, the string is separated into words ABC, E, F, G and HIJ because of the spaces between the E, F and G.

With machine print, however, the same string resolves to the words ABC, EFG and HIJ. Because of the regular spacing found in machine print, the recognition engine can derive a "typical" spacing pattern. Consequently, the algorithm guesses that the space on either side of the F is not intended.
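
The difference between the two print types can be sketched in Python as follows. This is an illustration only. The function name split_words, the (character, blanks before) input format, the blank counts assigned to the sample string, and the fixed threshold of 2 for machine print are all assumptions. The actual recognition engine derives its spacing pattern from the text itself.

```python
def split_words(chars, print_type, machine_min_blanks=2):
    """Group characters into words based on the blanks preceding them.

    chars is a list of (character, blanks_before) pairs (hypothetical
    format). For handprint, any blank starts a new word. For machine
    print, only a gap of at least machine_min_blanks starts a new word,
    which stands in for the engine's "typical" spacing pattern.
    """
    threshold = 1 if print_type == "handprint" else machine_min_blanks
    words = []
    for ch, blanks_before in chars:
        if not words or blanks_before >= threshold:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # gap too small: extend the current word
    return words


# The string ABC E F G HIJ with assumed blank counts before each character.
line = [("A", 0), ("B", 0), ("C", 0), ("E", 2), ("F", 1),
        ("G", 1), ("H", 2), ("I", 0), ("J", 0)]

print(split_words(line, "handprint"))      # ['ABC', 'E', 'F', 'G', 'HIJ']
print(split_words(line, "machine print"))  # ['ABC', 'EFG', 'HIJ']
```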

There are seven possible word type settings.

Logical

Logical words are groups of alphabetic characters or numerals separated by spaces, certain punctuation marks, or font changes.

Logical Alpha

Logical Alpha words are groups of alphabetic characters separated by spaces, certain punctuation marks, or font changes.

Logical Numerical

Logical Numerical words are groups of numerals separated by spaces, certain punctuation marks, or font changes.

Geometrical

A Geometrical word is any character string delimited by the border of the zone, by spaces, or by font changes.

Whole Line

The entire line is treated as a single word. Spaces and other breaks are ignored.

Alpha Line

This is the same as Whole Line, except that logical numerical words are ignored.

Numerical Line

This is the same as Whole Line, except that logical alpha words are ignored.
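
To make the differences between the word types more concrete, the following Python sketch approximates five of them with regular expressions. It is an approximation under stated assumptions, not the recognition engine's logic: font changes and zone borders are not modeled, every non-alphanumeric character is treated as a separator, and a mixed run such as A1 is split into a letter group and a numeral group. Alpha Line and Numerical Line are left out because this page does not describe how the ignored words are handled. The function name split_into_words is hypothetical.

```python
import re


def split_into_words(text, word_type):
    """Approximate the word separation for a given word type setting."""
    if word_type == "Logical":
        # Groups of letters or groups of numerals.
        return re.findall(r"[A-Za-z]+|[0-9]+", text)
    if word_type == "Logical Alpha":
        return re.findall(r"[A-Za-z]+", text)
    if word_type == "Logical Numerical":
        return re.findall(r"[0-9]+", text)
    if word_type == "Geometrical":
        # Any character string separated by spaces; punctuation is kept.
        return text.split()
    if word_type == "Whole Line":
        return [text]
    raise ValueError(f"word type not covered by this sketch: {word_type}")


sample = "Qty: 12 boxes, 4 pallets"

print(split_into_words(sample, "Logical"))
# ['Qty', '12', 'boxes', '4', 'pallets']
print(split_into_words(sample, "Logical Alpha"))
# ['Qty', 'boxes', 'pallets']
print(split_into_words(sample, "Logical Numerical"))
# ['12', '4']
print(split_into_words(sample, "Geometrical"))
# ['Qty:', '12', 'boxes,', '4', 'pallets']
print(split_into_words(sample, "Whole Line"))
# ['Qty: 12 boxes, 4 pallets']
```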

Maximum length difference

You can determine how closely the length (number of characters) of a recognized word must match a dictionary word. The allowable range for this field is 0 to 3. If you set the difference to 0, the lengths must match exactly. If you set it to 3, the recognized text can be up to three characters longer or shorter than a matching dictionary word. If you set the difference to a value other than 0, the recognition engine can select a best match from the dictionary when no exact match is available. For example, if the difference is set to 1, the recognized word "book" would be matched with "books" in the dictionary.

If the difference exceeds the limit you have specified, the recognized word does not match the dictionary word. In general, you should get the best results by leaving this field at the default value of 1.
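
The length check can be sketched in Python as follows. The sketch covers only the length comparison. How the recognition engine ranks the remaining candidates to pick a best match is not described here, so that step is not shown. The function name length_candidates and the sample dictionary are assumptions.

```python
def length_candidates(word, dictionary, max_length_difference=1):
    """Return dictionary words whose length is close enough to word.

    Only the length comparison is modeled; no similarity ranking is done.
    """
    if not 0 <= max_length_difference <= 3:
        raise ValueError("max_length_difference must be between 0 and 3")
    return [
        entry for entry in dictionary
        if abs(len(entry) - len(word)) <= max_length_difference
    ]


# Hypothetical dictionary contents.
zonal_dictionary = ["books", "bookshelf", "boo", "bookkeeper"]

print(length_candidates("book", zonal_dictionary, 1))  # ['books', 'boo']
print(length_candidates("book", zonal_dictionary, 0))  # []
```

With a difference of 1, "books" remains a candidate for the recognized word "book". With a difference of 0, no entry in this sample dictionary has the required length of four characters.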