Advanced High Performance Zonal Settings Window - Mark and Dictionary Tab
Use this window to specify settings for marking (flagging) uncertain characters and for enhancing recognition engine accuracy by using dictionaries.
Mark Settings
These settings affect the display of characters that the recognition engine cannot recognize with a minimum level of confidence.
- Error flag
-
Enter an error flag, which must be a single character (^ is commonly used). The default error flag is the null character, which shows up as a blank in the window. The recognition engines use the error flag to mark characters that cannot be identified with the level of confidence that you specify by setting the mark level.
- Mark Levels
-
You can specify the minimum level of confidence to accept for character recognition. Characters that do not meet this minimum level are marked with the error flag.
- General
-
If you select the General option, the adjacent drop-down list gives you a choice of three levels of confidence. The default level is Medium. The other choices are Low and High.
A setting of Low means that you accept a minimal level of recognition confidence, resulting in fewer error flags and possibly a greater number of incorrect characters (false positives). This setting is suitable for data that is not very critical.
A setting of Medium means that you accept a moderate amount of recognition confidence, resulting in more error flags than the Low setting and fewer errors in the results. This setting is suitable for data that is moderately critical.
A setting of High means that you require a great degree of recognition confidence, resulting in many error flags that require attention but better accuracy. This setting is suitable for data that is very critical.
- Specific
-
If you select the Specific option, you can specify precise levels of confidence for machine-printed or hand-printed characters.
- Machine print
-
Select or enter a value from 0 to 100. This value represents the marking level for machine-printed characters. Characters whose confidence falls below this value are marked with the error flag. If you set this value to 0, almost no characters are marked, since it is very rare to have zero confidence for any given character. If you set this value to 100, almost every character is marked, since it is very rare to have absolute confidence for any given character. The default is 40.
- Handprint
-
Select or enter a value from 0 to 100. This value represents the marking level for hand-printed characters. Characters whose confidence falls below this value are marked with the error flag. If you set this value to 0, almost no characters are marked, since it is very rare to have zero confidence for any given character. If you set this value to 100, almost every character is marked, since it is very rare to have absolute confidence for any given character. The default is 40.
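The marking behavior described above can be sketched in a few lines of Python. This is only an illustration, not Kofax Capture code: the function name mark_uncertain, the (character, confidence) input format, and the choice to replace a marked character with the flag are assumptions made for the example. The example also uses ^ as the flag instead of the default null character so that the result is visible. The strict "below" comparison follows the wording above; how the engine treats a confidence exactly equal to the marking level is not stated.

```python
def mark_uncertain(chars, mark_level=40, error_flag="^"):
    """Return recognized text with low-confidence characters flagged.

    chars       -- iterable of (character, confidence) pairs, confidence 0-100
                   (hypothetical input format, not an engine API)
    mark_level  -- minimum confidence to accept; 40 is the documented default
    error_flag  -- single character substituted for uncertain characters
    """
    if len(error_flag) != 1:
        raise ValueError("error flag must be a single character")
    if not 0 <= mark_level <= 100:
        raise ValueError("mark level must be between 0 and 100")
    # Assumption: a character is marked when its confidence is strictly
    # below the marking level, and marking replaces the character.
    return "".join(
        ch if conf >= mark_level else error_flag for ch, conf in chars
    )


sample = [("A", 95), ("8", 30), ("C", 40)]
print(mark_uncertain(sample))                  # default marking level of 40
print(mark_uncertain(sample, mark_level=100))  # nearly everything is marked
print(mark_uncertain(sample, mark_level=0))    # nothing is marked
```

Raising the marking level trades fewer undetected errors for more flagged characters that need manual review, which mirrors the Low, Medium, and High choices of the General option.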
Dictionary Settings
These settings affect the way Kofax Capture uses the Zonal dictionary. Keep in mind that the Mark and Dictionary tab cannot be used to specify which Zonal dictionary to use. The Zonal dictionary is set in the Field Type Properties or the Create Field Type window.
- Word type
-
You can specify the word type to use when comparing recognized text against the dictionary. As used by Kofax Capture, "word type" refers to the method by which words are separated in the recognized text. The most common method used for separating English words is with a space, but other methods are possible. For example, some words are separated by punctuation marks or tabs instead of spaces.
To understand how words are extracted from the recognized text, you must first know a little about how the recognition engine treats blank spaces in the text. The recognition engine returns not only the size of a blank space, but also the number of consecutive blanks. This number is calculated based on the character pitch setting for the line. If this is set to a variable type, the character spacing is calculated using the average width of all characters in the line. If this is not possible because, for example, the font changes within the line, the calculation is performed on a word by word basis. The average character width for the current word is then used for calculating the number of blank spaces preceding this word.
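As a rough illustration of the blank-count calculation for variable pitch, the following Python sketch divides the width of a gap by an average character width. The function name estimate_blank_count, the width units, and the rounding rule are assumptions; the text above only states that the average character width of the line (or of the current word) is used.

```python
def estimate_blank_count(gap_width, char_widths):
    """Estimate how many blanks a gap represents.

    gap_width   -- width of the blank space, in the same unit as char_widths
    char_widths -- widths of the characters used as the reference: the whole
                   line normally, or only the current word when the font
                   changes within the line
    """
    if not char_widths:
        raise ValueError("at least one character width is required")
    average_width = sum(char_widths) / len(char_widths)
    # Assumption: the gap is rounded to the nearest whole number of blanks.
    return round(gap_width / average_width)


# Line-based estimate: all characters in the line share one font.
line_widths = [10, 10, 10, 10, 10, 10]
print(estimate_blank_count(30, line_widths))

# Word-based fallback: only the widths of the current word are used.
word_widths = [12, 12]
print(estimate_blank_count(24, word_widths))
```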
Depending on the print type setting (such as handprint or machine print), blank spaces influence the separation of the results into words. For hand-printed text, each blank defines the boundary of a word. If the spaces between characters are sufficiently large, each character is interpreted as a word. On the other hand, because of the regularity of machine-printed characters, the recognition engine can frequently discern intentional from unintentional spaces.
As an example, consider the character string ABC E F G HIJ. If the original is hand-printed, the string is separated into the words ABC, E, F, G, and HIJ because of the spaces between the E, F, and G. With machine print, however, the same string resolves to the words ABC, EFG, and HIJ. Because of the regular spacing found in machine print, the recognition engine can derive a "typical" spacing pattern. Consequently, the algorithm guesses that the space on either side of the F is not intended.
There are 7 possible word type settings.
- Logical
-
Logical words are groups of alphabetic characters or numerals separated by spaces, certain punctuation marks, or font changes.
- Logical Alpha
-
Logical Alpha words are groups of alphabetic characters separated by spaces, certain punctuation marks, or font changes.
- Logical Numerical
-
Logical numerical words are groups of numerals separated by spaces, certain punctuation marks, or font changes.
- Geometrical
-
Geometrical words are any character string separated by the border of the zone, by spaces, or by font changes.
- Whole Line
-
The entire line is treated as a single word. Spaces and other breaks are ignored.
- Alpha Line
-
This is the same as Whole Line, except that logical numeric words are ignored.
- Numerical Line
-
This is the same as Whole Line, except that logical alpha words are ignored.
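To make the difference between the word types more concrete, here is a small Python sketch that splits one line of text in several ways. It is an approximation under stated assumptions, not the engine's algorithm: the regular expressions stand in for "spaces and certain punctuation marks", font changes and zone borders cannot be modeled on plain text, and the name split_words is invented for the example. Alpha Line and Numerical Line are left out because the descriptions above do not say exactly how the ignored words affect the resulting line.

```python
import re

# Assumed patterns: the source does not list which punctuation marks act
# as separators, so every non-matching character is treated as a break.
WORD_PATTERNS = {
    "Logical": r"[A-Za-z0-9]+",      # letters or numerals
    "Logical Alpha": r"[A-Za-z]+",   # letters only
    "Logical Numerical": r"[0-9]+",  # numerals only
    "Geometrical": r"\S+",           # anything between spaces
}


def split_words(line, word_type):
    """Split a recognized line into words for the given word type."""
    if word_type == "Whole Line":
        # The entire line is one word; spaces are not treated as breaks.
        return [line]
    if word_type not in WORD_PATTERNS:
        raise ValueError(f"unsupported word type: {word_type}")
    return re.findall(WORD_PATTERNS[word_type], line)


text = "Invoice 2024-07 total: 118.50"
for word_type in ("Logical", "Logical Alpha", "Logical Numerical",
                  "Geometrical", "Whole Line"):
    print(word_type, split_words(text, word_type))
```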
- Maximum length difference
-
You can determine how closely the length (number of characters) of a recognized word must match a dictionary word. The allowable range for this field is 0-3. If you set the difference to 0, the lengths must exactly match. If you set it to 3, the recognized text can be up to three characters longer or shorter than a matching dictionary word. If set to some value other than 0, this feature allows the recognition engine to select a best match from the dictionary when no exact match is available. For example, if set to 1, the recognized word "book" would be matched with "books" in the dictionary.
If the difference exceeds the limit you have specified, the recognized word does not match the dictionary word. In general, you should get the best results by leaving this field at the default value of 1.
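The effect of the maximum length difference can be illustrated with a short Python sketch. Only the length filter comes from the description above. How the engine actually ranks candidates is not documented here, so the similarity score from the standard difflib module, the min_similarity cutoff, and the function name best_dictionary_match are assumptions made for the example.

```python
from difflib import SequenceMatcher


def best_dictionary_match(word, dictionary, max_length_difference=1,
                          min_similarity=0.8):
    """Return the best dictionary match for word, or None if there is none."""
    if not 0 <= max_length_difference <= 3:
        raise ValueError("maximum length difference must be between 0 and 3")
    if word in dictionary:
        return word  # an exact match always wins
    best, best_score = None, 0.0
    for candidate in dictionary:
        # Documented rule: reject candidates whose length differs too much.
        if abs(len(candidate) - len(word)) > max_length_difference:
            continue
        # Assumed ranking: similarity ratio with an arbitrary cutoff.
        score = SequenceMatcher(None, word, candidate).ratio()
        if score >= min_similarity and score > best_score:
            best, best_score = candidate, score
    return best


words = ["books", "bookshelf"]
print(best_dictionary_match("book", words, max_length_difference=1))
print(best_dictionary_match("book", words, max_length_difference=0))
```

With a difference of 1 the sketch accepts "books" for "book". With a difference of 0 no candidate has the same length, so nothing matches.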