Format Definitions tab - Properties of Format Locator window

The Formats tab enables you to add one or more format definitions. These definitions are used to match and extract information from documents using pattern matching methods such as regular expressions and simple expressions. Additional algorithms such as trigrams and Levenshtein are available. In addition to pattern matching and algorithms, you can also use dictionaries. Any content that matches a dictionary entry is returned.

When a document is skewed, format recognition often fails due to overlapping lines. For this reason, apply image processing to deskew your documents before the Format Locator processes them.

The following settings are available:

Formats

This group has the following settings:

Formats

At the top of the Formats list, the following menu settings are available:

Delete icon Delete selected format definition

Select a format definition from the list below for deletion.

Add icon Regular expression

Regular expressions can be used to precisely specify complex format patterns. You can express a specific pattern using regular expression syntax.

You can choose from the following predefined expressions.

Amounts

The following predefined regular expressions for amounts are available.

  • 123,45 Amount With Comma

  • $123.45 Amount With Dot

  • $123 45 Amount With Blank

Dates

The following predefined regular expressions for dates are available.

  • 24.01.2003 Numerical Date DDMM(YY)YY

  • 01.24.2003 Numerical Date MMDD(YY)YY

  • 6. Dezember 2003 German Date I

  • 6. Dez 2003 German Date II

  • December 6, 2003 English Date I

  • 6 December 2003 English Date II

  • Dec 6, 2003 English Date III

  • 6 Dec 2003 English Date IV
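To illustrate what predefined expressions of this kind might look like, the following Python sketch defines two hypothetical patterns, one for an amount with a comma and one for a numerical date in DDMM(YY)YY form. The pattern strings and variable names are assumptions for illustration only; the predefined expressions that ship with the product may differ.

```python
import re

# Hypothetical patterns; the product's predefined expressions may differ.
# Digits, a comma as decimal separator, then exactly two decimal digits.
AMOUNT_WITH_COMMA = re.compile(r"\d+,\d{2}")
# Day and month (1-2 digits each), then a 4- or 2-digit year, separated by dots.
NUMERICAL_DATE_DDMMYYYY = re.compile(r"\d{1,2}\.\d{1,2}\.(?:\d{4}|\d{2})")

text = "Invoice dated 24.01.2003, total 123,45"
amounts = AMOUNT_WITH_COMMA.findall(text)
dates = NUMERICAL_DATE_DDMMYYYY.findall(text)
print(amounts, dates)
```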

Add icon Simple expression

Simple expressions can be used to specify simple format patterns. In simple expressions, most characters represent themselves. Some characters, however, have a special meaning.

Add icon Levenshtein

The Levenshtein format definition is an error-tolerant algorithm that finds each occurrence of the specified string, including matches where one or two characters are incorrect due to typographical errors or misspellings. This setting is good for OCR error tolerance. The confidence of the result is very sensitive to the number of OCR errors.
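The idea behind Levenshtein matching can be sketched in Python as follows. This is a minimal illustration of the general edit-distance technique, not the Format Locator's implementation; the function names, the word-by-word comparison, and the threshold of two errors are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def find_tolerant(text: str, target: str, max_errors: int = 2) -> list[str]:
    """Return the words in text that are within max_errors edits of target."""
    return [
        word for word in text.split()
        if levenshtein(word.lower(), target.lower()) <= max_errors
    ]


matches = find_tolerant("Invoce number 4711 Invoice lnvoice", "Invoice")
print(matches)
```

In this sketch, "Invoce" (a missing letter) and "lnvoice" (a typical OCR confusion of "I" and "l") are each one edit away from "Invoice", so both are accepted along with the exact spelling.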

Add icon Trigram

The trigram format definition is also an error-tolerant algorithm. Here, an expression is separated into groups of three characters called trigrams. The number of identical groups determines if there is a match. Use this setting for short phrases or when matching phonetic text.
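A minimal Python sketch of trigram comparison is shown below. It splits each string into overlapping groups of three characters and scores the overlap as shared trigrams divided by all distinct trigrams. The scoring formula and function names are assumptions chosen for illustration; the Format Locator may count and weight identical groups differently.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of overlapping three-character groups in text."""
    lowered = text.lower()
    return {lowered[i:i + 3] for i in range(len(lowered) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Score two strings from 0.0 to 1.0 by the share of trigrams they have in common."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0  # strings shorter than three characters have no trigrams
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("Invoice")))
print(trigram_similarity("Invoice", "Invoce"))
```

Here "Invoice" yields five trigrams and "Invoce" yields four, and the two strings share "inv" and "nvo".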

Insert a Sample Regular Expression icon Insert a sample or partial simple expression into the selected simple expression

This setting is available only when a simple expression is selected in the list of Formats.

Select to add a partial predefined simple expression to the simple expression. Choose from the following settings:

Numeric

For numbers, choose from the following predefined format definitions:

  • # Number (0-9)

  • #[m-n] Between m And n Numbers

  • '#####' Any Numeric Expression with a Fixed Length

Alphabetic

For alphabetic characters, choose from the following predefined format definitions:

  • @ Any Single Alphabetic Character (a-Z)

  • @[m-n] Any Alphabetic Character (a-Z) Repeated Between m and n Times

  • '@@@@@' Any Alphabetic Expression with a Fixed Length

Alphanumeric

For alphanumeric characters, choose from the following predefined format definitions:

  • ? Alphanumeric Character

  • ?[m-n] Any Alphanumeric Character Repeated Between m and n Times

  • '?????' Any Alphanumeric Expression with a Fixed Length
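One way to picture how these placeholders behave is to translate a simple expression into an equivalent regular expression. The Python sketch below does this under assumed semantics: # stands for one digit, @ for one letter, ? for one letter or digit, a following [m-n] repeats the placeholder between m and n times, and every other character represents itself. The function name and the exact character classes are hypothetical and are not taken from the product.

```python
import re

# Assumed meaning of each placeholder character.
PLACEHOLDERS = {"#": r"\d", "@": "[A-Za-z]", "?": "[A-Za-z0-9]"}
# An optional repetition suffix such as [2-4] directly after a placeholder.
REPEAT = re.compile(r"\[(\d+)-(\d+)\]")


def simple_to_regex(expr: str) -> str:
    """Translate a simple expression into a regular expression string."""
    parts = []
    i = 0
    while i < len(expr):
        ch = expr[i]
        if ch in PLACEHOLDERS:
            parts.append(PLACEHOLDERS[ch])
            i += 1
            repeat = REPEAT.match(expr, i)
            if repeat:
                parts.append("{%s,%s}" % (repeat.group(1), repeat.group(2)))
                i = repeat.end()
        else:
            parts.append(re.escape(ch))  # all other characters represent themselves
            i += 1
    return "".join(parts)


pattern = simple_to_regex("@@-#[2-4]")
print(bool(re.fullmatch(pattern, "AB-123")))
```

With these assumptions, the simple expression @@-#[2-4] accepts two letters, a hyphen, and two to four digits, so "AB-123" matches while "AB-1" does not.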

Insert a Sample Regular Expression icon Insert sample or partial regular expression or a dictionary into the selected regular expression

This setting is available only when a regular expression is selected in the list of Formats.

Select to add a partial predefined regular expression to the selected regular expression. Choose from the following settings:

Numbers

For numbers, choose from the following predefined format definitions:

  • \d Number

  • \d? Optional number

  • \d+ One or more numbers

  • \d{n} n numbers

  • \d{m,n} Between m and n numbers

Characters

For characters, choose from the following predefined format definitions:

  • . Any single character

  • .? Optional character

  • .+ Any character one or more times

  • .{n} Any character n times

  • .{m,n} Any character between m and n times

Dictionary

Select a dictionary to insert from the list of available dictionaries. If the dictionary you want is not listed, click Configure Dictionaries to add another dictionary to your project.

If there are no dictionaries added to your project, this setting is not available. Click Configure Dictionaries to add a dictionary to your project.

Configure Dictionaries

From the submenu, select one of the following:

Dictionary Settings

Click here to open the Dictionary Settings window where you can select a dictionary added via the Project Settings.

Entries in that dictionary will be located on a document. For example, if you use a dictionary that contains commonly used month names, that dictionary can be used as part of a format definition that finds dates on a document. Since the dictionary contains "December" and "Dec," both can be located on a document.
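The month-name example can be sketched in Python by building a regular expression from the dictionary entries. The dictionary contents, the variable names, and the way the entries are combined into an alternation are assumptions for illustration; they do not show how the Format Locator embeds a dictionary internally.

```python
import re

# Hypothetical dictionary of month names, including abbreviations.
MONTHS = ["December", "Dec", "January", "Jan"]

# Try longer entries first so that "December" is preferred over "Dec".
alternation = "|".join(
    re.escape(entry) for entry in sorted(MONTHS, key=len, reverse=True)
)
# A day, a month name from the dictionary, and a four-digit year.
DATE = re.compile(rf"\d{{1,2}} (?:{alternation}) \d{{4}}")

found = DATE.findall("Due 6 December 2003, shipped 6 Dec 2003")
print(found)
```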

Refresh dictionary

Each dictionary has its own entry in this submenu. Select the entry to refresh that dictionary so that it uses the most up-to-date data.

The following settings are available for each format:

Use

Select the check box to use a format, or clear it to disable the format. For example, enabling the format definitions one at a time helps you test each format separately. If you have two similar formats, test them to see which one provides better performance in production. To disable the less effective format, clear the Use setting. Do not delete the format, because it may be useful at a later date.

Format Type

This displays the format type. This can be a regular expression, simple expression, Levenshtein, or trigram format definition.

Format Expression

This displays the syntax of the format definition.

Whole Word

This setting is available only for regular expression format definitions.

Selecting this setting Whole Word icon assigns a lower level of confidence to unwanted alternatives that lie within longer strings of characters. Consider a format designed to search for a 5-digit zip code. Selecting this setting assigns a low level of confidence to unwanted alternatives that appear within longer numbers on the document, such as 11-digit telephone numbers.
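The zip code example can be sketched in Python as follows. Each match of a 5-digit pattern receives a lower confidence when it is embedded in a longer run of letters or digits. The confidence values 1.0 and 0.1 and the function name are hypothetical; the actual confidence calculation is not described here.

```python
import re

ZIP = re.compile(r"\d{5}")


def score_matches(text: str, whole_word: bool = True) -> list[tuple[str, float]]:
    """Return (match, confidence) pairs; embedded matches score low if whole_word is set."""
    results = []
    for m in ZIP.finditer(text):
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        embedded = before.isalnum() or after.isalnum()
        # Hypothetical confidence values, chosen only to show the effect.
        confidence = 0.1 if (whole_word and embedded) else 1.0
        results.append((m.group(), confidence))
    return results


results = score_matches("Irvine CA 92618 Tel 19495551234")
print(results)
```

In this sketch the standalone "92618" keeps full confidence, while the 5-digit runs found inside the 11-digit telephone number are scored low.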

Ignore Case

If you select this setting Ignore Case icon, the Format Locator ignores case for this format. If it is not selected, the Format Locator will ignore any alternative that differs in case. For example, a regular expression that searches for "Last Name" with the ignore case setting disabled does not return "Last name" as a confident result.

Ignore Blanks

If you select this setting Ignore Blanks icon, spaces are ignored when a format is run. For example, a Social Security Number may be printed with spaces between the numbers. This setting ensures that a result is found.
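The effect of ignoring blanks can be sketched in Python by removing spaces from the text before the pattern is applied. The Social Security Number pattern and the helper function are assumptions for illustration, not the product's own expression or implementation.

```python
import re

# Hypothetical pattern: nine digits with optional hyphens after the 3rd and 5th digit.
SSN = re.compile(r"\d{3}-?\d{2}-?\d{4}")


def matches_ignoring_blanks(pattern: re.Pattern, text: str) -> bool:
    """Remove all spaces from text, then test whether the whole text matches."""
    return pattern.fullmatch(text.replace(" ", "")) is not None


print(SSN.fullmatch("123 45 6789") is not None)       # spaces prevent the match
print(matches_ignoring_blanks(SSN, "123 45 6789"))    # spaces are ignored
```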

Search Exact

This setting is available only for regular expression format definitions that contain a dictionary.

If this setting Search Exact icon is selected, a value is only returned if there is an exact match in the dictionary.

This ensures precise matching. For example, if "January" is misread as "Janvary", the misread date is not returned.

Ignore Characters

Type in characters that can be ignored by the format definition search.

Description

Type a description of the format definition. Include sample matches so it is clear what is supposed to match here.

Error Description

If there are any issues with the format definition, a read-only error is displayed in this column.

Below the Formats list is the following setting:

Test value

Type in some text to test which of the format definitions finds the desired string.

Definitions for the buttons at the bottom of this window can be found in Common Transformation Designer Buttons.