Format Definitions tab - Format Locator Properties window

The Formats tab of the Format Locator Properties window enables you to add one or more format definitions. These definitions are used to match and extract information from documents using pattern matching methods such as regular expressions and simple expressions. Additional algorithms such as trigrams and Levenshtein are available. In addition to pattern matching and algorithms, you can also use dictionaries. Any content that matches a dictionary entry is returned. The following options are available:

Formats

This group has the following options:

Formats

At the top of the Formats list, the following menu options are available:

Delete icon Delete selected format definition

Select a format definition from the list below for deletion.

Add icon Regular expression

Regular expressions can be used to precisely specify complex format patterns. You can express a specific pattern using regular expression syntax.

Add icon Simple expression

Simple expressions can be used to specify simple format patterns. In simple expressions, most characters represent themselves. Some characters, however, have a special meaning.

Add icon Levenshtein

The Levenshtein format definition is an error-tolerant algorithm that finds each occurrence of the specified string, including matches where one or two characters are incorrect due to typographical errors or misspellings. This makes it a good choice for tolerating OCR errors. The confidence of the result is very sensitive to the number of OCR errors.
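
To illustrate the idea behind error-tolerant matching, the following is a minimal sketch of the classic Levenshtein edit distance. It is not the product's implementation; the function names `levenshtein` and `fuzzy_find` and the `max_errors` threshold are assumptions chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_find(target: str, words: list[str], max_errors: int = 2) -> list[str]:
    """Return the words that are within max_errors edits of target (hypothetical helper)."""
    return [w for w in words if levenshtein(target, w) <= max_errors]


# Typical OCR confusions: "I" read as "l", "i" read as "1".
candidates = ["lnvoice", "Invo1ce", "Order"]
matches = fuzzy_find("Invoice", candidates)
```

In this sketch, each additional OCR error increases the distance by one, which mirrors why the confidence of a Levenshtein result drops quickly as errors accumulate.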

Add icon Trigram

The trigram format definition is also an error-tolerant algorithm. Here, an expression is separated into groups of three characters called trigrams. The number of identical groups determines if there is a match. Use this option for short phrases or when matching phonetic text.
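
As a rough illustration of trigram comparison, the sketch below splits two strings into overlapping three-character groups and scores them by the share of groups they have in common. The scoring formula (shared groups divided by all distinct groups) and the names `trigrams` and `trigram_similarity` are assumptions for this example; the product may score matches differently.

```python
def trigrams(text: str) -> set[str]:
    """Split text into its overlapping three-character groups."""
    lowered = text.lower()
    return {lowered[i:i + 3] for i in range(len(lowered) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Share of trigrams common to both strings (assumed scoring, 0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0  # strings shorter than three characters have no trigrams
    return len(ta & tb) / len(ta | tb)


close = trigram_similarity("invoice number", "invoice numbar")
unrelated = trigram_similarity("invoice number", "delivery date")
```

A single wrong character only disturbs the few trigrams that contain it, so `close` stays high while `unrelated` falls to zero.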

Insert a Sample Regular Expression icon Insert sample or partial simple expression in the selected simple expression

This option is available only when a simple expression is selected in the list of Formats.

Select to add a partial predefined simple expression to the simple expression. Choose from the following options:

Numeric

For numbers, choose from the following predefined format definitions:

  • # Represents digits 0123456789 as well as upper and lowercase O

  • #[m-n] Between m and n Numbers

  • '#####' Any Numeric Expression with a Fixed Length

Alphabetic

For alphabetic characters, choose from the following predefined format definitions:

  • @ Any Single Alphabetic Character (a-Z)

  • @[m-n] Any Alphabetic Character (a-Z) Repeated Between m and n Times

  • '@@@@@' Any Alphabetic Expression with a Fixed Length

Alphanumeric

For alphanumeric characters, choose from the following predefined format definitions:

  • ? Alphanumeric Character

  • ?[m-n] Any Alphanumeric Character Repeated Between m and n Times

  • '?????' Any Alphanumeric Expression with a Fixed Length
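
One way to reason about the placeholders above is to translate them into an equivalent regular expression. The sketch below does this in Python under several assumptions taken only from the descriptions in these lists: `#` covers digits plus upper and lowercase O, `@` covers a-z and A-Z, `?` covers letters and digits, and `[m-n]` is a repeat range. The function `simple_to_regex` is hypothetical and is not part of the product.

```python
import re

# Assumed character classes, derived from the list descriptions only.
TOKEN_CLASSES = {
    "#": "[0-9Oo]",      # digits plus upper and lowercase O
    "@": "[A-Za-z]",     # a single alphabetic character
    "?": "[A-Za-z0-9]",  # a single alphanumeric character
}

# A placeholder symbol, optionally followed by a repeat range such as [2-4].
_TOKEN = re.compile(r"([#@?])(?:\[(\d+)-(\d+)\])?")


def simple_to_regex(expression: str) -> str:
    """Translate a simple expression into a regex string (illustrative only)."""
    parts = []
    pos = 0
    for match in _TOKEN.finditer(expression):
        # Characters between placeholders represent themselves.
        parts.append(re.escape(expression[pos:match.start()]))
        symbol, low, high = match.groups()
        quantifier = f"{{{low},{high}}}" if low is not None else ""
        parts.append(TOKEN_CLASSES[symbol] + quantifier)
        pos = match.end()
    parts.append(re.escape(expression[pos:]))
    return "".join(parts)


pattern = simple_to_regex("@@-####")
is_match = re.fullmatch(pattern, "AB-12O4") is not None
```

Here the letter O in `12O4` still matches a `#` placeholder, which reflects the tolerance for the common OCR confusion between 0 and O.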

Insert a Sample Regular Expression icon Insert sample or partial regular expression or dictionary into the selected regular expression

This option is available only when a regular expression is selected in the list of Formats.

Select to add a partial predefined regular expression to the selected regular expression. Choose from the following options:

Numbers

For numbers, choose from the following predefined format definitions:

  • \d Number

  • \d? Optional number

  • \d+ One or more numbers

  • \d{n} n numbers

  • \d{m,n} Between m and n numbers

Characters

For characters of any kind, choose from the following predefined format definitions:

  • . Any single character

  • .? Optional character

  • .+ Any character one or more times

  • .{n} Any character n times

  • .{m,n} Any character between m and n times
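
The quantifiers in the two lists above can be tried out in any regular expression engine. The sketch below uses Python's `re` module as a stand-in, which may differ in details from the engine used by the product. In standard regular expression syntax, a repeat range is written with a comma, as in `\d{2,4}`. The sample text is invented for this example.

```python
import re

text = "Order 7 shipped 2024-03-15, ref 123456"

# \d+ : one or more digits
digit_runs = re.findall(r"\d+", text)

# \d{4} : exactly four digits, bounded so longer numbers are not split
four_digits = re.findall(r"\b\d{4}\b", text)

# \d{2,4} : between two and four digits
two_to_four = re.findall(r"\b\d{2,4}\b", text)

# \d? : an optional single digit after the literal "No."
optional_digit = [bool(re.fullmatch(r"No\.\d?", s)) for s in ("No.", "No.5", "No.55")]

# .{3} : any three characters following the literal "ref "
three_chars = re.findall(r"ref .{3}", text)
```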

Dictionary

Select a dictionary to insert from the list of available dictionaries. If the dictionary you want is not listed, click Configure Dictionaries to add another dictionary to your project.

Important: If there are no dictionaries added to your project, this option is not available. Click Configure Dictionaries to add a dictionary to your project.

Configure Dictionaries

From the submenu, select one of the following:

Dictionary Settings

Click here to open the Dictionary Settings window where you can select a dictionary added via the Project Settings.

Entries in that dictionary will be located on a document. For example, if you use a dictionary that contains commonly used month names, that dictionary can be used as part of a format definition that finds dates on a document. Since the dictionary contains December and Dec, both can be located on a document.
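
The month-name example can be sketched as a dictionary whose entries are combined into one part of a larger pattern. The entries in `MONTHS`, the date layout (day, month, four-digit year), and the helper `build_date_pattern` are assumptions for this illustration and do not reflect how the product stores or evaluates dictionaries.

```python
import re

# Hypothetical dictionary entries: full and abbreviated month names.
MONTHS = ["January", "Jan", "December", "Dec"]


def build_date_pattern(entries: list[str]) -> re.Pattern:
    """Combine dictionary entries into a date pattern (illustrative only)."""
    # Longest entries first so "December" is tried before "Dec".
    ordered = sorted(entries, key=len, reverse=True)
    alternatives = "|".join(re.escape(entry) for entry in ordered)
    return re.compile(rf"\b\d{{1,2}} (?:{alternatives})\.? \d{{4}}\b")


pattern = build_date_pattern(MONTHS)
found = pattern.findall("Issued 24 December 2023, due 15 Dec 2024.")
```

Because both December and Dec are dictionary entries, both date spellings in the sample text are located.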

Refresh dictionary

The submenu contains an entry for each dictionary. Select an entry to refresh that dictionary so that it uses the most up-to-date data.

The following options are available for each format:

Use Format

Select or clear the check box to enable or disable each format. For example, enabling the format definitions one at a time lets you test each format separately. If you have two similar formats, test them to see which one performs better in production. To disable the less effective format, clear the Use Format option rather than deleting the format, because it may be useful at a later date.

Format Type

This shows the format type. This can be a regular expression, simple expression, Levenshtein, or trigram format definition.

Format Expression

This shows the syntax of the format definition.

Whole Word

This option is available only for regular expression format definitions.

Selecting this option (Whole Word icon) assigns a lower level of confidence to unwanted alternatives that lie within longer strings of characters. For example, consider a format designed to search for a 5-digit zip code. With Whole Word selected, 5-digit sequences that appear within longer numbers on the document, such as 11-digit telephone numbers, receive a low level of confidence.
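
The effect described for the zip code example can be sketched as a scoring step applied to each match. The confidence values 1.0 and 0.2 and the function `find_zip_codes` are invented for illustration; the actual confidence calculation used by the Format Locator is not documented here.

```python
import re


def find_zip_codes(text: str) -> list[tuple[str, float]]:
    """Return 5-digit matches with an assumed confidence for each."""
    results = []
    for match in re.finditer(r"\d{5}", text):
        before = text[match.start() - 1] if match.start() > 0 else " "
        after = text[match.end()] if match.end() < len(text) else " "
        # A match is a whole word when no letter or digit touches it.
        standalone = not (before.isalnum() or after.isalnum())
        results.append((match.group(), 1.0 if standalone else 0.2))
    return results


scored = find_zip_codes("Call 18005551234 or mail to Irvine, CA 92618")
```

In this sample, the two 5-digit fragments taken from the 11-digit telephone number are kept as alternatives but scored low, while the standalone zip code keeps full confidence.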

Ignore Case

If you select this option (Ignore Case icon), the Format Locator ignores case for this format. If it is not selected, the Format Locator ignores any alternative that differs in case. For example, a regular expression that searches for Last Name with the ignore case option disabled does not return Last name as a confident result.
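
The Last Name example corresponds to the case-insensitive flag found in most regular expression engines. The sketch below uses Python's `re.IGNORECASE` as a stand-in; the helper `is_found` is a hypothetical name for this example.

```python
import re


def is_found(pattern: str, text: str, ignore_case: bool) -> bool:
    """Search text for pattern, optionally ignoring letter case."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, text, flags) is not None


with_option = is_found("Last Name", "Last name: Smith", ignore_case=True)
without_option = is_found("Last Name", "Last name: Smith", ignore_case=False)
```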

Ignore Blanks

If you select this option (Ignore Blanks icon), spaces are ignored when a format is run. For example, a Social Security Number may be printed with spaces between the numbers. With this option selected, the number is still found despite the spaces.

Search Exact

This option is available only for regular expression format definitions that contain a dictionary.

If this option (Search Exact icon) is selected, a value is returned only if there is an exact match in the dictionary.

This ensures precise matching. For example, if January is misread as Janvary, the misread value is not returned.

Ignore Characters

Type in characters that can be ignored by the format definition search.

Description

Type a description of the format definition. Include sample matches so it is clear what is supposed to match here.

Error Description

If there are any issues with the format definition, a read-only error is displayed in this column.

Below the Formats list is the following option:

Test value

Type in some text to test which of the format definitions finds the desired string.

Definitions for the buttons at the bottom of this window can be found in Common Transformation Designer Buttons.