General tab

Use this tab to set general OCR attributes.

Option

Description

Activate

Use this combo box to activate the component according to a condition (see Conditional fields under Appendices).

CPU usage

Use this setting to limit processor usage by the OPOCR component. The default value is 100%. Modifying the setting requires some knowledge of processor activity. For example, on a four-processor system, choose a quartile percentage (25%, 50%, 75%, or 100%) rather than a value such as 40%. You can experiment with this setting to make the most efficient use of system resources.
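
The four-processor example generalizes to other processor counts: the percentages that line up with whole processors are multiples of 100/N. The following Python sketch only illustrates that arithmetic. The function name cpu_usage_steps is hypothetical and is not part of OPOCR, and the assumption that whole-processor values are preferable is taken from the example above.

```python
from __future__ import annotations


def cpu_usage_steps(processor_count: int) -> list[int]:
    """Return the CPU usage percentages that correspond to whole processors.

    Hypothetical helper for illustration only; it is not an OPOCR function.
    """
    if processor_count < 1:
        raise ValueError("processor_count must be at least 1")
    # One step per processor: 1/N, 2/N, ..., N/N of total capacity.
    return [round(100 * k / processor_count) for k in range(1, processor_count + 1)]


print(cpu_usage_steps(4))  # [25, 50, 75, 100]
```

On a four-processor system this yields the quartile values listed above, so 40% would fall between two steps.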

Pass through

Set this option to "Yes" to pass the original document to subsequent components in the workflow. You can use conditions in this field (see Conditional fields under Appendices).

If you enter an invalid condition in the Pass through box, the option defaults to "Yes".

Input files

Defines the file types that the component will process.

Enter a wildcard character and extension (such as *.pdf) to define a file type. Separate entries using a comma (,) or semicolon (;). By default this box lists the following file types: *.pdf; *.tif; *.tiff; *.jpg; *.jpeg; *.jfif; *.bmp; *.pcx; *.dcx; *.jp2; *.jpc; *.j2c; *.gif; *.png; *.jb2

You can use the following wildcard characters to specify file types:

  • * — Any string of characters.

  • ? — Any single character.
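
To preview which files a given Input files value would select, the masks can be approximated with Python's standard fnmatch module. This is a sketch under assumptions, not the component's own matching logic: the helper names parse_masks and is_accepted are hypothetical, case-insensitive matching is assumed, and fnmatch also accepts [seq] patterns that this page does not mention.

```python
from __future__ import annotations

import re
from fnmatch import fnmatchcase


def parse_masks(field: str) -> list[str]:
    """Split an Input files value on commas or semicolons and drop empty entries."""
    return [mask.strip() for mask in re.split(r"[,;]", field) if mask.strip()]


def is_accepted(filename: str, masks: list[str]) -> bool:
    """Return True if the file name matches any mask.

    Matching is case-insensitive here; that is an assumption, not documented behavior.
    """
    name = filename.lower()
    return any(fnmatchcase(name, mask.lower()) for mask in masks)


masks = parse_masks("*.pdf; *.tif; *.tiff, *.jp?")
print(is_accepted("scan001.PDF", masks))  # True
print(is_accepted("photo.jpg", masks))    # True, matches *.jp?
print(is_accepted("notes.docx", masks))   # False
```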

Resolution

You can use this setting to minimize resource use for very large files.

  • Unaltered on file load — Forces OPOCR to load images using the file resolution, regardless of the size of the file.

  • Engine-optimized on file load — Allows OPOCR to change the image resolution that it uses to load an image for processing. This allows the engine to minimize memory usage for very large files.

Languages

Select the language of the text to be recognized from the list. If necessary, multiple languages may be entered by separating language names with a comma. You can use RRTs in this field to define language recognition at run time.

RRTs used in this text box must resolve to internal language names at run time. To view internal language names, expand a language category node in the Select language dialog box and select a language. The internal name appears at the bottom of the dialog box.

Recognition mode

Select the recognition mode, that is, the desired balance between speed and error rate. Three recognition modes are available:

  • Full mode — Recognition is slow, but the error rate is the lowest possible.

  • Balanced mode — An intermediate mode between Full and Fast modes.

  • Fast mode — Provides 2 to 2.5 times faster recognition at the cost of a moderately increased error rate (1.5 to 2 times more errors). On texts with good print quality and simple layouts, the OPOCR component makes an average of 1 to 2 errors per page, so this moderate increase can be tolerated in many cases, such as full-text indexing with "fuzzy" searches.
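
The Fast mode trade-off can be estimated with simple arithmetic. The Python sketch below applies the stated factors (2 to 2.5 times faster, 1.5 to 2 times more errors) to a baseline. The function name, the 10 seconds per page, and the 100-page batch are hypothetical values chosen for illustration. The text does not say which mode the factors are measured against, so the baseline here is an assumption.

```python
def fast_mode_estimate(pages, baseline_seconds_per_page, baseline_errors_per_page):
    """Estimate Fast mode time and error ranges from baseline figures.

    Hypothetical helper; the factors come from the Fast mode description,
    the baseline values are supplied by the caller.
    """
    speedup_low, speedup_high = 2.0, 2.5        # "2-2.5 times faster"
    errors_low, errors_high = 1.5, 2.0          # "1.5-2 times more errors"

    baseline_time = pages * baseline_seconds_per_page
    baseline_errors = pages * baseline_errors_per_page
    return {
        # Best case uses the highest speedup, worst case the lowest.
        "time_seconds": (baseline_time / speedup_high, baseline_time / speedup_low),
        "errors": (baseline_errors * errors_low, baseline_errors * errors_high),
    }


estimate = fast_mode_estimate(pages=100, baseline_seconds_per_page=10, baseline_errors_per_page=1)
print(estimate["time_seconds"])  # (400.0, 500.0)
print(estimate["errors"])        # (150.0, 200.0)
```

With these assumed inputs, a batch that takes 1000 seconds at baseline would take roughly 400 to 500 seconds in Fast mode, and about 100 errors would become about 150 to 200.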

Recognition type

Select the type of text to be recognized. The text type setting influences recognition speed and quality. If the text type is set incorrectly, the OPOCR engine might recognize images more slowly and less accurately. The following options are available:

  • Default — Denotes a common typographical type of text.

  • Dot matrix — Used for text printed with a dot matrix printer.

  • MICR — Used for Magnetic Ink Character Recognition, that is, recognition of numeric characters printed with special magnetic ink.

  • MICR CMC-7 — A special barcode font used for Magnetic Ink Character Recognition that contains ten specially designed numeric characters (0 through 9) and five special symbols. This font is widely used in Europe, Brazil, and Mexico.

  • OCR A — Denotes text printed in the corresponding monospaced font designed for Optical Character Recognition. It follows the ISO 1073-1:1976 standard. There is also a German standard for OCR-A, DIN 66008.

  • OCR B — Denotes text printed in the corresponding monospaced font designed for Optical Character Recognition. It follows the ISO 1073/II-1976 (E) standard, refined in 1979 ("letterpress" design, size I).

  • Omnifont — Used to recognize virtually any font that maintains fairly standard character shapes.

  • Detect — Specifies that the OPOCR engine detects the text type automatically. Auto-detection can slow character recognition.

Output OCR text as

This group allows you to specify how to output the recognized text.

File

Select this check box if you want to save recognized text as a file. The file is passed to the subsequent components.

Specify the file format for saving the recognition results by typing it or by selecting it from the drop-down list. Possible formats are TXT, CSV, HTML, PDF, PDF/A, PDF (Keep original), PPTX, RTF, DOCX, XLS, XLSX, and OPD. To specify multiple file formats, separate them with commas. You can use RRTs from another component in this box. Specify the parameters of the output file in the Format Settings dialog box (see Format Settings).
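
When the format list is typed by hand or produced by an RRT, it can help to check the value before it reaches the component. The Python sketch below splits a comma-separated value and compares each entry with the formats listed above. It is an illustration only: the names SUPPORTED_FORMATS and parse_formats are hypothetical, and the case-insensitive comparison is an assumption, since this page does not describe how the component validates the box.

```python
from __future__ import annotations

# Formats listed in the File option description.
SUPPORTED_FORMATS = [
    "TXT", "CSV", "HTML", "PDF", "PDF/A", "PDF (Keep original)",
    "PPTX", "RTF", "DOCX", "XLS", "XLSX", "OPD",
]


def parse_formats(field: str) -> list[str]:
    """Split a comma-separated format list and reject unknown entries.

    Hypothetical helper; comparison ignores case by assumption.
    """
    lookup = {fmt.lower(): fmt for fmt in SUPPORTED_FORMATS}
    result: list[str] = []
    for entry in field.split(","):
        key = entry.strip().lower()
        if not key:
            continue  # ignore empty entries such as a trailing comma
        if key not in lookup:
            raise ValueError(f"Unknown output format: {entry.strip()!r}")
        if lookup[key] not in result:
            result.append(lookup[key])  # keep the first occurrence only
    return result


print(parse_formats("txt, PDF/A, docx"))  # ['TXT', 'PDF/A', 'DOCX']
```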

Set up output file

Click this button to open the Format Settings dialog box.

Run-time replacement ~FRO::OCRText~

Select this check box to save recognized text as the ~FRO::OCRText~ Runtime Replacement Tag.

Zoned OCR

Select this check box to use zoned OCR. Recognized fields are output as RRTs, as CSV files, or both.

Set up zoned OCR

Click this button to open the Setup Zoned OCR dialog box, where you configure zoned OCR settings. This button is enabled only if the Zoned OCR check box is selected.

You must select at least one of the check boxes in the Output OCR text as group.