Recognition Languages for Advanced and Enhanced OCR Full Text Window

Use this window to select languages for OCR Full Text recognition.

Text Encoding

Select one of the following text encoding methods.

  • UTF16 is the native Unicode format, where every character or symbol in the range U+0000-FFFF is represented by a two-byte sequence.

  • UTF8 is a format that represents a 16-bit Unicode string as a sequence of bytes: ASCII text (U+007F or less) remains unchanged as a single byte, U+0080-07FF (such as Latin, Greek, Cyrillic, Hebrew, and Arabic) is converted to a 2-byte sequence, and U+0800-FFFF (such as Chinese, Japanese, and Korean) becomes a 3-byte sequence.

  • ANSI is a character set with one byte per symbol. If you select ANSI, you also need to select a code page. The list of available languages changes according to the code page you select.
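The byte sizes described above can be checked with a short Python sketch. This is an illustration only and is not part of the product: the function name byte_lengths is hypothetical, and Python's cp1252 codec stands in for an ANSI code page.

```python
def byte_lengths(ch: str) -> dict:
    """Return the encoded size of one character in UTF-16, UTF-8, and code page 1252."""
    sizes = {
        "utf16": len(ch.encode("utf-16-le")),  # little-endian, no byte-order mark
        "utf8": len(ch.encode("utf-8")),
    }
    try:
        sizes["cp1252"] = len(ch.encode("cp1252"))  # one byte per symbol
    except UnicodeEncodeError:
        sizes["cp1252"] = None  # the character is not in this code page
    return sizes


# Sample characters, one from each range mentioned in the list above.
samples = {
    "A": "ASCII (U+0041)",
    "\u00e9": "Latin (U+00E9)",
    "\u0416": "Cyrillic (U+0416)",
    "\u4e2d": "Chinese (U+4E2D)",
}

for ch, label in samples.items():
    print(label, byte_lengths(ch))
```

In this sketch, the ASCII letter takes one byte in UTF-8, the Latin and Cyrillic letters take two, and the Chinese character takes three, while all four take two bytes in UTF-16. The Cyrillic and Chinese characters cannot be encoded in code page 1252 at all, which is why an ANSI selection also requires a suitable code page.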

Using an output format that supports only UTF16, such as Rich Text Format, Microsoft Word, or Microsoft Excel, overrides your text encoding selections (including the code page) and automatically sets the text encoding to UTF16.
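The override rule can be pictured as the following Python sketch. The function effective_encoding and the set UTF16_ONLY_FORMATS are hypothetical names used for illustration. The set holds only the three formats named above, and the product may treat other formats the same way.

```python
# Assumption: only the three formats named in the text are listed here.
UTF16_ONLY_FORMATS = {"Rich Text Format", "Microsoft Word", "Microsoft Excel"}


def effective_encoding(output_format, requested_encoding, code_page=None):
    """Return the (encoding, code_page) pair that is actually used for output."""
    if output_format in UTF16_ONLY_FORMATS:
        # The format supports only UTF16, so the user's selections are overridden.
        return ("UTF16", None)
    if requested_encoding != "ANSI":
        # A code page is meaningful only together with ANSI.
        return (requested_encoding, None)
    return (requested_encoding, code_page)


print(effective_encoding("Microsoft Word", "ANSI", 1252))
print(effective_encoding("Plain Text", "ANSI", 1252))
```

The first call models a Microsoft Word export in which an ANSI selection with code page 1252 is replaced by UTF16. The second call leaves the selection unchanged.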

The following table shows which output formats are supported by the three types of text encoding.

Output Format                    UTF16    UTF8    ANSI

Plain Text (.txt)

Rich Text Format (.rtf)

HTML (.mht)

Microsoft Word (.doc)

Comma-Separated Values (.csv)

Microsoft Excel (.xls)

The list of code pages contains all the available code pages that can be used with the ANSI character set. When ANSI is not selected, the code page drop-down list is unavailable.

Each code page supports certain languages, as shown in the following table.

Code Page                     Supported Languages

[ 1250 WINDOWS LATIN 2 ]      ALBANIAN, CROATIAN, CZECH, HUNGARIAN, POLISH, ROMANIAN, SERBIAN-LATIN, SLOVAK, SLOVENIAN, UZBEK-LATIN

[ 1251 WINDOWS CYRILLIC ]     AZERI-CYRILLIC, BELARUSIAN, BULGARIAN, KAZAKH, MACEDONIAN, MONGOL, RUSSIAN, SERBIAN-CYRILLIC, TATAR, UKRAINIAN, UZBEK-CYRILLIC

[ 1252 WINDOWS LATIN 1 ]      AFRIKAANS, BASQUE, BRAZILIAN, CATALAN, DANISH, DUTCH, DUTCH BELGIUM, ENGLISH, FINNISH, FRENCH, GALICIAN, GERMAN, GERMAN-LUXEMBOURG, GERMAN-NEW-SPELLING, ICELANDIC, INDONESIAN, IRISH, ITALIAN, MALAY, NORWEGIAN-BOKMAL, NORWEGIAN-NYNORSK, PORTUGUESE, SPANISH, SWAHILI, SWEDISH

[ 1253 WINDOWS GREEK ]        GREEK

[ 1254 WINDOWS TURKISH ]      TURKISH

[ 1257 WINDOWS BALTIC ]       ESTONIAN, LATVIAN, LITHUANIAN
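To see why the language list depends on the code page, the following Python sketch tests whether a piece of text can be stored in a given code page. It assumes that Python's standard codecs named cp1250 through cp1257 correspond to the Windows code pages listed above. The function fits_code_page is a hypothetical helper, not part of the product.

```python
def fits_code_page(text: str, code_page: int) -> bool:
    """Return True if every character in text exists in the given Windows code page."""
    try:
        text.encode(f"cp{code_page}")
        return True
    except UnicodeEncodeError:
        return False


# Polish text fits Windows Latin 2 (1250) but not Windows Latin 1 (1252).
print(fits_code_page("Łódź", 1250), fits_code_page("Łódź", 1252))

# Russian text fits Windows Cyrillic (1251) but not Windows Latin 2 (1250).
print(fits_code_page("Москва", 1251), fits_code_page("Москва", 1250))
```

A language appears under a code page only when that code page contains the characters the language needs, which is what the check above approximates.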

Text Direction

These options apply only if one of the selected languages is Chinese - Simplified, Chinese - Traditional, Japanese, or Korean.

Select the orientation of the text on the form. If you select more than one language, or if the output format is set to RTF, HTML, or Microsoft Word, the Horizontal and Vertical options are unavailable and the Autodetect option is used. For best recognition results, select the Horizontal or Vertical option (rather than the Autodetect option) based on the page being recognized.
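The availability rules for the text direction options can be summarized in a short Python sketch. The function direction_options and the two sets are hypothetical names, and the language and format strings are taken from the text above rather than from any product API.

```python
CJK_LANGUAGES = {"Chinese - Simplified", "Chinese - Traditional", "Japanese", "Korean"}
AUTODETECT_ONLY_FORMATS = {"RTF", "HTML", "Microsoft Word"}


def direction_options(selected_languages, output_format):
    """Return the available text direction choices, or [] if the options do not apply."""
    if not any(lang in CJK_LANGUAGES for lang in selected_languages):
        return []  # the options apply only when a listed language is selected
    if len(selected_languages) > 1 or output_format in AUTODETECT_ONLY_FORMATS:
        return ["Autodetect"]  # Horizontal and Vertical are unavailable
    return ["Horizontal", "Vertical", "Autodetect"]


print(direction_options(["Japanese"], "Plain Text"))
print(direction_options(["Japanese", "ENGLISH"], "Plain Text"))
print(direction_options(["Japanese"], "HTML"))
```

Only the first call, with a single language and a format outside the restricted set, leaves all three choices open.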

Available Languages

This is a list of languages supported by the recognition engine. If ANSI is selected, only the languages associated with the current code page are displayed; otherwise, all languages are displayed in the list.
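The filtering behavior can be modeled as follows. This Python sketch uses only three of the code pages from the table above to stay short, so its "all languages" list is far smaller than the real one. The names CODE_PAGE_LANGUAGES and available_languages are hypothetical.

```python
# Excerpt of the code page table; the full table has more code pages and languages.
CODE_PAGE_LANGUAGES = {
    1253: ["GREEK"],
    1254: ["TURKISH"],
    1257: ["ESTONIAN", "LATVIAN", "LITHUANIAN"],
}


def available_languages(encoding, code_page=None):
    """Return the languages to show in the Available list for the current settings."""
    if encoding == "ANSI":
        # Only languages associated with the current code page are shown.
        return list(CODE_PAGE_LANGUAGES.get(code_page, []))
    # For UTF16 and UTF8, every language is shown regardless of code page.
    return sorted(lang for langs in CODE_PAGE_LANGUAGES.values() for lang in langs)


print(available_languages("ANSI", 1257))
print(available_languages("UTF8"))
```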

Selected Languages

This list contains the languages you have selected. Languages are added to the list in the order in which they are selected from the Available list. The language at the top of the list is the primary language, and the rest are secondary languages. You can select up to five languages.

You can also add or remove a language by double-clicking it in the Available or Selected list. User-defined dictionaries are not supported if you select Chinese, Japanese, or Korean as the primary language.
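The rules for the Selected list (selection order, a primary language, and a limit of five) can be modeled with a small Python class. This is a sketch for illustration. The class LanguageSelection and its methods are hypothetical and do not correspond to a product API.

```python
MAX_SELECTED = 5  # the Selected list holds at most five languages


class LanguageSelection:
    """Ordered list of selected languages; the first entry is the primary language."""

    def __init__(self):
        self.selected = []

    def add(self, language):
        """Append a language; return False if it is a duplicate or the list is full."""
        if language in self.selected or len(self.selected) >= MAX_SELECTED:
            return False
        self.selected.append(language)
        return True

    def remove(self, language):
        """Remove a language; return False if it was not selected."""
        if language not in self.selected:
            return False
        self.selected.remove(language)
        return True

    @property
    def primary(self):
        return self.selected[0] if self.selected else None

    @property
    def secondary(self):
        return self.selected[1:]

    def user_dictionaries_supported(self):
        """False only when the primary language is Chinese, Japanese, or Korean."""
        primary = self.primary
        if primary is None:
            return True
        return not primary.startswith(("Chinese", "Japanese", "Korean"))


selection = LanguageSelection()
for lang in ["ENGLISH", "FRENCH", "GERMAN"]:
    selection.add(lang)
print(selection.primary, selection.secondary)
```

Removing the primary language in this model promotes the next language in the list to primary, which follows from the rule that the top entry is always the primary language.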

Add Button

Adds a language to the Selected list. Select a language from the Available list and click Add. You can have a maximum of five languages in the Selected list.

Remove Button

Removes a language from the Selected list. Select a language from the Selected list and click Remove.