Recognition Languages for Advanced and Enhanced OCR Full Text Window
Use this window to select languages for OCR Full Text recognition.
Text Encoding
Select one of the following text encoding methods.
- UTF16 is the native Unicode format, in which each character or symbol in the 16-bit range (U+0000-FFFF) is represented by a two-byte sequence.
- UTF8 is a format that uses a string of bytes to represent a 16-bit Unicode string: ASCII text (U+007F or less) remains unchanged as a single byte; U+0080-07FF (such as accented Latin, Greek, Cyrillic, Hebrew, and Arabic) is converted to a 2-byte sequence; and U+0800-FFFF (such as Chinese, Japanese, and Korean) becomes a 3-byte sequence.
- ANSI is a character set with one byte per symbol. If you select ANSI, you also need to select a code page. The list of available languages changes according to the code page you select.
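The byte counts described for the three encodings can be checked with a short script. This is only an illustrative sketch in Python, not part of the product: the helper name `encoded_lengths` and the sample characters are assumptions, and Windows code page 1252 stands in for "ANSI".

```python
# Sample characters chosen for illustration only (written as escapes).
SAMPLES = {
    "ASCII A (U+0041)": "\u0041",
    "Latin e-acute (U+00E9)": "\u00e9",
    "Cyrillic Zhe (U+0416)": "\u0416",
    "Chinese zhong (U+4E2D)": "\u4e2d",
}

def encoded_lengths(ch):
    """Return (UTF16, UTF8, ANSI) byte lengths for one character.

    ANSI is modeled with code page 1252; the value is None when the
    character cannot be represented in that code page.
    """
    utf16 = len(ch.encode("utf-16-le"))  # little-endian, no byte-order mark
    utf8 = len(ch.encode("utf-8"))
    try:
        ansi = len(ch.encode("cp1252"))
    except UnicodeEncodeError:
        ansi = None
    return utf16, utf8, ansi

for label, ch in SAMPLES.items():
    print(label, encoded_lengths(ch))
```

The `None` results show why an ANSI selection also requires a code page: a single-byte character set can represent only the characters that its code page defines.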
Using an output format that supports only UTF16, such as Rich Text Format, Microsoft Word, or Microsoft Excel, overrides your text encoding selections (including the code page) and automatically sets the text encoding to UTF16.
The following table shows which output formats are supported by the three types of text encoding.
Output Format | UTF16 | UTF8 | ANSI |
---|---|---|---|
Plain Text (.txt) | • | • | • |
Rich Text Format (.rtf) | • | | |
HTML (.mht) | • | • | • |
Microsoft Word (.doc) | • | | |
Comma-Separated Values (.csv) | • | • | • |
Microsoft Excel (.xls) | • | | |
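The override rule and the support table can be summarized as a small lookup. The following is a minimal sketch under stated assumptions: the dictionary is transcribed from the table, while the function name `effective_encoding` and the use of file extensions as keys are hypothetical and do not reflect the product's internal behavior.

```python
# Support matrix transcribed from the output format table.
SUPPORTED_ENCODINGS = {
    "txt": {"UTF16", "UTF8", "ANSI"},
    "rtf": {"UTF16"},
    "mht": {"UTF16", "UTF8", "ANSI"},
    "doc": {"UTF16"},
    "csv": {"UTF16", "UTF8", "ANSI"},
    "xls": {"UTF16"},
}

def effective_encoding(output_format, requested):
    """Return the encoding that is actually used for an output format.

    Formats that support only UTF16 override the requested encoding
    (and any code page that was chosen along with ANSI).
    """
    supported = SUPPORTED_ENCODINGS[output_format]
    if requested in supported:
        return requested
    return "UTF16"

print(effective_encoding("csv", "ANSI"))
print(effective_encoding("rtf", "ANSI"))
```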
The code page drop-down list contains every code page that can be used with the ANSI character set. When ANSI is not selected, the drop-down list is unavailable.
Each code page supports certain languages, as shown in the following table.
Code Page | Supported Languages |
---|---|
[ 1250 WINDOWS LATIN 2 ] | ALBANIAN CROATIAN CZECH HUNGARIAN POLISH ROMANIAN SERBIAN-LATIN SLOVAK SLOVENIAN UZBEK-LATIN |
[ 1251 WINDOWS CYRILLIC ] | AZERI-CYRILLIC BELARUSIAN BULGARIAN KAZAKH MACEDONIAN MONGOL RUSSIAN SERBIAN-CYRILLIC TATAR UKRAINIAN UZBEK-CYRILLIC |
[ 1252 WINDOWS LATIN 1 ] | AFRIKAANS BASQUE BRAZILIAN CATALAN DANISH DUTCH DUTCH BELGIUM ENGLISH FINNISH FRENCH GALICIAN GERMAN GERMAN-LUXEMBOURG GERMAN-NEW-SPELLING ICELANDIC INDONESIAN IRISH ITALIAN MALAY NORWEGIAN-BOKMAL NORWEGIAN-NYNORSK PORTUGUESE SPANISH SWAHILI SWEDISH |
[ 1253 WINDOWS GREEK ] | GREEK |
[ 1254 WINDOWS TURKISH ] | TURKISH |
[ 1257 WINDOWS BALTIC ] | ESTONIAN LATVIAN LITHUANIAN |
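To illustrate why each language is tied to a code page, the sketch below maps a few languages from the table to their code pages and checks whether a sample string can be encoded in a given page. It is an assumption-laden illustration: only a subset of the table is transcribed, the helper names `code_pages_for` and `fits_code_page` are hypothetical, and Python's built-in `cp125x` codecs stand in for the Windows code pages.

```python
# A few entries per code page, transcribed from the table (not the full list).
CODE_PAGE_LANGUAGES = {
    1250: ["CZECH", "HUNGARIAN", "POLISH"],
    1251: ["BULGARIAN", "RUSSIAN", "UKRAINIAN"],
    1252: ["ENGLISH", "FRENCH", "GERMAN"],
    1253: ["GREEK"],
    1254: ["TURKISH"],
    1257: ["ESTONIAN", "LATVIAN", "LITHUANIAN"],
}

def code_pages_for(language):
    """Return the code pages whose language list contains the language."""
    return [cp for cp, langs in CODE_PAGE_LANGUAGES.items() if language in langs]

def fits_code_page(text, code_page):
    """Return True if every character of text exists in the code page."""
    try:
        text.encode(f"cp{code_page}")
        return True
    except UnicodeEncodeError:
        return False

polish_city = "\u0141\u00f3d\u017a"  # a Polish place name with non-ASCII letters
print(code_pages_for("POLISH"))
print(fits_code_page(polish_city, 1250))
print(fits_code_page(polish_city, 1252))
```

A Polish string fits code page 1250 but not 1252, which matches the language groupings in the table.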
Text Direction
These options apply only if one of the selected languages is Chinese - Simplified, Chinese - Traditional, Japanese, or Korean.
Select the orientation of the text on the form. If you select more than one language, or if the output format is set to RTF, HTML, or Microsoft Word, the Horizontal and Vertical options are unavailable and the Autodetect option is used. For best recognition results, select the Horizontal or Vertical option (rather than Autodetect) that matches the text orientation of the page being recognized.
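The availability rules for the text direction options can be expressed as a short decision function. This is a sketch of the rules as stated in this section, not the product's implementation; the function name `text_direction_choices` and the use of file extensions to identify output formats are assumptions.

```python
# Languages for which the Text Direction options apply, as named in the text.
CJK_LANGUAGES = {
    "Chinese - Simplified",
    "Chinese - Traditional",
    "Japanese",
    "Korean",
}

# RTF, HTML, and Microsoft Word force the Autodetect option.
AUTODETECT_ONLY_FORMATS = {"rtf", "mht", "doc"}

def text_direction_choices(selected_languages, output_format):
    """Return the text direction options a user could choose from."""
    if not any(lang in CJK_LANGUAGES for lang in selected_languages):
        return []  # the options do not apply
    if len(selected_languages) > 1 or output_format in AUTODETECT_ONLY_FORMATS:
        return ["Autodetect"]
    return ["Horizontal", "Vertical", "Autodetect"]

print(text_direction_choices(["Japanese"], "txt"))
print(text_direction_choices(["Japanese", "English"], "txt"))
print(text_direction_choices(["Japanese"], "rtf"))
```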
Available Languages
This is a list of languages supported by the recognition engine. If ANSI is selected, only the languages associated with the current code page are displayed; otherwise, all languages are displayed in the list.
Selected Languages
This list contains the languages you have selected. Languages are added to the list in the order in which you select them from the Available Languages list. The language at the top of the list is the primary language, and the rest are secondary languages. You can select up to five languages.
You can add or remove a language by double-clicking it in the Available Languages or Selected Languages list. User-defined dictionaries are not supported if you select Chinese, Japanese, or Korean as the primary language.
Add Button
Adds a language to the Selected list. Select a language from the Available list and click Add. You can have a maximum of five languages in the Selected list.
Remove Button
Removes a language from the Selected list. Select a language from the Selected list and click Remove.
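The behavior of the Selected Languages list (selection order, primary language, and the five-language limit) can be modeled as follows. This is a hypothetical sketch, not the product's code: the class name `SelectedLanguages`, the decision to raise an error at the limit, and the handling of duplicates are all assumptions made for illustration.

```python
MAX_SELECTED = 5  # limit stated in this section
CJK_PREFIXES = ("Chinese", "Japanese", "Korean")

class SelectedLanguages:
    """Ordered list of selected languages; the first entry is the primary."""

    def __init__(self):
        self._items = []

    def add(self, language):
        if language in self._items:
            return  # assumption: adding an already selected language is a no-op
        if len(self._items) >= MAX_SELECTED:
            # assumption: the sketch signals the limit with an exception
            raise ValueError("a maximum of five languages can be selected")
        self._items.append(language)

    def remove(self, language):
        if language in self._items:
            self._items.remove(language)

    @property
    def primary(self):
        return self._items[0] if self._items else None

    @property
    def secondary(self):
        return self._items[1:]

    def primary_is_cjk(self):
        """True when the primary language is Chinese, Japanese, or Korean.

        In that case user-defined dictionaries are not supported.
        """
        return self.primary is not None and self.primary.startswith(CJK_PREFIXES)

selection = SelectedLanguages()
for name in ["Japanese", "English", "French"]:
    selection.add(name)
print(selection.primary, selection.secondary, selection.primary_is_cjk())
```

Removing the first entry promotes the next language to primary, because the primary language is defined only by its position at the top of the list.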