RecoStar recognition trigram modes

Trigrams are combinations of three consecutive letters. Every language has trigrams that occur especially often. For example, common English trigrams include "ing" and "ion."

RecoStar can take advantage of trigrams to enhance recognition accuracy. Trigrams can check and optionally repair combinations of letters that have both a low confidence rating and a low frequency of occurrence.

Consider these examples:


An image that shows how the trigram "ing" is used to improve an extraction result.

In the first case, the image file for "Walking" suffers from drop-outs. In particular, the "n" is badly faded. The recognition engine cannot decide whether it is an "r" followed by an "i" or a single "n," so the character is marked as "rejected" in the initial results. Trigram analysis is then applied to the initial result, and the recognition engine decides that the most likely combination of three letters in this case is "ing."

In the second case, the image file contains substantial noise. Because of this noise, the second "i" in "Dictionary" is interpreted as the letter "l." Trigram analysis shows that "ion" is more likely than "lon," and the word is corrected.

It is important to keep in mind that trigram analysis is a statistical process. RecoStar ships with trigram tables for most supported languages. Each table contains a list of possible three-letter combinations and their frequency of occurrence in that language. Although there are thousands of such combinations, many of them are almost never used, so their frequency of occurrence is near zero.
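
The statistical idea behind the two examples can be illustrated with a small Python sketch. This is not RecoStar's implementation or API. The table `TRIGRAM_FREQ`, its values, the `FLOOR` constant, and the function names are all hypothetical and exist only to show how a frequency table can favor one candidate reading over another.

```python
from math import log

# Hypothetical relative frequencies, for illustration only.
# These are NOT the values shipped in RecoStar's trigram tables.
TRIGRAM_FREQ = {
    "ing": 0.0110,
    "ion": 0.0075,
    "tio": 0.0060,
    "kin": 0.0012,
}

# Assumed frequency for any trigram missing from the table ("near zero").
FLOOR = 1e-7


def trigrams(word):
    """Return every run of three consecutive letters in the word, lowercased."""
    w = word.lower()
    return [w[i:i + 3] for i in range(len(w) - 2)]


def score(word):
    """Mean log-frequency of the word's trigrams; higher means more plausible."""
    grams = trigrams(word)
    if not grams:
        # Words shorter than three letters contain no trigrams.
        return log(FLOOR)
    return sum(log(TRIGRAM_FREQ.get(g, FLOOR)) for g in grams) / len(grams)


def best_candidate(candidates):
    """Pick the candidate reading whose trigrams are most frequent on average."""
    return max(candidates, key=score)


print(best_candidate(["Walkirig", "Walking"]))
print(best_candidate(["Dictlonary", "Dictionary"]))
```

With these assumed values, the readings that contain "ing" and "ion" score higher than the alternatives containing "rig" and "lon." Averaging the log-frequencies keeps candidates of different lengths comparable.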

There may be rare occasions where your data contains many uncommon trigrams. For example, a list of Chicago radio stations might include WGN, WLS, WNVR, WKTAF, WZRD, WBEZ, or WXRT. In such cases, if you notice problems, you should consider disabling trigrams for your recognition profile.
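
One rough way to judge whether your data falls into this category is to measure how many of its trigrams are uncommon. The Python sketch below is only an illustration under stated assumptions: `COMMON_TRIGRAMS` is a tiny hand-picked set of frequent English trigrams, not a RecoStar table, and `rare_trigram_share` is a hypothetical helper, not a product function.

```python
# A small, hand-picked set of frequent English trigrams (illustrative only).
COMMON_TRIGRAMS = {
    "the", "and", "ing", "ion", "tio", "ent", "ati", "for", "her", "ter",
}


def rare_trigram_share(values):
    """Return the fraction of trigrams in the values that are not in the common set."""
    total = 0
    rare = 0
    for value in values:
        w = value.lower()
        for i in range(len(w) - 2):
            total += 1
            if w[i:i + 3] not in COMMON_TRIGRAMS:
                rare += 1
    # Values shorter than three letters contribute no trigrams.
    return rare / total if total else 0.0


stations = ["WGN", "WLS", "WNVR", "WKTAF", "WZRD", "WBEZ", "WXRT"]
words = ["Walking", "Dictionary"]

print(rare_trigram_share(stations))
print(rare_trigram_share(words))
```

None of the trigrams in the station list appear in the common set, so its share of rare trigrams is higher than that of ordinary words. A consistently high share in your own sample data suggests that trigram analysis is unlikely to help.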