Optical Character Recognition (OCR)

To process unstructured documents and locate arbitrary content, the document must be processed by the OCR engine before any extraction method can be applied. The OCR results are stored in a structured representation of the document, which is saved as an .xdc (XDoc) file. All subsequent algorithms operate on this XDoc representation rather than on the original file.

OCR is integrated transparently into Project Builder and Server. At runtime it is performed automatically, but only on demand: a page is processed only when its full-text results are needed. For example, if extraction is restricted to the first page of the document and no classification method requires more than one page, OCR is performed on the first page only.
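
The on-demand behavior described above can be illustrated with a minimal Python sketch. It is not the product's API: the class LazyDocument, the function fake_ocr, and the page file names are hypothetical names chosen for illustration, and the OCR engine is replaced by a stub that only records which pages it was asked to process.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LazyDocument:
    """Hypothetical stand-in for an XDoc: stores OCR text per page, on demand."""
    page_images: List[str]
    ocr_engine: Callable[[str], str]
    _text_cache: Dict[int, str] = field(default_factory=dict)

    def page_text(self, index: int) -> str:
        # Run OCR only the first time this page's full text is requested.
        if index not in self._text_cache:
            self._text_cache[index] = self.ocr_engine(self.page_images[index])
        return self._text_cache[index]

    @property
    def recognized_pages(self) -> List[int]:
        # Indices of the pages that have actually been through OCR.
        return sorted(self._text_cache)


ocr_calls: List[str] = []


def fake_ocr(image: str) -> str:
    # Stub engine (assumption): records the call instead of recognizing text.
    ocr_calls.append(image)
    return f"text of {image}"


doc = LazyDocument(["page1.tif", "page2.tif", "page3.tif"], fake_ocr)

# Extraction restricted to the first page: only that page is processed.
first_page_text = doc.page_text(0)
doc.page_text(0)  # second request is served from the stored result

print(ocr_calls)             # pages sent to the OCR stub
print(doc.recognized_pages)  # page indices with stored OCR results
```

In this sketch the second and third pages never reach the OCR stub, which mirrors the first-page example in the paragraph above. Storing the result per page also means a page is recognized at most once, however many later steps read its text.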