XDocument

Within Kofax Transformation Modules, the extraction of field data is based on a representation of the document called the XDocument, which contains all of the layout elements of the document. Layout elements are text elements (words with their geometry, font, and location) and graphical elements (lines, logos, and textures). The extraction algorithms use this internal representation throughout the entire process.

The XDocument (CscXDocument) object is part of the CscXDocument library and is the core object for representing document information. The XDocument is created during OCR for scanned documents or by a conversion filter for electronic documents (for example, PDF). This standardized representation ensures format independence for the entire process. The XDocument also functions as a container for extraction and analysis results to be used in further processing steps.

The XDocument holds an array of logical structures, called representations, that describe the content of the source file(s). For example, it can contain a representation of several TIFF images on which OCR has been performed. Such a representation consists of pages, text lines, and words. If OCR is performed again with different settings, the XDocument contains a second representation that can have a different structure of lines and words. In short, an XDocument stores all of the accumulated information about a file (or several files) from its creation through all kinds of analysis (for example, OCR and line analysis), classification, and data extraction, up to export or final archiving.
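The relationship between an XDocument and its representations can be pictured with a small data model. The following Python sketch is only an illustration and is not the CscXDocument API: the class names (XDocumentSketch, Representation, RepresentationPage, TextLine, Word), the representation names, and the sample coordinates are all hypothetical. It shows two OCR passes over the same page that produce a different word structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """A recognized word and its bounding box (hypothetical pixel units)."""
    text: str
    left: int
    top: int
    width: int
    height: int


@dataclass
class TextLine:
    words: List[Word] = field(default_factory=list)


@dataclass
class RepresentationPage:
    lines: List[TextLine] = field(default_factory=list)


@dataclass
class Representation:
    """One view of the document content, for example the result of one OCR pass."""
    name: str
    pages: List[RepresentationPage] = field(default_factory=list)

    def word_count(self) -> int:
        # Count the words on every line of every page.
        return sum(len(line.words) for page in self.pages for line in page.lines)


@dataclass
class XDocumentSketch:
    """Illustrative stand-in for an XDocument: a collection of representations."""
    representations: List[Representation] = field(default_factory=list)


# First (hypothetical) OCR pass: the text is split into three words.
first_pass = Representation("ocr_default", [RepresentationPage([TextLine([
    Word("Invoice", 100, 50, 120, 30),
    Word("No.", 230, 50, 50, 30),
    Word("4711", 290, 50, 70, 30),
])])])

# Second pass with different settings: the same text is split into two words.
second_pass = Representation("ocr_alternate", [RepresentationPage([TextLine([
    Word("Invoice", 100, 50, 120, 30),
    Word("No.4711", 230, 50, 130, 30),
])])])

xdoc = XDocumentSketch([first_pass, second_pass])

# Both representations cover one page but differ in their word structure.
summary = [(r.name, len(r.pages), r.word_count()) for r in xdoc.representations]
print(summary)
```

Both representations live side by side in the same container, so a later processing step could choose whichever one suits it without running OCR again.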

The XDocument consists of a collection of representations and a CDocument that defines the page structure of all representations and provides information about the source files. The CDocument represents a compound document. You can define a number of source files or even plain text snippets that you want to use to construct an XDocument. The files can be of different types (for example, TEXT, PDF, TIFF, or Microsoft Word). Whenever a source file is added to the CDocument, its page structure is analyzed, and a Page is created. All of those pages together define the page structure for all representations in an XDocument.
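How a compound document accumulates its page structure can be sketched in the same way. The Python model below is an assumption-laden illustration, not the CDocument API: CompoundDocumentSketch, SourceFile, Page, and add_source_file are hypothetical names, the file names are made up, and the page count is passed in by the caller instead of being derived by analyzing the file. The sketch also assumes that one Page entry is created for each page of a source file.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceFile:
    name: str
    file_type: str  # for example "TIFF" or "PDF"


@dataclass
class Page:
    index: int        # position within the compound document
    source_file: str  # name of the file the page comes from
    source_page: int  # zero-based page number inside that file


@dataclass
class CompoundDocumentSketch:
    """Illustrative stand-in for a CDocument: source files plus their pages."""
    source_files: List[SourceFile] = field(default_factory=list)
    pages: List[Page] = field(default_factory=list)

    def add_source_file(self, name: str, file_type: str, page_count: int) -> None:
        # Assumption: the caller supplies the page count; the real library
        # determines the page structure by analyzing the file itself.
        self.source_files.append(SourceFile(name, file_type))
        for source_page in range(page_count):
            self.pages.append(Page(len(self.pages), name, source_page))


cdoc = CompoundDocumentSketch()
cdoc.add_source_file("scan.tif", "TIFF", page_count=2)
cdoc.add_source_file("letter.pdf", "PDF", page_count=3)

# The combined page list is the structure that every representation shares.
for page in cdoc.pages:
    print(page.index, page.source_file, page.source_page)
```

In this sketch the pages of all source files form one continuous sequence, which mirrors the idea that the pages together define the page structure for every representation in the XDocument.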

For more information, see the Library Model of the XDocument Object and the Library Model of the XDocField Object.