Layout ID and Extraction Online Learning

Trainable group locators use layouts to internally group documents that have similar layouts. For each document of a recognized layout, a layout-specific extraction is run. When you view the Extraction Set in the Documents window, the layout information is shown in the "Layout ID" column. If you sort by "Layout ID", you can see how many documents are collected per layout.
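As a rough illustration of what sorting by "Layout ID" reveals, the following sketch counts documents per Layout ID. It assumes the rows of the Extraction Set have been copied by hand into a simple list. The file names, the layout_id field, and the values are hypothetical and are not part of the product.

```python
from collections import Counter

# Hypothetical rows, as they might be noted down from the Documents window.
# The "file" and "layout_id" keys are invented for this illustration.
documents = [
    {"file": "invoice_001.tif", "layout_id": 1},
    {"file": "invoice_002.tif", "layout_id": 1},
    {"file": "invoice_003.tif", "layout_id": 2},
    {"file": "invoice_004.tif", "layout_id": 1},
]


def count_by_layout(docs):
    """Return a Counter that maps each Layout ID to its number of documents."""
    return Counter(doc["layout_id"] for doc in docs)


counts = count_by_layout(documents)

# Print the counts in Layout ID order, similar to a sorted column view.
for layout_id, number in sorted(counts.items()):
    print(f"Layout ID {layout_id}: {number} document(s)")
```

With the sample rows above, Layout ID 1 has three documents and Layout ID 2 has one.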

The Layout ID is class-based and each document of a different type is given a unique Layout ID. New classes receive a new Layout ID, and existing, recognizable classes receive an existing Layout ID. This means that the same Layout ID can be used for multiple classes.

The number of documents that are collected for one Layout ID varies. It depends on whether the field list is the same for these documents or whether a conflict occurs. When a conflict occurs for a layout, all documents trained after that are collected. Nevertheless, more documents are collected during production than are needed for setting up a knowledge base manually. This is based on the assumption that a document trained in Project Builder is more reliable. To get the same extraction results, you need four documents in the Dynamic Knowledge Base during production.

This means that the simplest use case collects four documents at runtime. When those documents are imported and the project is retrained, only three are required, and one is marked as not needed.

After the project is trained, collected documents that are not required by any of the trainable locators of that class are marked No in the Used column. These documents can be safely deleted.

If at least one locator of that class requires the document, the value in the Used column is Yes.
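The rule for the Used column described above can be sketched as a simple "at least one" check. This is only an illustration of the logic, not how the product implements it. The function name used_flag and the locator names are hypothetical.

```python
def used_flag(required_by):
    """Return "Yes" if at least one locator requires the document, else "No".

    required_by maps a locator name to True or False. The locator names
    used below are invented for this example.
    """
    return "Yes" if any(required_by.values()) else "No"


# One trainable locator of the class still needs this document.
document_a = {"InvoiceGroupLocator": True, "LineItemLocator": False}

# No trainable locator of the class needs this document.
document_b = {"InvoiceGroupLocator": False, "LineItemLocator": False}

print(used_flag(document_a))
print(used_flag(document_b))
```

In this sketch, document_a would be kept because one locator requires it, while document_b is a candidate for deletion.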

The Used column information is also used during production by the Knowledge Base Learning Server to help decide whether a document is added to the New Samples.