Training Sets and Knowledge bases

A collection of training samples is called a Training Set. Each project in Transformation Designer includes one Training Set for Classification/Separation and one for Extraction.

Classification/Separation Set

The Classification/Separation Set usually contains multipage documents. Because learn-by-example classification requires little setup, training the system with a Classification/Separation Set is straightforward.

Classification Training Set

This Classification Set includes the following details.

Filename
Name of the file.
Use
Defines whether to use that sample to build the model and allows the Administrator to exclude documents from the model without having to delete them.
Assigned class
Assigns each sample to a document type to allow TotalAgility to distinguish (classify) documents. In Machine Learning terms, this process is called "labeling the sample."
Classification Result and Confidence
Displays the classification results when you test the documents that are training samples. TotalAgility does not fully trust the samples, because some of them could be incorrectly labeled. For example, if you classify your samples (see TotalAgility Benchmarking for a more sophisticated way of doing that) and a sample in Document Type A is classified as Document Type B, either the sample was incorrectly labeled or you need more samples to distinguish A from B reliably.
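
The review described under Classification Result and Confidence amounts to comparing each sample's assigned class with its classification result. The following Python sketch is illustrative only: the Sample record, its field names, and the 0.8 threshold are assumptions made for this example and are not TotalAgility types, settings, or APIs.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record mirroring the Training Set columns.
    filename: str
    use: bool
    assigned_class: str
    result_class: str
    confidence: float


def samples_to_review(samples, min_confidence=0.8):
    """Return (filename, reason) pairs for samples worth a second look."""
    flagged = []
    for s in samples:
        if not s.use:
            continue  # samples with Use switched off are skipped
        if s.result_class != s.assigned_class:
            flagged.append((s.filename, "result differs from assigned class"))
        elif s.confidence < min_confidence:  # assumed threshold
            flagged.append((s.filename, "low confidence"))
    return flagged


training_set = [
    Sample("invoice_001.tif", True, "Invoice", "Invoice", 0.97),
    Sample("invoice_002.tif", True, "Invoice", "Order", 0.62),
    Sample("order_001.tif", True, "Order", "Order", 0.55),
    Sample("order_002.tif", False, "Order", "Invoice", 0.90),
]

for filename, reason in samples_to_review(training_set):
    print(f"{filename}: {reason}")
```

A flagged sample is not necessarily wrong. As noted above, a mismatch may mean that the label is incorrect or that more samples are needed to separate the two document types.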

Extraction Set

Teaching TotalAgility to extract data from documents involves providing samples for the document type and then using a point and click interface to show TotalAgility the location of the data that you want it to extract. Based on this information, TotalAgility learns to extract the desired data.

Extraction Training Set

This Extraction Set includes the following details.

Filename
Name of the file.
Use
Defines whether to use that sample to build the model and allows the Administrator to exclude documents from the model without having to delete them.
Trained
Indicates whether a sample document was used to create the model. When the Training Set contains many samples that provide no new information, TotalAgility ignores the redundant ones, and the Trained flag shows which samples were ignored. You can delete those samples without affecting the model, which keeps the Training Set small.
Layout ID
The ID of the internal class that specific extraction (see Specific versus generic learning) creates to identify the layout later. Samples with the same ID belong to the same internal class. Typically, TotalAgility uses three samples per Layout ID; the rest are assigned the "Not trained" flag and can be deleted.
Conflicts
Indicates whether the sample conflicts with another sample. Two documents conflict with each other if they have the same Layout ID but the same field has been trained differently in each document. See Resolve conflicts.
Validation Information
At the locator level, some locators have validation rules that verify that the trained and extracted data is correct. If you train a field with data that violates such a rule, a warning is shown in this column. This information helps you identify whether the wrong data was trained into the field.
Classification Result and Confidence
See Classification Result and Confidence under Classification/Separation Set.
Assigned Class
The document type assigned to the sample.
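
The conflict rule described under Conflicts can be expressed as a pairwise comparison within each Layout ID. The sketch below is a simplified illustration, not how TotalAgility detects conflicts internally: the tuple layout, the field names, and the use of exact region equality to mean "trained differently" are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def find_conflicts(samples):
    """Find fields trained differently within the same Layout ID.

    samples: iterable of (filename, layout_id, fields), where fields maps a
    field name to its trained region, here a hypothetical (x, y, w, h) tuple.
    Returns a sorted list of (file_a, file_b, field_name) tuples.
    """
    by_layout = defaultdict(list)
    for filename, layout_id, fields in samples:
        by_layout[layout_id].append((filename, fields))

    conflicts = []
    for members in by_layout.values():
        # Compare every pair of samples that share a Layout ID.
        for (file_a, fields_a), (file_b, fields_b) in combinations(members, 2):
            for name in fields_a.keys() & fields_b.keys():
                if fields_a[name] != fields_b[name]:  # assumed comparison
                    conflicts.append((file_a, file_b, name))
    return sorted(conflicts)


samples = [
    ("a.tif", 1, {"InvoiceNumber": (400, 50, 120, 20), "Total": (420, 700, 80, 20)}),
    ("b.tif", 1, {"InvoiceNumber": (400, 50, 120, 20), "Total": (60, 700, 80, 20)}),
    ("c.tif", 2, {"Total": (60, 700, 80, 20)}),
]

for file_a, file_b, field in find_conflicts(samples):
    print(f"{file_a} and {file_b} conflict on {field}")
```

In this sample data, c.tif has a different Layout ID, so its Total field is never compared with the other two documents.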

Knowledge bases

A knowledge base is the TotalAgility name for the Machine Learning model. The knowledge base is generated from the sample documents when you train the model. Therefore, the knowledge base/model contains the knowledge learned from these samples, but not the samples themselves.

Whenever you "train" a sample, TotalAgility creates an internal knowledge base, and only this knowledge base is used at runtime. The advantage is that the samples, of which there can be many, do not need to be deployed to the production system, and the confidential information in the original sample files is not compromised because the files never need to be published on other systems.

In addition to this internal knowledge base/model, you can also generate exportable extraction knowledge bases using the trainable locators. These knowledge bases can be password-protected. For example, as a partner, you could create a knowledge base that extracts data from certain document types that you trained it on, and then sell or give this knowledge base to customers, without having to share the sample documents.

Execution order

A TotalAgility project (Extraction Group or Shared Project) that uses Machine Learning for extraction can include multiple models. You can generate two internal knowledge bases when you "train" the system: one for generic extraction and the other for specific extraction. A TotalAgility project can also include knowledge bases that you create or import into a locator. Finally, it can include an additional internal knowledge base that is created during Online Learning when the system creates another dynamic model by training samples that were flagged by the operators.

When extracting documents, these models are executed in the following order:

  • If TotalAgility finds a confident result at one of these steps, execution stops.

  • TotalAgility executes the specific knowledge bases first. A confident result at this step means that TotalAgility has recognized the layout and knows where the data is located. Otherwise, it tries the more generic knowledge bases.

  • TotalAgility executes the internal knowledge base for the project before the Online Learning knowledge base, because the internal knowledge base is created and reviewed by an Administrator in Transformation Designer. It is usually more reliable than the Online Learning knowledge base, which operators create indirectly and which may contain mistakes or conflicting training that has not been reviewed.
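
The rules above describe a cascade: the knowledge bases are tried in order, and the first confident result ends execution. The following Python sketch illustrates the idea only. The extract function, the stand-in knowledge bases, the 0.8 threshold, and the exact ordering of the three entries are assumptions chosen to be consistent with the rules above; they are not TotalAgility APIs or settings.

```python
def extract(document, knowledge_bases, min_confidence=0.8):
    """Try each knowledge base in order and stop at the first confident result.

    knowledge_bases: list of (name, callable) pairs; each callable takes the
    document and returns (value, confidence). All names are hypothetical.
    """
    for name, knowledge_base in knowledge_bases:
        value, confidence = knowledge_base(document)
        if value is not None and confidence >= min_confidence:
            return name, value, confidence  # later knowledge bases are skipped
    return None  # no knowledge base produced a confident result


# Stand-in knowledge bases, ordered from most to least trusted (assumed order).
knowledge_bases = [
    ("internal specific", lambda doc: (None, 0.0)),        # layout not recognized
    ("internal generic", lambda doc: ("INV-1001", 0.91)),  # confident result
    ("online learning", lambda doc: ("INV-1O01", 0.99)),   # not reached
]

result = extract({"file": "invoice.tif"}, knowledge_bases)
print(result)
```

Because the generic knowledge base returns a confident result in this example, the Online Learning stand-in is never called, even though it would have reported a higher confidence.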