Key steps for creating a Machine Learning project

The following are typical steps for creating a Machine Learning project in TotalAgility:

  1. Define the document types and fields.

  2. Provide samples and teach the system what document type they are, or show the system where the fields are on those samples.

    The first and second steps are pre-training steps. They allow the system to classify documents and extract data successfully even before the first document goes into production.

  3. Train the system. You can create a model from the training samples with the click of a button. This model (the “knowledge” learned from the samples, not the samples themselves) is used in production.

The result is a production-ready project that you can use out of the box.

You can even skip steps 2 and 3 and rely entirely on Online Learning without pre-training the system.

See the tutorials for Classification and Extraction.

Online Learning is the process of learning during production from the corrections made by the operators. If you skip steps 2 and 3 above, operators initially see 0% classification and extraction accuracy and then need to manually classify and key the data. TotalAgility learns from those initial and subsequent changes and improves (re-trains) the model behind the scenes. Over time, fewer documents need to be manually classified and less data needs to be manually keyed in.

During Online Learning, two important steps occur in the backend:

  1. TotalAgility adds the documents to the model, rebuilds it, and applies the new and improved model as quickly as possible, often as early as the next job being processed.

  2. TotalAgility stores a copy of the training document so it can later be downloaded into the project by an Administrator.
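The two backend steps above can be pictured with a small sketch. This is a toy illustration only, not TotalAgility's implementation or API: the class `ToyOnlineClassifier`, its word-count "model", and the method names are all hypothetical. It shows a model that knows nothing at first, is updated from an operator correction, and keeps a copy of the corrected sample for later review.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class ToyOnlineClassifier:
    """Hypothetical stand-in for a model that learns from operator corrections."""

    # Copies of corrected documents, kept so an administrator can review them later.
    stored_samples: list = field(default_factory=list)
    # The "knowledge": for each word, how often it appeared in each document type.
    word_votes: dict = field(default_factory=lambda: defaultdict(Counter))

    def predict(self, text):
        """Return the best-matching document type, or None if nothing is known yet."""
        votes = Counter()
        for word in text.lower().split():
            votes.update(self.word_votes.get(word, {}))
        return votes.most_common(1)[0][0] if votes else None

    def learn_from_correction(self, text, doc_type):
        """Update the model from one operator correction and store the sample."""
        # Step 1: add the document to the model (the update is effective immediately).
        for word in set(text.lower().split()):
            self.word_votes[word][doc_type] += 1
        # Step 2: keep a copy of the training document.
        self.stored_samples.append((text, doc_type))


model = ToyOnlineClassifier()
before = model.predict("Invoice number 1001 total due")  # nothing learned yet
model.learn_from_correction("Invoice number 1001 total due", "Invoice")
after = model.predict("Invoice number 1002 total due")   # uses the updated model
```

In this sketch, `before` is `None` because no samples exist, which mirrors the 0% starting accuracy when pre-training is skipped, while `after` is `"Invoice"` because the next document benefits from the correction.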

The main difference between pre-training and Online Learning is in who does it and why.

The aim of the Administrator who pre-trains the system is to get good training data into the model.

The aim of the operators, who trigger Online Learning indirectly, is to get each document valid and correct quickly, with very little emphasis on overall accuracy. As a result, they might make mistakes such as training documents with very bad image quality, picking the due date instead of the invoice date, or confusing the system by selecting a value in different locations because it is printed on the document multiple times.

TotalAgility deals with these different motivations in various ways. For example, on a specific document type, if one operator makes a mistake while others do not, TotalAgility ignores the outlier data when building its models during Online Learning.

If the operators provide "conflicting" data, TotalAgility flags the conflicts. For example, the invoice date may be printed twice on a document and operators may select different occurrences, or the number of operators who make mistakes may be the same as the number of operators who do not make mistakes. These conflicts must be resolved by an Administrator, and the model must be maintained periodically; otherwise, the model deteriorates slightly over time.
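The difference between an outlier that can be ignored and a conflict that needs an Administrator can be sketched as a simple majority vote. This is an assumption made for illustration; the actual rules TotalAgility applies are not described here, and the function `resolve_field_selections` and the location labels are hypothetical.

```python
from collections import Counter


def resolve_field_selections(selections):
    """Decide what to learn from operator selections for one field.

    `selections` is a list of locations chosen by operators for the same
    field on the same document type. Returns a (status, value) tuple.
    """
    if not selections:
        return ("no_data", None)

    ranked = Counter(selections).most_common()

    # A tie between the top choices cannot be settled automatically:
    # flag it so that an administrator can resolve it.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ("conflict", None)

    # A clear majority wins; minority selections are treated as outliers.
    return ("accepted", ranked[0][0])


# One operator picked the wrong location; the majority choice is kept.
majority = resolve_field_selections(["top_right", "top_right", "footer"])

# Equal numbers of operators chose each location; the result is flagged.
tie = resolve_field_selections(["top_right", "footer"])
```

Here `majority` is `("accepted", "top_right")` and `tie` is `("conflict", None)`.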

We recommend that you initially perform "Online Learning maintenance" weekly and then less frequently over time as the project and the models mature. See Downloading and maintaining new samples and Advanced knowledge for more information on Online Learning maintenance.