Suppose you have various types of documents that would make up three different classes and you want to teach the characteristics of these documents to the system. Since you also want to test how well the training process performed, at first you only feed a small portion of the full document set to the system, which you can expand later on, if needed. As the documents are very unlikely to fall outside of the three defined classes, it is recommended to create a Contained project.
- Create a new project by selecting .
Set project specifics in the New Project dialog (listed in order of importance):
- Select Contained project.
- Specify English as language for the project.
- Select project location.
- Provide a name for the project.
- Click OK when you are done.
- There are three different methods for creating a class for your project. Create the first class by selecting Training set in Project Explorer and clicking Add class(es) by folder(s).
- Create the second class by clicking the Create a new class button and renaming the new class. With the renamed class selected in Project Explorer, click the Add training documents button, navigate to the folder containing your files, select those you want to add to the class, and click OK.
- Create the third class by clicking the Create a new class button and renaming the new class (as in step 4). This time, instead of using the Add training documents button, select all the files you want to add to the class in your file explorer and drag and drop them onto the class in Main panel.
- You can load entire folders for classes; there is no need to select each file one by one in the Windows File Open dialog. It is usually faster to remove redundant files afterwards than to select the required ones beforehand: select multiple files with Shift + left click (range selection) or Ctrl + left click (individual selection), right-click in Main panel, and select Remove. This does not delete any documents from their physical location; it only removes them from the selected classes within the application.
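The folder-per-class idea behind Add class(es) by folder(s) can be illustrated outside the application with a short Python sketch. It is not the application's own implementation: the function names and the assumption that each subfolder name becomes a class name are hypothetical and serve only to show the layout. It also shows why removing a document from a class listing leaves the file on disk untouched.

```python
from pathlib import Path


def classes_from_folders(root):
    """Map each immediate subfolder name (assumed class name) to its files."""
    classes = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            # Only regular files directly inside the folder count as documents.
            classes[folder.name] = sorted(p for p in folder.iterdir() if p.is_file())
    return classes


def remove_from_class(classes, class_name, unwanted):
    """Drop documents from a class listing; the files on disk are not deleted."""
    unwanted = set(unwanted)
    classes[class_name] = [p for p in classes[class_name] if p not in unwanted]
```

With three subfolders under one root, `classes_from_folders(root)` returns a dictionary with three entries, one per class, which mirrors loading whole folders first and trimming redundant files afterwards.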
Train your training documents by clicking the Train button above Main panel.
Bear in mind that if you make any changes in the Training set, for example deleting or moving a document in a class, you need to re-train your document set. The Test all test documents and Test selection only buttons become inactive to signal that a new training process is required.
After the training is complete, you need to decide whether it delivers the desired result. For this you have to create test document sets and test them.
- Navigate to Project Explorer, click the Add documents button and select the test documents you want to add to the Test set.
- Repeat the previous step for the second and third class as well. If you are not required to train or test from any particular set of documents, but only want to split the available documents at a certain ratio (for example, 30%-70%), it is much faster to load all of them as training documents and use random select to move the desired amount to the test documents. To do this, right-click anywhere in the training document list in Main panel, choose Random select, and adjust the -/+ slider to determine how many of your training files you want to move over to the Test set. Once the desired amount is set, select Move to my test set from the same right-click menu.
- Test your test documents by clicking the Test all test documents button above Main panel.
- Check Test results section in Project Explorer to see the outcome.
- Check the Total error and Correct sections in Total statistics. Adjust the Confidence threshold slider to achieve the fewest LowConfidence / Misclassified values in the Outcome column.
- If you have a significant number of Misclassified results, select some of these documents and move them back to their training class. After moving them, re-train and re-test your project to see whether the Total statistics table improves.
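The random split and the effect of the Confidence threshold described in the steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the application's actual logic: the function names are hypothetical, and the rule that a result below the threshold counts as LowConfidence regardless of the predicted class is an assumption made for the example.

```python
import random


def random_split(documents, test_fraction, seed=None):
    """Randomly move a fraction of the documents to a test set.

    Returns (training_documents, test_documents).
    """
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def outcome(true_class, predicted_class, confidence, threshold):
    """Label one test result (assumed rule, see the lead-in)."""
    if confidence < threshold:
        return "LowConfidence"
    return "Correct" if predicted_class == true_class else "Misclassified"


def tally(results, threshold):
    """Count outcomes for (true_class, predicted_class, confidence) tuples."""
    counts = {"Correct": 0, "LowConfidence": 0, "Misclassified": 0}
    for true_class, predicted_class, confidence in results:
        counts[outcome(true_class, predicted_class, confidence, threshold)] += 1
    return counts
```

Calling `tally` on the same results with several threshold values shows the trade-off the slider controls: under the assumed rule, raising the threshold turns some Misclassified results into LowConfidence ones, but it can also do the same to results that were Correct.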