Benchmark document sets

A benchmark set is a group of documents that can be used for separation, classification, and extraction benchmarks. These documents are usually a subset of the documents used for testing your project and do not include poorly scanned or hard-to-read documents. You cannot add a benchmark set directly, but you can convert a test set to a benchmark set. A benchmark set differs from a test set in that benchmark documents can have class assignments, which is what makes them suitable for separation, classification, and extraction benchmarks.

Important: When a test set is converted to a benchmark set, a reference to it is automatically attached to the project. The next time that project is opened, the benchmark set and all of its documents are displayed in the Documents window. Similarly, if a benchmark set is converted back to a test set, the reference to the test set is removed from the project. The next time the project is opened, the test set is listed in the Recent Documents list but is not displayed in the Documents window.

For classification benchmarks, your document benchmark set needs the following:

  • Recognition results if you are using content classification

  • An assigned class

For separation benchmarks, your document benchmark set needs the following:

  • Recognition results if you are using content classification

  • An assigned class

  • No subfolders under the Root Folder when displayed in the Hierarchy View

For extraction benchmarks, your document benchmark set needs the following:

  • Recognition results

  • An assigned class

  • Extraction results

  • Validated extraction results
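
The three requirement lists above can be summarized as a simple checklist. The following Python sketch is illustrative only and is not part of the product: the BenchmarkDocument record, its field names, and the missing_requirements function are hypothetical names chosen for this example, and the product does not expose them as an API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkDocument:
    """Hypothetical record of what a benchmark document carries."""
    name: str
    assigned_class: Optional[str] = None
    has_recognition_results: bool = False
    has_extraction_results: bool = False
    extraction_validated: bool = False


def missing_requirements(
    doc: BenchmarkDocument,
    benchmark: str,
    uses_content_classification: bool = True,
    root_has_subfolders: bool = False,
) -> List[str]:
    """Return the requirements a document still lacks for a benchmark type."""
    if benchmark not in ("classification", "separation", "extraction"):
        raise ValueError(f"unknown benchmark type: {benchmark}")

    missing = []

    # Every benchmark type needs an assigned class.
    if doc.assigned_class is None:
        missing.append("assigned class")

    # Extraction always needs recognition results; classification and
    # separation need them only when content classification is used.
    needs_recognition = benchmark == "extraction" or uses_content_classification
    if needs_recognition and not doc.has_recognition_results:
        missing.append("recognition results")

    # Separation requires a flat set: no subfolders under the Root Folder.
    if benchmark == "separation" and root_has_subfolders:
        missing.append("no subfolders under the Root Folder")

    # Extraction additionally needs extraction results that were validated.
    if benchmark == "extraction":
        if not doc.has_extraction_results:
            missing.append("extraction results")
        if not doc.extraction_validated:
            missing.append("validated extraction results")

    return missing
```

For example, a document with an assigned class and recognition results passes the classification checklist, but the same document is still missing both extraction items for an extraction benchmark.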

A processed benchmark set is often referred to as a set of golden files.

Keep the following in mind when selecting the golden files to add to your benchmark set:

  • Separation benchmarks do not support PDF documents, so do not include PDF documents in your benchmark set if you are testing separation.

  • Documents with multiple pages need to be combined into a single image file. This simulates separation in production.

  • Any rotated documents should be correctly aligned.

  • Select typical documents rather than obscure examples for your project classes.

  • Select clean documents rather than those with blotches or dark areas that could interfere with recognition results.

  • All documents used need to belong to one of the project classes.
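
The selection guidelines above can also be applied as a pre-check before documents are added to a set. The sketch below is a minimal illustration under stated assumptions: the Candidate record, its fields (correctly_aligned and clean_scan are judgments you record yourself), and the rejection_reasons function are hypothetical and do not correspond to any product API.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import List, Optional, Set


@dataclass
class Candidate:
    """Hypothetical description of a document considered as a golden file."""
    path: str
    document_class: Optional[str]
    correctly_aligned: bool = True   # False if the scan is rotated
    clean_scan: bool = True          # False if blotches or dark areas exist


def rejection_reasons(
    candidate: Candidate,
    project_classes: Set[str],
    testing_separation: bool,
) -> List[str]:
    """Return the reasons a candidate should not be used as a golden file."""
    reasons = []

    # PDF documents are excluded only when separation is being tested.
    is_pdf = PurePath(candidate.path).suffix.lower() == ".pdf"
    if testing_separation and is_pdf:
        reasons.append("PDF not supported for separation benchmarks")

    # Every document must belong to one of the project classes.
    if candidate.document_class not in project_classes:
        reasons.append("does not belong to a project class")

    # Rotated documents should be aligned before they are added.
    if not candidate.correctly_aligned:
        reasons.append("rotated; align before adding")

    # Prefer clean scans that will not interfere with recognition.
    if not candidate.clean_scan:
        reasons.append("scan has blotches or dark areas")

    return reasons
```

An empty list means the candidate passes these checks. Whether a document is typical for its class still requires human judgment and is not modeled here.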

Golden files are used for separation benchmarks, classification benchmarks, and extraction benchmarks. You can: