Configure and perform clustering

Clustering takes a set of unclustered documents and groups them based on their layout, their content, or both. Several configuration options are available, depending on how your project is configured and what you want to do with the documents once they are clustered. The steps below explain when to select the various options on the Configure tab of the Clustering window.

You can organize a group of unknown documents by following these steps:

  1. Open the Documents window if it is not already open.
  2. Open or select the document set and subset that contains the unsorted documents.
    Tip Clustering is supported only for test sets, benchmark sets, and training sets.

    The documents in the selected document subset are displayed in the selected view.

  3. Right-click on the document subset and click Clustering.

    The Clustering window is displayed.

  4. In the General group, configure the following options:
    1. If your project has classes configured, select Use project class names.

      Selecting this option means that the class names are available as cluster labels during the clustering process. You can also add new cluster labels as needed.

    2. Optionally, select Cluster documents with no assigned class only if the selected document set has been used for other testing.

      Selecting this option ensures that you do not slow down the clustering process by including documents that are already in the Classification Set. If the number of documents loaded for clustering differs from the number of documents selected, this option may be the cause. Clearing this option overwrites any existing classification results.

    3. From the Are most of the documents unstructured? list, select a clustering method.
      Tip If your documents are mainly unstructured documents, select Yes. Otherwise select No.
    4. If you selected Yes for the Are most of the documents unstructured? option, modify the Minimum cluster size value.

      The value for this option indicates the minimum number of documents required for a cluster to be valid. Increase this number if you know that each cluster has many documents.

    5. If you selected Yes for the Are most of the documents unstructured? option, select Recognition on-demand if there is a chance that some of the documents are missing recognition results.

      Selecting this option means that if a document is missing recognition results, they are generated on demand. If many documents lack recognition results, selecting this option can slow down the initial clustering step.

    6. If you selected No for the Are most of the documents unstructured? option, select a Minimum confidence value.
  5. Click Start Clustering.

    A progress bar is displayed showing the clustering process. When clustering is finished, the Identify tab is displayed so that you can continue processing the documents.

  6. Identify the documents by assigning them to a cluster and ensuring that the cluster has a label.

    Three documents are displayed. Depending on the configuration, you can assign a displayed document to a cluster by entering a new cluster label or by selecting an existing label. The documents displayed on this tab differ each time you click Continue Clustering. After several iterations, the clustering process asks you to confirm suggested clusters.

    1. For each displayed document, enter or select a cluster label from the Assign a cluster list.

      The document is confirmed and assigned to the labeled cluster. If any other documents are assigned to this same cluster using layout clustering, these are confirmed automatically. Documents using content clustering require manual confirmation in a later step.

    2. For each displayed document, click Confirm if the suggested cluster is correct. If the suggested cluster is incorrect, click Re-Assign to change the cluster.
    3. Check the Statistics regularly to monitor your progress.
    4. Optionally, select a Filter to view documents in one of the following categories:
      • No filter

      • Unconfirmed documents for labeled clusters

      • Documents in unlabeled clusters

      • Unclustered documents

    5. Click Continue Clustering to continue processing the documents at each step.

      More documents are displayed until all documents are clustered. When clustering is finished, you are asked if you want to continue and the Review tab is displayed.

  7. Review the clustered documents.

    All of the documents in the selected document subset are displayed along with their cluster and status. This enables you to get an overall picture of your clusters. You also have another opportunity to make any changes before assigning a document to a cluster.

    1. For any documents in an unlabeled cluster, select a Cluster Label from the list.

      When a label is applied to a cluster, all documents in that cluster are assigned that same label. If the documents are clustered using layout clustering, this is done automatically. For documents clustered using content clustering, manual confirmation is required.

    2. If you have added a label for any unlabeled clusters, click Update Clustering.

      Clustering is performed and the documents that belong to the newly labeled cluster are updated accordingly.

    3. For any unclustered documents, select a Cluster Label from the list.
    4. For any unconfirmed documents, click Confirm if they are in the correct cluster. If they are in the wrong cluster, click Reassign to assign them to the correct cluster.
    5. Once all documents are confirmed, assigned to a cluster, and all clusters have a label, click Update Clustering.

      The Assign tab is loaded.

  8. Choose which of the clustered documents are assigned a cluster.
    1. Optionally, if you do not have your project hierarchy configured already, select Create project classes from cluster labels.

      This is an easy way to quickly configure a project hierarchy.

      Note This option is disabled if your project is read-only.
    2. Specify a Minimum content confidence.

      Documents with a content clustering confidence below this value are not included in the assign step.

    3. Specify a Minimum layout confidence.

      Documents with a layout clustering confidence below this value are not included in the assign step.

    4. Select Confirmed documents to include all confirmed documents in the assign step.
    5. Select Unconfirmed documents to include all unconfirmed documents in the assign step.

      If you select this option, the unconfirmed cluster is assigned as the classification result, assuming it meets the minimum confidence criteria.

  9. Click OK.

    The Clustering window closes and the selected document subset is updated to reflect the cluster labels as their classification result. If Create project classes from cluster labels is selected, the Project Tree is updated with the new classes.
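
The selection rules of the assign step (minimum content confidence, minimum layout confidence, and the Confirmed documents and Unconfirmed documents options) can be illustrated with a short sketch. This is not the product's implementation or API. The `ClusteredDocument` type, the `select_for_assignment` function, the 0.0 to 1.0 confidence scale, and the sample file names are all hypothetical and exist only to show how the options might combine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClusteredDocument:
    """Hypothetical record for one document after clustering."""
    name: str
    cluster_label: Optional[str]  # None for unclustered or unlabeled documents
    method: str                   # "layout" or "content"
    confidence: float             # assumed scale: 0.0 to 1.0
    confirmed: bool


def select_for_assignment(
    docs: List[ClusteredDocument],
    min_content_confidence: float,
    min_layout_confidence: float,
    include_confirmed: bool = True,
    include_unconfirmed: bool = False,
) -> List[ClusteredDocument]:
    """Return the documents that would receive their cluster label
    as a classification result, under the assumptions stated above."""
    selected = []
    for doc in docs:
        # A document without a cluster label has nothing to assign.
        if doc.cluster_label is None:
            continue
        # Pick the threshold that matches how the document was clustered.
        if doc.method == "layout":
            threshold = min_layout_confidence
        else:
            threshold = min_content_confidence
        # Documents below the minimum confidence are left out.
        if doc.confidence < threshold:
            continue
        # Apply the Confirmed / Unconfirmed documents options.
        if doc.confirmed and include_confirmed:
            selected.append(doc)
        elif not doc.confirmed and include_unconfirmed:
            selected.append(doc)
    return selected


# Hypothetical sample data.
sample = [
    ClusteredDocument("a.tif", "Invoice", "layout", 0.9, True),
    ClusteredDocument("b.tif", "Invoice", "layout", 0.5, True),
    ClusteredDocument("c.tif", "Letter", "content", 0.7, False),
    ClusteredDocument("d.tif", None, "content", 0.95, False),
]

for doc in select_for_assignment(sample, 0.6, 0.8, include_unconfirmed=True):
    print(doc.name, "->", doc.cluster_label)
```

In this sketch, an unconfirmed document is included only when the Unconfirmed documents option is selected and its confidence meets the minimum for its clustering method, which mirrors the behavior described in step 8.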