Paragraph detection

Paragraph detection is performed as part of recognition and takes advantage of character recognition. The detected paragraphs can then be used by a Classification Locator or in a script.

For example, suppose you process large legal contracts and need to check whether each one contains an indemnification clause, an intellectual property clause, and so on. Rather than reading the entire document, you can use paragraph detection and a Classification Locator to classify the paragraphs and then point your users to the clause that was classified. You can do this by populating a field with a single paragraph or by populating a table with several paragraphs, whichever method meets your needs. The operator can then quickly access a clause or find any documents where the clause is not detected.

Similarly, suppose you receive a document that the government issues daily and that contains a list of newly registered companies. When you process the document, you need to place each new company and its corresponding address into a database. The document is structured as a very dense 6-column PDF where each company is a small paragraph. You can use paragraph detection in a script to populate your database.
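
The database step in this scenario can be sketched as follows. This is a minimal illustration in Python, not the product's script language or object model. It assumes the detected paragraphs are already available as plain strings, that the first line of each paragraph is the company name and the remaining lines are the address, and that a SQLite table named companies is an acceptable stand-in for your database. The function names split_company and store_companies are hypothetical.

```python
import sqlite3


def split_company(paragraph):
    """Split one paragraph into (name, address).

    Assumption: the first non-empty line is the company name and the
    remaining lines form the address. Returns None for blank paragraphs.
    """
    lines = [ln.strip() for ln in paragraph.splitlines() if ln.strip()]
    if not lines:
        return None
    return lines[0], ", ".join(lines[1:])


def store_companies(paragraphs, conn):
    """Insert one row per non-blank paragraph; return the number of rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS companies (name TEXT, address TEXT)"
    )
    rows = [row for row in map(split_company, paragraphs) if row is not None]
    conn.executemany(
        "INSERT INTO companies (name, address) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


# Usage with an in-memory database and two made-up paragraphs.
conn = sqlite3.connect(":memory:")
added = store_companies(
    ["Acme Ltd\n1 Main St\nSpringfield", "Beta GmbH\n2 High Rd"], conn
)
```

In a real project, the rule for splitting a paragraph into name and address depends on how the issuing authority lays out each entry, so treat split_company as a placeholder for your own logic.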

Sentences are typically grouped into paragraphs by indenting the first sentence in a paragraph, by adding vertical white space between paragraphs, or by numbering paragraphs. Paragraph detection relies on these conventions, but it can also detect paragraphs in other situations, such as the following.

  • When a document has multiple columns.

  • When a paragraph contains different fonts and sizes of text.

  • When a paragraph contains text with various font effects, such as bold, italic, superscripts and subscripts, text colors, and background text colors.

  • When there are images embedded in the text flow.

  • When a paragraph includes a numbered or bulleted list.

Paragraph detection works best when a document is text-based, without images, tables, or other content that breaks up the document. The quality of the paragraph detection results decreases as the complexity of the document layout increases.
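
To make the typical cues concrete (first-line indent, vertical white space, and numbering), the following Python sketch groups recognized text lines into paragraphs. It is an illustrative heuristic only, not the algorithm the product uses. The Line type, the group_paragraphs function, and the gap_factor and indent_min thresholds are assumptions made for this example, and the sketch handles a single column only.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left: float    # x-coordinate of the first character
    top: float     # y-coordinate of the top of the line
    height: float  # line height, in the same units as top


# A leading number ("1.", "2)") or a bullet character marks a new item.
LIST_MARKER = re.compile(r"^\s*(\d+[.)]|[•*-])\s+")


def group_paragraphs(lines, gap_factor=0.5, indent_min=10.0):
    """Group lines (in reading order) into paragraph strings.

    A new paragraph starts when the vertical gap to the previous line
    exceeds gap_factor * line height, when the line is indented by at
    least indent_min relative to the left margin, or when the line
    begins with a number or bullet.
    """
    if not lines:
        return []
    margin = min(line.left for line in lines)  # left edge of the column
    paragraphs, current = [], []
    for line in lines:
        if current:
            prev = current[-1]
            gap = line.top - (prev.top + prev.height)
            starts_new = (
                gap > gap_factor * prev.height
                or line.left - margin >= indent_min
                or LIST_MARKER.match(line.text) is not None
            )
            if starts_new:
                paragraphs.append(current)
                current = []
        current.append(line)
    paragraphs.append(current)
    return [" ".join(l.text.strip() for l in p) for p in paragraphs]


sample = [
    Line("First paragraph starts here", 20, 0, 10),
    Line("and continues.", 0, 12, 10),
    Line("Second paragraph after a gap.", 0, 40, 10),
    Line("1. Numbered item", 0, 52, 10),
]
result = group_paragraphs(sample)
```

A heuristic like this also shows why layout complexity matters: an embedded image or table changes the gaps and margins that the rules depend on, so the thresholds stop matching the page.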

In addition, there are a few other known issues with paragraph detection.

  • Right-to-left languages are not currently supported.

  • Text files are not supported.

  • Since paragraph detection depends on the quality of character recognition, the following issues may occur when there are mistakes in the recognition results.

    • A word is not recognized correctly, so a paragraph may be split into several unwanted paragraphs, depending on the document layout.

    • A bullet is not recognized, so a bullet list item may be recognized as part of the previous paragraph.

    • A piece of noise is recognized as a bullet, so an extra paragraph is detected.

  • Paragraph detection is specialized to work with machine-printed fonts. As a result, handwritten text may not be detected reliably, because its symbols have different proportions and shapes.