Extract Content from HTML

Design Studio has six step actions for extracting content from a tag in an HTML page:

  • The Extract action is used to extract text content from the tag, optionally including the HTML tags.
  • The Extract URL action is used to extract a URL from a tag attribute and make that URL absolute.
  • The Extract Tag Attribute action is used to extract the value of a tag attribute.
  • The Extract Target action is used to extract binary data. It is typically used for images and PDF files, but it handles any kind of binary data.
  • The Extract Form Parameter action is used to extract a form parameter from a form URL in the found tag and then store its value in a variable.
  • The Extract Selected Option action is used to extract the selected option from a <select> tag and then store it in a variable.
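
The actions above are configured in Design Studio, not written as code. As a rough conceptual illustration only, the following Python sketch performs comparable operations with the standard library. The page URL, the markup, and the Collector class are hypothetical and are not part of Design Studio.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, parse_qs

# Hypothetical page URL and markup, used only for illustration.
PAGE_URL = "https://example.com/catalog/index.html"
HTML = """
<a id="next" href="../search?page=2&amp;sort=price">Next <b>page</b></a>
<select id="size">
  <option value="s">Small</option>
  <option value="m" selected>Medium</option>
</select>
"""


class Collector(HTMLParser):
    """Records the link's attributes and text, and the selected <option> value."""

    def __init__(self):
        super().__init__()
        self.link_attrs = {}
        self.link_text = []
        self.selected_option = None
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("id") == "next":
            self.link_attrs = attrs
            self._in_link = True
        elif tag == "option" and "selected" in attrs:
            self.selected_option = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Collect only the text nodes, dropping nested tags such as <b>.
        if self._in_link:
            self.link_text.append(data)


collector = Collector()
collector.feed(HTML)

# Comparable to Extract (text only, HTML tags excluded).
text = "".join(collector.link_text)
# Comparable to Extract Tag Attribute.
href = collector.link_attrs["href"]
# Comparable to Extract URL: resolve the attribute value against the page URL.
absolute_url = urljoin(PAGE_URL, href)
# Comparable to Extract Form Parameter: read one parameter from the URL.
page_param = parse_qs(urlparse(absolute_url).query)["page"][0]
# Comparable to Extract Selected Option.
selected = collector.selected_option

print(text, href, absolute_url, page_param, selected, sep="\n")
```

Note how the relative value in the href attribute differs from the absolute URL: the former is what an attribute extraction returns, the latter is what a URL extraction is meant to return.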

To reformat (or normalize) the extracted content, use the Extract or Extract Tag Attribute action and configure data converters in the action's list of converters.
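
The idea of a converter list can be sketched in Python as a chain of small functions. This is only an analogy: it assumes that converters run in order, each receiving the output of the previous one, and the apply_converters helper and the sample converters are hypothetical, not Design Studio features.

```python
import re
from typing import Callable, List

# A converter is assumed here to be any function from string to string.
Converter = Callable[[str], str]


def apply_converters(value: str, converters: List[Converter]) -> str:
    """Pass the value through each converter in list order."""
    for convert in converters:
        value = convert(value)
    return value


# Hypothetical normalization steps for an extracted price string.
converters: List[Converter] = [
    str.strip,                                # trim surrounding whitespace
    lambda s: re.sub(r"\s+", " ", s),         # collapse runs of whitespace
    lambda s: s.replace("USD", "").strip(),   # drop the currency label
]

normalized = apply_converters("  USD   1,299.00 \n", converters)
print(normalized)
```

Keeping each converter small makes the order of the list the only thing that needs to be reasoned about when the output is not what was expected.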

Two additional actions extract data from binary data formats such as PDF or Flash. Unlike the preceding actions, they extract the data and produce an HTML page that presents it in a structured form your robot can access. Use these actions as an initial step before the actual data extraction, in which you can loop over the produced HTML and extract text.

  • The Extract Text from PDF action is used to extract text from a PDF document contained as binary data in a selected attribute.
  • The Extract from Flash action is used to extract data from a Flash object in a found tag.