Extract

The Extract action extracts text and stores it in a variable.

Several options control what content is extracted, such as only the text, or everything including the tags. Before the text is stored, it can be processed by a list of data converters and, optionally, trimmed of leading and trailing spaces.

The simplest way to use the Extract action is to extract from a single found tag. It is also possible to extract from a tag range, i.e. all tags from one found tag to another found tag.

Properties

The Extract action can be configured using the following properties:

Extract From

Specifies the part of the found tag that will be extracted.

  • Found Tag specifies that the entire found tag should be extracted.
  • Tag Range specifies that a range of tags should be extracted. You can select the begin and end tags, and whether to include these tags in the range.
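The difference between the two Extract From choices can be illustrated with a minimal sketch. It is not the product's implementation: the function name select_tag_range, the list-of-strings model of a page, and the two include flags are all assumptions made for illustration.

```python
from typing import List


def select_tag_range(tags: List[str], begin: int, end: int,
                     include_begin: bool = True,
                     include_end: bool = True) -> List[str]:
    """Return the tags from position `begin` to position `end`.

    `tags` is a hypothetical stand-in for sibling tags in document order.
    The two flags mirror the choice of whether the begin and end tags
    themselves are part of the range.
    """
    if begin > end:
        raise ValueError("the begin tag must not come after the end tag")
    start = begin if include_begin else begin + 1
    stop = end + 1 if include_end else end
    return tags[start:stop]


tags = ["<h2>", "<p>", "<ul>", "<hr>"]

# "Found Tag" corresponds to a range that begins and ends at the same tag.
print(select_tag_range(tags, 1, 1))

# "Tag Range" from <h2> to <hr>, excluding both boundary tags.
print(select_tag_range(tags, 0, 3, include_begin=False, include_end=False))
```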

Extract This

Specifies what content should be extracted.

  • Only Text specifies that only the text should be extracted.
  • Structured Text specifies that only the text should be extracted, but that it should be structured similarly to how it would appear in a browser. The action can guess the location of headings and insert text before them, after them, or both. You can set the following options.

    Include Aligned Tables and Images

    Specifies that tables and images aligned to the left or right of the text are included in the output text. Disabling this option can sometimes remove desired content.

    Include URLs

    Specifies that the actual URLs in link tags will be included in the output text.

    Include Image Text Alternatives

    Specifies that the text representation of images will be included in the output text.

    Include Form Fields

    Specifies that the text representation of form fields will be included in the output text.

    Insert This Before a Heading

    Specifies that this action should guess at the location of headings and insert the specified text before them.

    Insert This After a Heading

    Specifies that this action should guess at the location of headings and insert the specified text after them.

  • Advanced Structured Text specifies that only the text should be extracted, but that it should be structured similarly to how it would appear in a browser. Tag names can be converted into any text. You can set the following options.

    Include Aligned Tables and Images

    Specifies that tables and images aligned to the left or right of the text are included in the output text. Disabling this option can sometimes remove desired content.

    Include URLs

    Specifies that the actual URLs in link tags will be included in the output text.

    Include Image Text Alternatives

    Specifies that the text representation of images will be included in the output text.

    Include Form Fields

    Specifies that the text representation of form fields will be included in the output text.

    Tag Conversions

    Specifies the tag conversions to use. A tag conversion has the form tag=text. For instance, "<h1>=<head1>" and "</h1>=</head1>" would convert level 1 HTML headings to special <head1> tags. Note that the right-hand side of a conversion can be any text; it need not be an ordinary tag.

  • HTML specifies that the whole HTML should be extracted.

    Format HTML

    Specifies that the HTML should be pretty-printed.

    Encode URLs

    Specifies that URLs in attribute values should be HTML encoded. This is highly recommended, because it is necessary for generating standards-compliant HTML that works consistently across browsers. However, if the HTML will be subjected to simple processing that recognizes and compares URLs, it may be necessary to leave the URLs unencoded.

    Extract Relative URLs

    Specifies that all URLs should be extracted as relative URLs. That is, the base part of the URL, if present, is removed.

  • XML specifies that the whole XML should be extracted. This only works if the page is an XML page.

    Include XML Declaration

    Specifies that the XML declaration (e.g. <?xml version="1.0" encoding="UTF-8"?>) should, if present, be included in the extracted XML. This means that you can extract part of an XML document and get a new XML document with a proper declaration at the top.
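The effect of Include XML Declaration can be sketched outside the product with Python's standard library. This is only an illustration under stated assumptions: the function extract_xml, its path argument, and the sample document are hypothetical, and the sketch simply copies the source document's declaration onto the extracted fragment.

```python
import re
import xml.etree.ElementTree as ET


def extract_xml(document: str, path: str,
                include_declaration: bool = True) -> str:
    """Extract one element and optionally keep the source's XML declaration."""
    root = ET.fromstring(document)
    element = root.find(path)
    if element is None:
        raise LookupError(f"no element matches {path!r}")
    element.tail = None  # drop text that follows the element in the source
    fragment = ET.tostring(element, encoding="unicode")

    # Copy the declaration only if the source document actually has one.
    match = re.match(r"\s*(<\?xml\s.*?\?>)", document, re.DOTALL)
    if include_declaration and match:
        return match.group(1) + "\n" + fragment
    return fragment


doc = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<catalog><book id="1"><title>A</title></book>'
       '<book id="2"><title>B</title></book></catalog>')

# The extracted part is itself a complete XML document with a declaration.
print(extract_xml(doc, "book[@id='2']"))
```

With include_declaration set to False, or with a source document that has no declaration, only the bare fragment is returned.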

Converters

An optional list of data converters that should process the text.

Trim Spaces

If selected, spaces at the start and end of the text will be removed before storing the text in the variable.

Variable

Specifies the variable in which to store the extracted text.
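The Converters, Trim Spaces, and Variable properties can be read together as a small pipeline. The sketch below is an assumption-laden illustration, not the product's code: the function store_extracted, the dictionary used as a variable store, and the order (converters first, then trimming) are all hypothetical choices.

```python
from typing import Callable, Dict, Iterable

# A data converter is modeled here as any function from text to text.
Converter = Callable[[str], str]


def store_extracted(text: str,
                    converters: Iterable[Converter],
                    trim_spaces: bool,
                    variables: Dict[str, str],
                    name: str) -> str:
    """Run the converters in order, optionally trim, then store the result."""
    for convert in converters:
        text = convert(text)
    if trim_spaces:
        # str.strip() removes all leading and trailing whitespace; whether
        # the product trims only space characters is not stated in the text.
        text = text.strip()
    variables[name] = text
    return text


variables: Dict[str, str] = {}
store_extracted("  42 usd  ",
                converters=[str.upper, lambda s: s.replace("USD", "$")],
                trim_spaces=True,
                variables=variables,
                name="price")
print(variables["price"])
```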