Crawl Pages

The Crawl Pages action loops through the pages of a web site, crawling it one page at a time: the first iteration crawls the first page, the second iteration crawls the second page, and so on.

Note: The Crawl Pages step action only exists in the Classic browser; it cannot be used in WebKit.

The Crawl Pages action accepts a loaded page as part of the input, such as the start page of the web site. The output contains the next crawled web page.

Properties

The Crawl Pages action can be configured using the following properties:

Basic Tab

Crawling Strategy

This property specifies the crawling strategy. The Breadth First strategy crawls all pages at one depth before moving deeper, thus minimizing the page depth; the Depth First strategy follows each branch of the web site as deep as possible before backtracking, thus maximizing the page depth.

Maximum Depth

This property specifies the maximum depth of a page. The depth of a page is its distance from the first page, measured as the number of clicks and/or the number of items the mouse must be moved over (e.g. in a popup menu). The depth of the first page is zero. A page that exceeds the maximum depth will not be crawled.
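The interplay between the two crawling strategies and the Maximum Depth cut-off can be illustrated with a small sketch. This is plain Python over a made-up site graph, not the product's implementation; SITE stands in for the clicks and mouse-overs the action actually performs.

    from collections import deque

    # Hypothetical site graph standing in for the links/clicks on each page.
    SITE = {
        "start": ["a", "b"],
        "a": ["a1"],
        "b": ["b1"],
        "a1": [], "b1": [],
    }

    def crawl(start, strategy="breadth_first", max_depth=2):
        frontier = deque([(start, 0)])    # (page, depth); the first page has depth 0
        visited = set()
        while frontier:
            # Breadth First takes the oldest entry (a queue);
            # Depth First takes the newest (a stack).
            page, depth = frontier.popleft() if strategy == "breadth_first" else frontier.pop()
            if page in visited:
                continue
            visited.add(page)
            yield page, depth
            if depth < max_depth:         # pages beyond Maximum Depth are never crawled
                frontier.extend((link, depth + 1) for link in SITE[page])

    print(list(crawl("start", "breadth_first")))  # start, a, b, a1, b1 - shallow pages first
    print(list(crawl("start", "depth_first")))    # start, b, b1, a, a1 - one branch at a time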

Ignore Pages with Errors

This property specifies whether pages with errors are skipped silently. Note that an error is only generated if this property is unchecked, and if the general options of the action do not specify that the particular type of error (e.g. JavaScript error or load error) should be ignored.

Options

The robot's options can be overridden with the step's own options. An option that is marked with an asterisk in the Options Dialog will override the one from the robot's configuration. All other options will be the same as specified for the robot.

Crawling Tab

Crawl these Windows

These properties specify which windows are crawled.

The starting point of the crawling is the current window and - if the Frames property is checked - its frames. Other top-level windows present at the start will only be crawled if the Popup Windows property is checked, and not until new pages have been loaded into them.

Frames

This property specifies whether frames are crawled.

Popup Windows

This property specifies whether popup windows are crawled. Popup windows are defined as top-level windows other than the window that was the current window at the start of the crawling.

Click these Tags

These properties specify the HTML tags that the Crawl Pages action should attempt to click.

Links

Hyperlinks (A tags).

Buttons

Input tags with type="button", type="submit" or type="image".

Image Maps

Images with client-side image maps. Note that the image tag itself must be within the crawled area of the page, while the map need not be.

Other Clickable Tags

Tags with JavaScript onClick event handlers.

Other

Automatically Handle Popup Menus

This property specifies whether to automatically include popup menus in the crawled area of the page. It only takes effect if a partial area of the page has been selected for crawling, either by setting up one or more tag finders for the first page or - for subsequent pages - by making a Crawling Rule with a Crawl Selected Parts of Page definition.

Move Mouse Over Tags

This property specifies whether the mouse should be moved over tags that support the relevant JavaScript event handlers (onMouseOver, onMouseEnter or onMouseMove). This is typically necessary for popup menus.
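As a rough illustration of the tag types listed above, the following Python sketch scans HTML for links, buttons, image maps, tags with onClick handlers, and tags with mouse-over handlers. It uses the standard html.parser module and is only an approximation, not the product's matching logic.

    from html.parser import HTMLParser

    class ClickableTagFinder(HTMLParser):
        # Approximates the categories above; not the product's actual logic.
        MOUSE_EVENTS = ("onmouseover", "onmouseenter", "onmousemove")

        def __init__(self):
            super().__init__()
            self.candidates = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "a" and "href" in a:                          # Links
                self.candidates.append(("link", a["href"]))
            elif tag == "input" and a.get("type") in ("button", "submit", "image"):
                self.candidates.append(("button", a["type"]))       # Buttons
            elif tag == "img" and "usemap" in a:                    # Image maps
                self.candidates.append(("imagemap", a["usemap"]))
            elif "onclick" in a:                                    # Other clickable tags
                self.candidates.append(("click", tag))
            elif any(e in a for e in self.MOUSE_EVENTS):            # Move Mouse Over Tags
                self.candidates.append(("mouseover", tag))

    finder = ClickableTagFinder()
    finder.feed('<a href="/next">Next</a>'
                '<input type="submit">'
                '<div onclick="go()">Go</div>'
                '<li onmouseover="showMenu()">Products</li>')
    print(finder.candidates)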

Rules Tab

The first page is handled specially: whether it is output is determined by the Output the Input Page property on the Output tab. If only a particular area of the first page should be crawled, the area(s) are selected using tag finders on the Crawl Pages step.

For pages other than the first page, crawling rules can be set up.

Crawling Rules

Each crawling rule has the following properties:

Apply to these Pages

This property specifies a condition on the URLs of the pages to which this rule applies.

How to Crawl

This property specifies how the page should be crawled.

Crawl Entire Page

The entire page should be crawled.

Crawl Selected Parts of Page

Only parts of the page should be crawled. The included and excluded areas of the page are specified using tag finders, which can conveniently be copied from another step. If no included areas are specified, the entire page - except the specified excluded areas - is crawled.

Do Not Crawl

The page should not be crawled.

Output the Page

This property specifies whether the page should be output.

Rule Description

Here, you may specify a custom description of the crawling rule. This description will be shown in the list of crawling rules.

If multiple rules apply to a given page, the last applicable rule in the list overrides the preceding ones and takes effect. This makes it possible to, for example, first create a general rule stating that all pages in the domain yourdomain.com should be crawled, and then add a specific rule stating that the page http://yourdomain.com/uninteresting.html should not be crawled.
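The last-rule-wins behavior can be sketched as follows. The rule representation here (a URL pattern plus a How to Crawl choice and an Output the Page flag) is a simplification for illustration, not the product's internal format.

    import re

    # Hypothetical rules, mirroring the example above: a general rule
    # followed by a more specific one that overrides it.
    rules = [
        (r"^https?://(www\.)?yourdomain\.com/", "CRAWL_ENTIRE_PAGE", True),
        (r"^http://yourdomain\.com/uninteresting\.html$", "DO_NOT_CRAWL", False),
    ]

    def effective_rule(url, rules, default=("CRAWL_ENTIRE_PAGE", True)):
        # The last rule in the list whose condition matches the URL wins.
        action = default
        for pattern, how, output in rules:
            if re.search(pattern, url):
                action = (how, output)    # later matches override earlier ones
        return action

    print(effective_rule("http://yourdomain.com/uninteresting.html", rules))
    # -> ('DO_NOT_CRAWL', False)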

For all Other Pages

This property specifies how all other pages are handled, i.e. pages other than the first page and pages covered by a specific crawling rule.

Crawl Entire Page

The entire page is crawled and output.

Do Not Crawl

The page is neither crawled nor output.

Crawl Only These Domains

This property specifies the domains that may be crawled. If left blank, all domains may be crawled. Multiple domains can be specified, separated by spaces.
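A sketch of how such a domain filter might behave is shown below. Whether subdomains of a listed domain are included is an assumption made for this illustration; the property's description does not specify it.

    from urllib.parse import urlparse

    def domain_allowed(url, allowed_domains):
        # allowed_domains is the space-separated value of Crawl Only These
        # Domains; a blank value means every domain is allowed.
        domains = allowed_domains.split()
        if not domains:
            return True
        host = urlparse(url).hostname or ""
        # Assumption: subdomains of a listed domain also match.
        return any(host == d or host.endswith("." + d) for d in domains)

    print(domain_allowed("http://www.yourdomain.com/page",
                         "yourdomain.com othersite.com"))   # True
    print(domain_allowed("http://elsewhere.com/", "yourdomain.com"))  # False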

Note: A page that is set not to be crawled and not to be output will not be loaded if the link pointing to it is an anchor or area tag with no JavaScript event handlers. If JavaScript event handlers are involved, or if the page is loaded through JavaScript execution in general, be aware that the page may be loaded anyway. Even so, it will not be output.

If at any time during the crawling one of the windows (be it a frame or a top-level window) should be output, all of the windows will be made available to the steps following the step with the Crawl Pages action.

Visited Pages Tab

Skip Already Visited Pages

This property specifies whether already visited pages should be skipped, which is usually the case. The following properties specify how visited pages are detected:

Detect Already Visited Pages by URL

This property specifies whether visited pages should be detected using their URL. For anchor tags with no JavaScript event handlers, this is done by checking the linked URL, so the page will not be loaded a second time. In other cases (buttons, tags with JavaScript event handlers, etc.), and for anchor tags with a not-yet-visited linked URL, the resolved URL of the page is checked after it has been loaded.

Detect Already Visited Pages by Content

This property specifies whether visited pages should be detected by content. This ensures that pages with different URLs but identical content are not crawled again. For instance, http://www.yourdomain.com/ and http://www.yourdomain.com/index.html may point to the same page even though the URLs are different.
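Together, the two detection mechanisms amount to keeping a set of seen URLs and a set of seen content fingerprints. The sketch below uses a SHA-256 hash as the fingerprint; the product's actual URL normalization and content comparison are not specified here.

    import hashlib

    seen_urls, seen_hashes = set(), set()

    def already_visited(resolved_url, content):
        # Detect by URL and by content; either match counts as visited.
        digest = hashlib.sha256(content.encode()).hexdigest()
        if resolved_url in seen_urls or digest in seen_hashes:
            return True
        seen_urls.add(resolved_url)
        seen_hashes.add(digest)
        return False

    # Two URLs pointing at the same page, as in the example above:
    print(already_visited("http://www.yourdomain.com/", "<html>same</html>"))
    # -> False (first visit)
    print(already_visited("http://www.yourdomain.com/index.html", "<html>same</html>"))
    # -> True (different URL, identical content)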

Output Tab

Output the Input Page

This property specifies whether the first page should be output. If enabled, the output of the first iteration (iteration 1) equals the input.

Output Page Again if Changed

This property specifies whether a given page should be output again if clicking or moving the mouse over some tag does not result in a page load. For instance, moving the mouse over an item that opens a popup menu will not result in a page load, so if you want to process the page with the popup menu visible, this property must be checked. Note that regardless of the value of this property, the page is always crawled again to detect any added tags.

Show Overview Page

This property specifies whether to open a new window showing an overview page. The overview page contains a list of the URLs from each step up to the current point of the crawling. The URLs of pages that were visited but not output are shown in gray.

Store Current Depth Here

This property specifies a variable into which the current depth is stored.

Store Current Path Here

This property specifies a variable into which the current path is stored. The elements of the path are separated by semicolons; each element consists of a space-separated list of the URLs of the windows at that point of the crawling.
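For example, a value of this variable might look like the following (a hypothetical crawl where two windows were open at the second point along the path), and it can be split back into its elements like this:

    # Hypothetical path value: three points along the crawl.
    current_path = ("http://yourdomain.com/;"
                    "http://yourdomain.com/a.html http://yourdomain.com/menu.html;"
                    "http://yourdomain.com/a/b.html")

    for depth, element in enumerate(current_path.split(";")):
        urls = element.split()      # each element is a space-separated URL list
        print(depth, urls)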