Crawl Pages
The Crawl Pages action loops through the pages of a web site; in effect, it crawls the web site one page at a time. The first iteration crawls the first page, the second iteration the second page, and so on.
The Crawl Pages action accepts a loaded page as part of the input, such as the start page of the web site. The output contains the next crawled web page.
Properties
The Crawl Pages action can be configured using the following properties:
Basic Tab
- Crawling Strategy
-
This property specifies the crawling strategy. The Breadth First strategy crawls pages in order of increasing depth, visiting all pages at one depth before moving deeper. The Depth First strategy follows each chain of links as deep as possible before backtracking.
- Maximum Depth
-
This property specifies the maximum depth of a page. The depth of a page is its distance from the first page measured in number of clicks and/or number of items that the mouse must be moved over (e.g. in a popup menu). The depth of the first page is zero. If a page exceeds the maximum depth, then it will not be crawled.
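The two strategies and the depth cut-off can be illustrated with a small frontier-based sketch in Python. This is not the product's implementation; the link graph, function name, and parameters are hypothetical, chosen only to show how a queue (Breadth First) versus a stack (Depth First) changes the crawl order, and how pages beyond the maximum depth are skipped:

```python
from collections import deque

# Illustrative link graph standing in for real pages and their links.
LINKS = {
    "start": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": ["e"],
}

def crawl(start, strategy="breadth", max_depth=2):
    """Return pages in crawl order; links beyond max_depth are not followed."""
    frontier = deque([(start, 0)])  # (page, depth); the depth of the first page is zero
    visited, order = {start}, []
    while frontier:
        # Breadth First takes the oldest entry (queue); Depth First the newest (stack).
        page, depth = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(page)
        if depth == max_depth:
            continue  # pages linked from here would exceed the maximum depth
        for link in LINKS.get(page, []):
            if link not in visited:
                visited.add(link)
                frontier.append((link, depth + 1))
    return order
```

With a maximum depth of 2, the page "e" (at depth 3) is never crawled under either strategy.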
- Ignore Pages with Errors
-
This property specifies whether pages with errors are skipped silently. Note that an error is only generated if this property is unchecked, and if the general options of the action do not specify that the particular type of error (e.g. JavaScript error or load error) should be ignored.
- Options
-
The robot's options can be overridden with the step's own options. An option that is marked with an asterisk in the Options Dialog will override the one from the robot's configuration. All other options will be the same as specified for the robot.
Crawling Tab
- Crawl these Windows
-
These properties specify which windows are crawled.
The starting point of the crawling is the current window and - if the Frames property is checked - its frames. Other top-level windows present at the start will only be crawled if the Popup Windows property is checked, and not until new pages have been loaded into them.
- Frames
-
This property specifies whether frames are crawled.
- Popup Windows
-
This property specifies whether popup windows are crawled. Popup windows are defined as top-level windows other than the window that was the current window at the start of the crawling.
- Click these Tags
-
These properties specify the HTML tags that the Crawl Pages action should attempt to click.
- Links
-
Hyperlinks (A tags).
- Buttons
-
Input tags of type "button", "submit", and "image".
- Image Maps
-
Images with client-side image maps. Note that the image tag itself must be within the crawled area of the page, while the associated map need not be.
- Other Clickable Tags
-
Tags with JavaScript onClick event handlers.
- Other
-
- Automatically Handle Popup Menus
-
This property specifies whether to automatically include popup menus in the crawled area of the page. It only takes effect if a partial area of the page has been selected for crawling, either by setting up one or more tag finders for the first page or - for subsequent pages - by making a Crawling Rule with a Crawl Selected Parts of Page definition.
- Move Mouse Over Tags
-
This property specifies whether the mouse should be moved over tags that support the relevant JavaScript event handlers (onMouseOver, onMouseEnter or onMouseMove). This is typically necessary for popup menus.
Rules Tab
The first page is handled specially: whether it is output is determined by the Output the Input Page property on the Output tab. If only particular areas of the first page should be crawled, those areas are selected using tag finders on the Crawl Pages step.
For pages other than the first page, crawling rules can be set up.
- Crawling Rules
-
Each crawling rule has the following properties:
- Apply to these Pages
-
This property specifies a condition on the URLs of the pages to which this rule applies.
- How to Crawl
-
This property specifies how the page should be crawled.
- Crawl Entire Page
-
The entire page should be crawled.
- Crawl Selected Parts of Page
-
Only parts of the page should be crawled. The included and excluded areas of the page are specified using tag finders, which can conveniently be copied from a step. If no included areas are specified, the entire page, except the specified excluded areas, is crawled.
- Do Not Crawl
-
The page should not be crawled.
- Output the Page
-
This property specifies whether the page should be output.
- Rule Description
-
Here, you may specify a custom description of the crawling rule. This description will be shown in the list of crawling rules.
If multiple rules apply to a given page, the last applicable rule in the list overrides the preceding rules and takes effect. This makes it possible, for example, to first create a general rule stating that all pages on the domain yourdomain.com should be crawled, and then add a more specific rule stating that the page http://yourdomain.com/uninteresting.html should not be crawled.
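The last-match-wins resolution can be sketched as follows. This is an illustrative Python sketch, not the product's rule engine; the glob-style URL patterns and the rule representation are assumptions made for the example:

```python
from fnmatch import fnmatch

# Hypothetical rule list: (URL pattern, How to Crawl). Order matters:
# later entries override earlier ones.
RULES = [
    ("http://yourdomain.com/*", "Crawl Entire Page"),               # general rule
    ("http://yourdomain.com/uninteresting.html", "Do Not Crawl"),   # specific override
]

def how_to_crawl(url, rules=RULES, default="Crawl Entire Page"):
    """Return the action of the last rule matching the URL, or the default."""
    action = default
    for pattern, rule_action in rules:  # a later match overrides an earlier one
        if fnmatch(url, pattern):
            action = rule_action
    return action
```

Here http://yourdomain.com/uninteresting.html matches both rules, so the later, more specific rule wins and the page is not crawled.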
- For all Other Pages
-
This property specifies how pages are handled when no crawling rule applies to them; the first page, which is handled separately, is also excluded.
- Crawl Entire Page
-
The entire page is crawled and output.
- Do Not Crawl
-
The page is neither crawled nor output.
- Crawl Only These Domains
-
This property specifies the domains that may be crawled. If left blank, all domains may be crawled. Multiple domains can be specified, separated by spaces.
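A minimal sketch of such a domain check is shown below. The function name is hypothetical, and the assumption that subdomains of a listed domain also match (e.g. www.yourdomain.com for yourdomain.com) is made for the example; the product's exact matching behavior may differ:

```python
from urllib.parse import urlparse

def domain_allowed(url, domains):
    """Check a URL against a space-separated domain list; blank allows all."""
    if not domains.strip():
        return True  # an empty list places no restriction on the crawl
    host = urlparse(url).hostname or ""
    # A URL is allowed if its host equals a listed domain or is a subdomain of one.
    return any(host == d or host.endswith("." + d) for d in domains.split())
```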
If at any time during the crawling one of the windows (be it a frame or a top-level window) should be output, all of the windows will be made available to the steps following the step with the Crawl Pages action.
Visited Pages Tab
- Skip Already Visited Pages
-
This property specifies whether already visited pages should be skipped, which is usually the case. The following properties specify how visited pages are detected:
- Detect Already Visited Pages by URL
-
This property specifies whether visited pages should be detected using their URL. For anchor tags without JavaScript event handlers, the linked URL is checked before loading, so an already visited page is not loaded a second time. In other cases (buttons, tags with JavaScript event handlers, etc.), and for anchor tags whose linked URL has not been visited, the resolved URL of the page is checked after it has been loaded.
- Detect Already Visited Pages by Content
-
This property specifies whether visited pages should be detected by content. This ensures that pages with different URLs but identical content are not crawled again. For instance, http://www.yourdomain.com/ and http://www.yourdomain.com/index.html may point to the same page even though the URLs are different.
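The two detection mechanisms can be sketched as a pair of sets: one of URLs checked before (or after) loading, and one of content digests checked after loading. This is an illustrative Python sketch with hypothetical names, not the product's internal bookkeeping:

```python
import hashlib

class VisitedTracker:
    """Sketch: detect already visited pages by URL and by content digest."""

    def __init__(self):
        self.urls = set()
        self.hashes = set()

    def seen_url(self, url):
        """True if this URL was recorded before; records it otherwise."""
        if url in self.urls:
            return True
        self.urls.add(url)
        return False

    def seen_content(self, html):
        """True if identical content was recorded before, even under another URL."""
        digest = hashlib.sha256(html.encode()).hexdigest()
        if digest in self.hashes:
            return True
        self.hashes.add(digest)
        return False
```

Content detection catches the case where http://www.yourdomain.com/ and http://www.yourdomain.com/index.html serve the same page: the URLs differ, but the second load produces a digest that is already recorded.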
Output Tab
- Output the Input Page
-
This property specifies whether the first page should be output. If enabled, the output of the first iteration (iteration 1) equals the input.
- Output Page Again if Changed
-
This property specifies whether a given page should be output again if clicking or moving the mouse over some tag does not result in a page load. For instance, moving the mouse over an item that opens a popup menu will not result in a page load, so if you want to process the page with the popup menu visible, this property must be checked. Note that regardless of the value of this property, the page is always crawled again to detect any added tags.
- Show Overview Page
-
This property specifies whether to open a new window showing an overview page. The overview page contains a list of the URLs from each step up to the current point of the crawling. The URLs of pages that were visited but not output are shown in gray.
- Store Current Depth Here
-
This property specifies a variable into which the current depth is stored.
- Store Current Path Here
-
This property specifies a variable into which the current path is stored. The elements of the path are separated by semicolons, where each element consists of a space-separated list of the URLs at that point of the crawling.
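Given that format, the stored path can be split back into its elements as follows. The helper function and the sample URLs are hypothetical; only the separator conventions (semicolons between elements, spaces between URLs) come from the description above:

```python
def parse_path(path):
    """Split a stored crawl path: semicolon-separated elements, each a
    space-separated list of URLs."""
    return [element.split() for element in path.split(";") if element]
```

For example, a path of two elements, the second covering a page with two frames, parses into a list of two URL lists.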