Rewrite Page

The Rewrite Page step action extracts the HTML content of the page in the current window, together with the HTML of any frames it contains. It also outputs the links to other pages and the URLs of the images, style sheets and other resources that the page depends on. Later, the page can be viewed offline exactly as it was at the time of the extraction.

All JavaScript and event handlers are removed from the extracted HTML, because the extracted HTML represents the result of having already loaded the page and its frames and executed any JavaScript that could generate additional content. All URLs of the page are rewritten, first according to a user-specified transformation, and are then converted to relative URLs. URLs in inline style sheets are also rewritten.

The external style sheets whose URLs are output by the step action should be run through the Rewrite Style Sheet step action, which applies a similar transformation: it rewrites the URLs of imported style sheets and of images referenced in the style sheet.

The Rewrite Page step action is intended to be used in robots that have an external controller that feeds URLs of pages, style sheets and other resources to be rewritten into the robot.
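As a rough illustration of what such an external controller might do, the following Python sketch keeps a queue of URLs still to be processed and a set of URLs already seen, so that shared resources are handled only once. The names crawl, run_robot and fake_robot are hypothetical; in particular, run_robot stands in for whatever mechanism the controller uses to execute the robot and collect the URLs it outputs, which this page does not describe.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(start_url: str,
          run_robot: Callable[[str], Iterable[str]],
          limit: int = 100) -> List[str]:
    """Feed URLs to the robot breadth-first, visiting each URL at most once.

    run_robot is a hypothetical callable: it takes one URL, runs the robot
    on it, and returns the URLs the robot reported for that page.
    """
    queue = deque([start_url])
    seen = {start_url}
    processed: List[str] = []
    while queue and len(processed) < limit:
        url = queue.popleft()
        processed.append(url)
        for found in run_robot(url):
            if found not in seen:  # skip URLs that are already queued or done
                seen.add(found)
                queue.append(found)
    return processed


# Stand-in for a real robot run: a tiny, made-up site used for illustration.
FAKE_SITE: Dict[str, List[str]] = {
    "http://example.com/": ["http://example.com/a.html",
                            "http://example.com/style.css"],
    "http://example.com/a.html": ["http://example.com/",
                                  "http://example.com/style.css"],
}


def fake_robot(url: str) -> List[str]:
    return FAKE_SITE.get(url, [])


if __name__ == "__main__":
    for visited in crawl("http://example.com/", fake_robot):
        print(visited)
```

The limit parameter is only a safety net for the sketch; a real controller would more likely restrict the crawl by domain or depth.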

Related Step Actions

To create a quick offline snapshot of a page, use the Make Snapshot step action. It does not require the robot to be controlled by an external application. Instead, it downloads and saves all necessary resources to the file system in a single step, forming a complete, stand-alone snapshot.

Unlike the Rewrite Page step action, the Make Snapshot step action does not preserve links between different snapshots and does not reuse shared resources between snapshots.

Note: Execution of this step is controlled by the license key.

Properties

The Rewrite Page step action can be configured using the following properties:

Original Page URL

Specify the variable containing the original URL of the page in the current window. This is the URL that was used to load the page. Note that the current URL of the page may be different if the server redirected to a different page than the one that was requested.

Data Converters

The data converters that specify the transformation to perform on the URLs of the page. This can be used to specify the transformation from URL to a location in the file system. The data converters should output an absolute URL (which may be a file URL), which the step action will automatically convert to a URL that is relative to the original page URL. For advanced URL rewriting, we recommend the Convert Using JavaScript data converter.
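The kind of transformation described above can be sketched in Python. This is only an illustration of the mapping logic, not the data converter API: MIRROR_ROOT, to_file_url and relative_to are hypothetical names, and the layout (one directory per host, index.html for directory URLs, query strings ignored) is an assumption. The step action performs the conversion to a relative URL itself; relative_to merely shows what the result could look like, assuming it is computed against the rewritten location of the page.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical root of the offline copy in the file system.
MIRROR_ROOT = "file:///mirror"


def to_file_url(url: str) -> str:
    """Map an http(s) URL to an absolute file URL under MIRROR_ROOT.

    Assumed layout: MIRROR_ROOT/<host>/<path>; directory URLs get
    index.html appended. Query strings and fragments are ignored here.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return f"{MIRROR_ROOT}/{parts.netloc}{path}"


def relative_to(page_file_url: str, target_file_url: str) -> str:
    """Express target_file_url relative to the directory of page_file_url."""
    base_dir = posixpath.dirname(urlsplit(page_file_url).path)
    target = urlsplit(target_file_url).path
    return posixpath.relpath(target, start=base_dir)


if __name__ == "__main__":
    page = to_file_url("http://example.com/docs/")
    image = to_file_url("http://example.com/img/logo.png")
    print(page)
    print(relative_to(page, image))
```

A real converter would also need a policy for query strings, since two URLs that differ only in their query would otherwise map to the same file.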

Extracted Pages

The variable in which to store the extracted pages. The step action extracts the HTML of the page in the current window as well as the HTML of each of its frames. The result is output in JSON format, which also contains the original URL and the rewritten URL of each page. Note, however, that the original URL is specified only for the main page.

To load the JSON output into a window, use the Create Page step action with the name of the variable containing the JSON as its source of content. In the Options of the step, you may need to explicitly specify that the content type is JSON and that the encoding is UTF-8.

URLs

The variable in which to store the extracted URLs. The step action will extract the URLs of all pages, images, style sheets and other resources directly linked to by the page and its frames. Note that the style sheets and pages linked to may themselves contain URLs; these are not included in the list.

The URLs are output in JSON format, giving both the original URL and the absolute rewritten URL of each URL. The type of each URL is also given. It is determined by the context in which the URL occurs; for instance, all URLs found in <IMG> tags are marked with type IMAGE.

The available types are:

PAGE

A link found in an anchor tag. Note that this does not imply anything about the content type of that page, as it has not yet been loaded.

IMAGE

An image.

STYLESHEET

An external CSS style sheet.

RESOURCE

A binary resource, for instance a PDF found in a frame or a Flash object.

To load the JSON output into a window, use the Create Page step action with the name of the variable containing the JSON as its source of content. In the Options of the step, you may need to explicitly specify that the content type is JSON and that the encoding is UTF-8.
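An external controller could also read the URLs output directly and decide, by type, what to do with each entry. The Python sketch below shows one way to do this. The exact JSON layout is not given on this page, so the field names originalUrl, rewrittenUrl and type, the sample data, and the choice to simply download IMAGE and RESOURCE entries are all assumptions made for illustration.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Hypothetical sample of the URLs output; the real field names may differ.
SAMPLE = """[
  {"originalUrl": "http://example.com/a.html",
   "rewrittenUrl": "file:///mirror/example.com/a.html", "type": "PAGE"},
  {"originalUrl": "http://example.com/style.css",
   "rewrittenUrl": "file:///mirror/example.com/style.css", "type": "STYLESHEET"},
  {"originalUrl": "http://example.com/logo.png",
   "rewrittenUrl": "file:///mirror/example.com/logo.png", "type": "IMAGE"},
  {"originalUrl": "http://example.com/manual.pdf",
   "rewrittenUrl": "file:///mirror/example.com/manual.pdf", "type": "RESOURCE"}
]"""

# Assumed handling per type: pages and style sheets go back into the robot,
# images and binary resources are downloaded unchanged.
NEXT_STEP: Dict[str, str] = {
    "PAGE": "Rewrite Page",
    "STYLESHEET": "Rewrite Style Sheet",
    "IMAGE": "download",
    "RESOURCE": "download",
}


def group_by_type(urls_json: str) -> Dict[str, List[str]]:
    """Group the original URLs by their type field."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entry in json.loads(urls_json):
        groups[entry["type"]].append(entry["originalUrl"])
    return dict(groups)


if __name__ == "__main__":
    for url_type, urls in group_by_type(SAMPLE).items():
        for url in urls:
            print(f"{NEXT_STEP[url_type]}: {url}")
```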