Date Extraction - Simple Case


Thank you for using Kofax Kapow.

This tutorial will show you how to extract dates from a website using the Date Extraction step in your robot. The tutorial consists of two parts. The first part is a follow-along tutorial with a simple example; the second part is a real-case scenario showing a trickier example.

Please feel free to follow along on your computer for this first part.

If you have completed the beginner tutorials, you might remember the News Magazine robot. It extracts the most recent stories from News Magazine, which is a site made for tutorial purposes. We want to modify the robot to also extract the date and time of the most recent articles.

Start Design Studio and open the Type called post from the Beginner Tutorials folder in the default project. Add a new attribute called date and give it the type Date. Now, save the Type, close the Type Editor and open the NewsMagazine robot from the same folder.

Click the Return Value step in the robot view. The robot will now execute to this step. I am going to close the projects view and the source view to get some more space to work with.

Scroll down in the browser view and locate the date and time given above each picture. We are going to add a step which extracts the date and time from each of the three articles.

To do this, we right-click the tag containing the time and date, select Extract > Extract Date, and choose the variable post.date.

A new window opens which will help us convert the date into the standard date format. This is the only format accepted by the simple type Date, and thus by our post.date variable.

Let's look at the important parts of the Extract Date Configuration window. The field Test Input shows the raw text as extracted. In the field called Pattern, we write the pattern of the date exactly as it appears in the extracted text. Right now the pattern is "dd MM yyyy", which means that the date consists of day, month and year, separated by spaces.

We see that this corresponds to the format of the date in the extracted text. Design Studio has deduced the pattern for us, so if we only want to extract the date, we do not have to modify anything. The field called Test Output shows us the date in the standard date format as extracted from the Test Input.
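For readers who think in code, the relationship between Test Input, Pattern and Test Output can be sketched in Python. This is only an illustration and not how Design Studio works internally. The sample text is hypothetical (the tutorial does not show the actual Test Input), MM is assumed to be a numeric month as in Java-style date patterns, and the printed ISO-style form merely stands in for the standard date format.

```python
from datetime import datetime

# Hypothetical raw text; the real Test Input is not shown in this tutorial.
test_input = "24 12 2013"

# "dd MM yyyy" written with Python's strptime directives (assumed mapping):
#   dd -> %d (day of month), MM -> %m (numeric month), yyyy -> %Y (four-digit year)
parsed = datetime.strptime(test_input, "%d %m %Y")

# A normalized, unambiguous rendering, used here as a stand-in for Test Output.
test_output = parsed.isoformat(sep=" ")
print(test_output)
```

Because no time was part of the pattern, the time of day in the result defaults to midnight.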

Let's say that we also want to extract the exact time that each article was published. This means that we have to expand our pattern to also capture this information. Keeping an eye on the Test Input, we add " * hh:mm" to the Pattern. Notice that the Test Output changes to incorporate the time of day.

Let me explain the pattern we added. The spaces and the colon correspond to their respective characters in the Test Input. The asterisk corresponds to "at" in the Test Input, but can in general be used to represent any number of non-whitespace characters. "hh" means hours and "mm" means minutes. To get a full list of the symbols you can use in the Pattern field, click the question mark at the top right of the window.
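The pattern rules just described can be sketched as a small Python function that turns a pattern into a regular expression. This is a rough illustration under stated assumptions, not Design Studio's actual implementation: the helper names pattern_to_regex and extract_date are hypothetical, the sample text is made up, only the five symbols mentioned in this tutorial are handled, and "hh" is read simply as a one- or two-digit hour.

```python
import re
from datetime import datetime

# Pattern symbol -> (field name, regex). Assumed semantics, taken only from
# the explanation in this tutorial; the real pattern language has more symbols.
TOKENS = {
    "yyyy": ("year", r"\d{4}"),
    "dd": ("day", r"\d{1,2}"),
    "MM": ("month", r"\d{1,2}"),
    "hh": ("hour", r"\d{1,2}"),
    "mm": ("minute", r"\d{1,2}"),
}

def pattern_to_regex(pattern):
    """Translate a date pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        for token, (name, regex) in TOKENS.items():
            if pattern.startswith(token, i):
                parts.append(f"(?P<{name}>{regex})")
                i += len(token)
                break
        else:
            ch = pattern[i]
            # "*" stands for any number of non-whitespace characters;
            # every other character (space, colon) must match literally.
            parts.append(r"\S*" if ch == "*" else re.escape(ch))
            i += 1
    return re.compile("".join(parts))

def extract_date(text, pattern):
    """Return a datetime if the text starts with the pattern, else None."""
    match = pattern_to_regex(pattern).match(text.strip())
    if match is None:
        return None
    fields = {name: int(value) for name, value in match.groupdict().items()}
    # Year, month and day are required; hour and minute default to 0.
    return datetime(fields["year"], fields["month"], fields["day"],
                    fields.get("hour", 0), fields.get("minute", 0))

# Hypothetical Test Input in the shape the tutorial describes.
sample = "24 12 2013 at 14:35"
print(extract_date(sample, "dd MM yyyy"))          # date only
print(extract_date(sample, "dd MM yyyy * hh:mm"))  # date and time
```

The first call mirrors the pattern deduced by Design Studio, and the second mirrors the extended pattern, where the asterisk absorbs the word "at".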

Click OK. We have now successfully extracted the time and date from each article. Run the robot in Debug Mode to confirm this.