Patterns

Click to watch a video on patterns and their use in Design Studio.

Hello. This video will take a closer look at regular expressions, also called patterns in Kofax Kapow. The first half of the video will be a lecture-like presentation of the syntax including wild cards, sets, subpatterns, repetition operators, alternate subpatterns, and subpattern references. The second half will go through three examples in Design Studio using patterns to create conditions and tag finders and to perform data conversion. If you are already familiar with regular expressions, you might want to skip directly to the examples. Answers to the problems given in the video can be found at the bottom of the page.

As mentioned earlier, regular expressions are called patterns in Kofax Kapow, and will be referred to as patterns for the remainder of this video.

The Wild Card

A pattern is a way to put a string of characters into more general terms by using symbols to represent strings of characters. You might be familiar with the concept of doing searches on your computer where it is sometimes possible to use a wild card symbol to represent any character. A search for "ca*" (the asterisk here being a wild card) might return "cap", "car", "can", and so on. Patterns embrace the same concept while expanding it into a much more extensive syntax, which will be presented in this video.

Kofax Kapow uses the Perl5 syntax for its patterns. In this syntax, the wild card character is written as '.' (a dot or period), which corresponds to any single character, including all symbols, white spaces, and any other special characters you could think of. This correspondence is called matching. For example, the pattern "ca." matches "cap", "car", "can", or any other string that is a "c" followed by an "a" followed by any single character. Similarly, the pattern ".a." matches "nap", "tan", "sad", or any other string that has three characters with an "a" in the middle. However, it does not match "an", since each dot in the pattern has to match exactly one character. Nor does it match "cans", since the pattern has to match the entire string, not just part of it.
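
The whole-string matching rule can be tried outside Design Studio as well. The following is only a sketch, assuming Python's re module as a stand-in for the Kapow pattern engine (the two syntaxes overlap for these simple cases but are not identical). The helper name matches is hypothetical; re.fullmatch is used because the pattern has to cover the entire string, and re.DOTALL is an assumption made so that the dot also accepts line breaks.

```python
import re

# Hypothetical helper: True when the pattern matches the ENTIRE text.
# Assumption: Python's re module approximates Kapow patterns here.
def matches(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

print(matches("ca.", "cap"))   # True
print(matches(".a.", "tan"))   # True
print(matches(".a.", "an"))    # False: each dot needs exactly one character
print(matches(".a.", "cans"))  # False: the whole string must match
```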

We can test whether a pattern matches a given string directly in Design Studio by using the Pattern Editor. The Pattern Editor can, for example, be found by inserting a Test Tag step into a robot and clicking Edit below the pattern field in the step action view. The Pattern Editor has three sections. At the top, you can type in a pattern; the pattern is then matched against a string typed into the Input field on the left. Clicking the Test button or using the shortcut Ctrl+Enter will tell you whether the pattern matches the input.

Try typing ".a." in the Pattern field and "can" in the Input field. Then, use Ctrl+Enter. The Output field will now display "The pattern matches the input." We can ignore the rest of the output for now. If, on the other hand, we type just "an" in the Input field and press Ctrl+Enter, we receive the message "The pattern does not match the input." As I go over the pattern syntax, try to experiment with the Pattern Editor to test your understanding of the material.

Although not stated explicitly, we are now able to match in two different ways: either character to character (for example, the pattern "a" matches the string "a"), or with the wild card symbol "." to match any character. Additional direct character matches are listed in the table below.

Pattern    Matches the string
'\n'       A line break character.
'\r'       A carriage return character.
'\t'       A tab character.
'\.'       "."
'\\'       "\"

We can, for example, match a line break character (\n), a carriage return character (\r), a tab character (\t), a period (\.) or a backslash (\\).

Any other symbol used by the pattern syntax can also be matched explicitly by preceding it with a backslash.
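
The difference between a wild card dot and an escaped dot is easy to miss, so here is a small sketch of it. It assumes Python's re module as a stand-in for Kapow patterns; the helper matches and the sample strings are hypothetical.

```python
import re

# Hypothetical helper; Python's re is an assumed stand-in for Kapow patterns.
def matches(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

print(matches(r"3.14", "3x14"))          # True: an unescaped dot is a wild card
print(matches(r"3\.14", "3x14"))         # False: an escaped dot matches only "."
print(matches(r"3\.14", "3.14"))         # True
print(matches(r"a\tb", "a\tb"))          # True: \t matches a tab character
print(matches(r"C:\\temp", "C:\\temp"))  # True: \\ matches a single backslash
```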

Sets

Sometimes we want to match a character against just one of a set of characters. A set of characters is written in a pattern using '[]' (brackets). An example is '[abc]' (in other words, the set of a, b, c). This matches either "a", "b", or "c", but no other character.

If you wish to include a range of characters in a set, you can do so using a '-' (dash or hyphen). '[abc]' (the set of "a", "b", or "c") can therefore be written as '[a-c]' (the set of characters a through c). In words, '[a-c]' means match any character in the range from "a" to "c". The two ways of defining sets can be combined to get something like '[a-dkx-z]' (the set of a through d, k, and x through z), which is the same as writing '[abcdkxyz]' (writing out all those characters in a set), or saying match any character which is either in the range "a" to "d", is "k", or is in the range "x" to "z".

It is also possible to define sets negatively by using '[^]' (a caret at the beginning of the set). An example is '[^a-c]' (the negative set of a through c) which will match any one character excluding "a", "b", and "c".
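
Sets, ranges, and negated sets can be checked with a short sketch. As before, this assumes Python's re module as a stand-in for the Kapow pattern engine, and the helper matches is hypothetical.

```python
import re
import string

# Hypothetical helper; Python's re is an assumed stand-in for Kapow patterns.
def matches(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

print(matches("[abc]", "b"))      # True
print(matches("[abc]", "d"))      # False
print(matches("[abc]", "ab"))     # False: a set matches exactly one character
print(matches("[a-dkx-z]", "k"))  # True
print(matches("[a-dkx-z]", "e"))  # False
print(matches("[^a-c]", "z"))     # True
print(matches("[^a-c]", "a"))     # False

# The range form and the written-out form agree on every lowercase letter.
same = all(
    matches("[a-dkx-z]", ch) == matches("[abcdkxyz]", ch)
    for ch in string.ascii_lowercase
)
print(same)  # True
```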

In the Pattern Editor, try using sets to match (1) any digit (2) any whitespace character or (3) anything that is not a digit. You can pause the video if you want to take a moment to think about these problems before seeing the answers.

There are shorthand forms for sets that are used often. Here is a table showing some of the most important ones.

Shorthand form    Set
'\d'              '[0-9]' (Any digit)
'\D'              '[^0-9]' (Any non-digit)
'\s'              '[ \n\r\t]' (Any whitespace character)
'\S'              '[^ \n\r\t]' (Any non-whitespace character)
'\w'              '[a-zA-Z0-9_]' (Any word character)
'\W'              '[^a-zA-Z0-9_]' (Any non-word character)

We can, for example, match a digit with '\d', a non-digit with '\D', a whitespace character with '\s', a non-whitespace character with '\S', a word character with '\w', and a non-word character with '\W'. In the right-hand column, you can see which set each shorthand form corresponds to.
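
The correspondence between each shorthand form and its set can be sketched as a loop over sample characters. This assumes Python's re module as a stand-in for Kapow patterns, and the helper matches is hypothetical. One known difference: Python's '\s' also accepts form feed and vertical tab, so those two characters are left out of the sample.

```python
import re
import string

# Hypothetical helper; Python's re is an assumed stand-in for Kapow patterns.
def matches(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

# ASCII sample characters (form feed and vertical tab deliberately excluded).
sample = string.ascii_letters + string.digits + string.punctuation + " \n\r\t"

pairs = [
    (r"\d", r"[0-9]"),
    (r"\D", r"[^0-9]"),
    (r"\s", r"[ \n\r\t]"),
    (r"\S", r"[^ \n\r\t]"),
    (r"\w", r"[a-zA-Z0-9_]"),
    (r"\W", r"[^a-zA-Z0-9_]"),
]

# For each pair, check that both forms agree on every sample character.
agreement = {
    short: all(matches(short, ch) == matches(long, ch) for ch in sample)
    for short, long in pairs
}
for short, agrees in agreement.items():
    print(short, agrees)  # every line ends in True
```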

Note: The shorthand form can also be used inside a set. For example, '[\d\s]' is the set of all digits and all whitespace characters.

Subpatterns

Next we will need to talk about subpatterns within patterns. Terms we have talked about so far such as a character 'a', a set of characters '[abc]', an escaped character '\d' or the wildcard '.' can each be seen as a subpattern. Alternatively, we can create our own subpatterns by grouping together other subpatterns by using '()'. We could, for example, create a subpattern from '[ctb]an' by writing '([ctb]an)'.

It is important to recognize subpatterns, since I will now introduce some operators which work on the entire subpattern they follow.

Repetition Operators

Repetition operators allow us to match repetitions of a subpattern by following the subpattern with one of the operators given in the table.

Repetition Operator     Meaning
'{m,n}' where n ≥ m     Matches between m and n repetitions (inclusively) of the preceding subpattern.
'{m,}'                  Matches m or more repetitions of the preceding subpattern.

The repetition operator is given on the left and the meaning of this is given on the right. With the first operator, we can match between m and n repetitions. With the second, we can match m or more repetitions.

For example, the pattern 'a{1,}' ('a' repeated one or more times) would match the string "a", "aa", "aaa", or any number of repetitions of "a". The pattern '([bn]a){3,3}' is a bit more complex, but it would match "banana", "babana", "nabana", or any other string made up of either "b" or "n" followed by "a", repeated three times. Try it out for yourself.
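
The two repetition examples can be sketched as follows, assuming Python's re module as a stand-in for the Kapow pattern engine. The helper matches is hypothetical, and the extra test strings "bana" and "bananas" are added only to show where the match fails.

```python
import re

# Hypothetical helper; Python's re is an assumed stand-in for Kapow patterns.
def matches(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

print(matches("a{1,}", "a"))     # True
print(matches("a{1,}", "aaaa"))  # True
print(matches("a{1,}", ""))      # False: at least one "a" is required

# The operator applies to the whole parenthesized subpattern ([bn]a).
for word in ["banana", "babana", "nabana", "bana", "bananas"]:
    print(word, matches("([bn]a){3,3}", word))
# banana, babana and nabana give True; bana (two repetitions) and
# bananas (a trailing "s" is left over) give False.
```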

As with sets, there are also shorthand versions of the most useful repetition operators: an exact number of repetitions, 0 or 1 instance of a subpattern, any number of repetitions of a subpattern, and finally 1 or more repetitions of a subpattern. Both shorthand and longhand versions are shown in this table.

Shorthand operator    Corresponds to
'{m}'                 '{m,m}'
'?'                   '{0,1}'
'*'                   '{0,}'
'+'                   '{1,}'

Try using what we have learned so far to match (4) anything, (5) either "color" spelled without a "u" or "colour" spelled with a "u", and (6) any four-digit number.

One often-used pattern is '.*', which matches anything: any string, even an empty one.
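
The shorthand repetition operators and their longhand forms can be compared side by side. This is a sketch assuming Python's re module as a stand-in for Kapow patterns; the helper matches and the sample strings built from "a", "b", and "c" are hypothetical.

```python
import re

# Hypothetical helper; Python's re is an assumed stand-in for Kapow patterns.
def matches(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

samples = ["ac", "abc", "abbc", "abbbc"]
pairs = [
    ("ab{2}c", "ab{2,2}c"),  # exactly two
    ("ab?c", "ab{0,1}c"),    # zero or one
    ("ab*c", "ab{0,}c"),     # zero or more
    ("ab+c", "ab{1,}c"),     # one or more
]

# Each shorthand should accept exactly the same samples as its longhand.
equivalent = all(
    matches(short, s) == matches(long, s)
    for short, long in pairs
    for s in samples
)
print(equivalent)             # True

print(matches("ab?c", "ac"))  # True
print(matches("ab+c", "ac"))  # False: "+" needs at least one "b"
print(matches(".*", ""))      # True: '.*' matches even the empty string
```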

Now try extending this a bit. Find patterns that match (7) any text containing at least one digit and (8) any text containing exactly one digit. Here is a list of the syntax you may need (video only).

The syntax used in the answers is very useful when matching specific subpatterns within a string.

Alternative Subpatterns

We discussed how to match alternative characters earlier, but what about matching alternative subpatterns? If we have N subpatterns 'p1' through 'pN', we can match any one of these subpatterns using '(p1|p2|…|pN)' (parentheses enclose the alternatives and vertical bars separate them). The pattern given here, '(abc|a{5}|\d)', would, for example, match either "abc", "aaaaa", or any single digit.
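
The alternatives in '(abc|a{5}|\d)' can be sketched like this, assuming Python's re module as a stand-in for the Kapow pattern engine. The helper matches is hypothetical, and the failing inputs are added for contrast.

```python
import re

# Hypothetical helper; Python's re is an assumed stand-in for Kapow patterns.
def matches(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

pattern = r"(abc|a{5}|\d)"

print(matches(pattern, "abc"))    # True: first alternative
print(matches(pattern, "aaaaa"))  # True: second alternative
print(matches(pattern, "7"))      # True: third alternative
print(matches(pattern, "42"))     # False: \d is a single digit
print(matches(pattern, "aaaa"))   # False: only four repetitions of "a"
```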

Try using alternative subpatterns to make a pattern that matches (9) a string which does not contain exactly one digit. Here, again, is the syntax you might need.

There is no "not" operator in the syntax; instead, the answer uses two alternatives. The first alternative matches a string with no digits. The second matches any string containing at least two digits.

Subpattern References

The last major part of the syntax to cover is subpattern references. Any substring, "s1" through "sN", matched by a parenthesized subpattern, '(p1)' through '(pN)', in a pattern can be referenced by using '\1' through '\N', where the subpatterns are numbered from left to right in the order they appear in the pattern. For example, when matching the pattern '([chm])(at)' to "cat", we could use the reference '\1' to refer to "c" and '\2' to refer to "at".

The string matched by the entire pattern can always be referenced by '\0'.

Notice here that we are referring to the string matched by that subpattern rather than the subpattern itself. A reference to the subpattern '(abc)' would of course yield 'abc' whereas a reference to the subpattern '(\d)' would only match whatever digit was matched by the original subpattern.
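
The numbering of subpatterns can be sketched with the '([chm])(at)' example. This assumes Python's re module as a stand-in for Kapow patterns; in Python the matched substrings are read with group(0), group(1), and group(2) rather than '\0', '\1', and '\2'. The second pattern, '([ab])\1', is a hypothetical example of a reference used inside a pattern.

```python
import re

# Assumption: Python's re approximates Kapow patterns for these cases.
m = re.fullmatch(r"([chm])(at)", "cat")
whole, first, second = m.group(0), m.group(1), m.group(2)
print(whole)   # cat  (the entire match)
print(first)   # c    (matched by the first subpattern)
print(second)  # at   (matched by the second subpattern)

# A reference repeats the matched STRING, not the subpattern itself.
def doubled(text: str) -> bool:
    return re.fullmatch(r"([ab])\1", text) is not None

print(doubled("aa"))  # True
print(doubled("bb"))  # True
print(doubled("ab"))  # False: \1 must repeat the "a" that was matched
```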

As an example, consider matching a string containing a quote by using the pattern '.*(['"]).*\1.*' (anything followed by a single or double quote followed by anything followed by a reference followed by anything). This may look confusing but the only thing you really need to notice is that the reference will match the same type of quote which was matched by the subpattern. In other words, this pattern would match both the string He said "hello" with double quotes and He said 'hello' with single quotes. (I have purposefully not quoted the two strings here to avoid confusion.)

As I will show you later in Design Studio, subpatterns can also be referred to in certain expressions outside of patterns. This is useful when extracting certain parts of a matched string. Taking our quotes example, we could add parentheses around the subpattern enclosed by quotes '.*(['"])(.*)\1.*'. Now we are able to extract the quote in Design Studio.
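
The quote example with the extra pair of parentheses can be sketched as follows. This assumes Python's re module as a stand-in for the Kapow pattern engine; the function name extract_quoted and the sample sentences are hypothetical.

```python
import re
from typing import Optional

# Subpattern 1 captures the quote character, subpattern 2 the quoted text.
# Assumption: Python's re approximates Kapow patterns for this case.
QUOTE_PATTERN = r""".*(['"])(.*)\1.*"""

def extract_quoted(text: str) -> Optional[str]:
    m = re.fullmatch(QUOTE_PATTERN, text, flags=re.DOTALL)
    return m.group(2) if m else None

print(extract_quoted('He said "hello" today'))  # hello
print(extract_quoted("He said 'hello' today"))  # hello
print(extract_quoted("He said \"hello' today")) # None: the quotes differ
```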

Here is another problem. Try using subpattern references to match (10) four of the same digit (11) a string where at least two characters are the same.

Fewer Repetitions

When using subpattern references, it is handy to know the following. By default, the repetition pattern operators (*, +, {...}) will match as many repetitions of the preceding pattern as possible. You can put a "?" after a repetition operator to instead make it match as few repetitions as possible.

(12) Try matching a subpattern to the first occurrence of a digit in a string.

Removing '?' would result in matching the subpattern to the last occurrence of a digit in the string.
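
The effect of adding or removing '?' can be sketched with a string that contains several digits. This assumes Python's re module as a stand-in for Kapow patterns; the function names and the sample string "a1b2c3" are hypothetical.

```python
import re

# Assumption: Python's re approximates Kapow patterns for this case.
def first_digit(text: str) -> str:
    # '.*?' matches as few characters as possible before the digit.
    return re.fullmatch(r".*?(\d).*", text, flags=re.DOTALL).group(1)

def last_digit(text: str) -> str:
    # '.*' matches as many characters as possible before the digit.
    return re.fullmatch(r".*(\d).*", text, flags=re.DOTALL).group(1)

print(first_digit("a1b2c3"))  # 1
print(last_digit("a1b2c3"))   # 3
```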

Using Patterns in Design Studio

Now that we have learned the syntax of patterns, it is time to look at the various use-cases in Design Studio.

Conditions

Creating conditions is the first way of using patterns intelligently in robots. The Test Tag step action is particularly relevant in this context, so let's go over a common use case.

Here, I have a robot which extracts from LinkedIn all engineering jobs they have listed for Denmark. The robot uses a loop to extract the URL, title, and company name from each job and return them to the user. But let's say I only want to extract from jobs which contain the words "Copenhagen" and "software", indicating that they are probably looking for a software engineer in Copenhagen.

First, I insert a new step after the For Each step and assign to it the Test Tag action. I ensure that the tag finder finds the entire job post of the current iteration of the loop. Then, I iterate through the loop until I find a job offering which matches the criteria I am about to set. This makes it easier to test that the pattern I write will actually work.

Going to the action tab in the step view, I first choose to match against text only (not the entire HTML), then I press edit on the pattern. I am now in the Pattern Editor and I can type a pattern to be matched. Since I do not know the order in which the two words "software" and "Copenhagen" might occur, I need to make two alternative subpatterns. In the first alternative, I have Copenhagen followed by anything followed by software. In the second alternative, I write the same but in reverse order. Finally, I add "any text" before and after the alternatives and press Ctrl+Enter to test whether the pattern matches. It matches!
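
The pattern itself is only described in words above, so the following is my reading of it, written out as a sketch. It assumes Python's re module as a stand-in for the Kapow pattern engine; the constant and function names and the sample job texts are hypothetical. Note that this sketch is case-sensitive in Python, so "Software" with a capital "S" would not match.

```python
import re

# "any text", then either order of the two words, then "any text".
# Assumption: this is a reconstruction of the pattern described above.
JOB_PATTERN = r".*(Copenhagen.*software|software.*Copenhagen).*"

def is_relevant(job_text: str) -> bool:
    return re.fullmatch(JOB_PATTERN, job_text, flags=re.DOTALL) is not None

print(is_relevant("Senior software engineer, Copenhagen"))                  # True
print(is_relevant("We are hiring in Copenhagen.\nYou will build software."))  # True
print(is_relevant("Mechanical engineer, Aarhus"))                           # False
print(is_relevant("Embedded software engineer, Aarhus"))                    # False
```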

I close the Pattern Editor and set the Test Tag step to Skip the Following Steps if the Pattern Does Not Match the Found Tag. This way the job post will be skipped if it does not contain the two words specified.

I now go ahead and run the robot in Debug Mode. As expected, only a few results are extracted, and they should all contain the two words "software" and "Copenhagen".

Tag Finders

Patterns can also be used in tag finders. This can be very useful if you know the structure of the information you are looking for but you do not know where on the page it is located. This robot, for example, goes to multiple different sites to extract the price of a certain pair of headphones. Since we cannot know where on the page to find the price, patterns play a crucial role in determining exactly this.

Let me show you how to set up the extraction step. I will delete the one I already have, insert a new step and choose for it the Extract action. To configure the step, I start by inserting a number converter which extracts the number from any text I might extract. Then, I choose to extract into the price attribute of the variable I have made for this robot.

Going to the Finders tab in the step view, I click the plus to add a Tag Finder. I locate the price on the page. I can see that it is isolated in its own tag, with nothing else in that tag. This is typical, so we will let our pattern match this case. In the Finders view, there is a field called Tag Pattern. To start with, we can write the pattern '\$[\d\.]+' (dollar sign followed by one or more digits or dots). The pattern is designed to match any tag containing only a dollar sign followed by a decimal number. I click the magnifying glass in the upper right corner of the page view, which shows me what the Tag Finder finds. Unfortunately, it finds the cart balance instead of the headphone price. The cart balance will always be $0 for these kinds of sites, so to avoid this mistake, I will make sure that the first digit in my tag is not a zero. Fortunately, the steep price of headphones ensures that the price will never start with a zero. Rewriting the pattern, I get '\$[1-9][\d\.]+' (dollar sign followed by a digit which is not a zero followed by one or more digits or dots). This finds the correct price on the page when I click the magnifying glass.
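
The two tag patterns can be compared on a few sample tag texts. This is a sketch assuming Python's re module as a stand-in for Kapow patterns; the helper matches and the sample prices are hypothetical.

```python
import re

# Hypothetical helper; Python's re is an assumed stand-in for Kapow patterns.
def matches(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

first_try = r"\$[\d\.]+"
refined = r"\$[1-9][\d\.]+"

print(matches(first_try, "$0"))            # True: the cart balance is found
print(matches(first_try, "$129.99"))       # True
print(matches(refined, "$0"))              # False: first digit may not be 0
print(matches(refined, "$129.99"))         # True
print(matches(refined, "Price: $129.99"))  # False: the tag must hold only the price
print(matches(refined, "$5"))              # False: '+' needs a character after the first digit
```

As the last line shows, the refined pattern requires at least two characters after the dollar sign, which fits the assumption that the price is never a single digit.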

Before testing the robot, I go to the error handling tab of the Extract step and choose to Ignore and Continue on error. If the Tag Finder fails to find the price on the page, it should just return the default value of the price attribute which I have set to -1. This gives me a clear indication that the robot was not able to find the price. Going to Debug Mode and looking at the results from an earlier execution of this robot, we see that many of the prices are extracted correctly. The method is of course flawed but it can be surprisingly effective at times.

Data Conversion

The final use for patterns is to convert data from one form into another. For this we can either use one of the data converter lists embedded in a step or use the dedicated Convert Variables step.

In this very simple example, I am extracting the author and date from a blog post. Unfortunately, the two pieces of information are contained in the same string of text and are therefore extracted together by the extract step. I will now show you how to separate these two pieces of information using patterns in data converters.

The extract step has a data converter list located in the step action view. The data converter list can be used to convert the extracted text before it is assigned to a variable. I click the plus and choose Extract to insert a data converter which can extract part of the string. A new window opens where I can configure the Extract data converter. At the top, there is a pattern, and at the bottom, there is a test input and a test output similar to those of the Pattern Editor. The idea with the Extract converter is to write a pattern which matches the entire input string, and then specify the subpattern to be extracted by using parentheses. By default, the entire string is matched and extracted, resulting in identical input and output strings.

If I want to exclude some of the extracted string, I just have to write it outside of the subpattern. Let me precede the subpattern with '.* by ' (any text followed by "space", b, y, "space"). Now the entire string is still matched, but only the name of the author will be part of the substring, and therefore the author's name will be extracted as shown in the Test Output field. The plain text ' by ' forces the two instances of '.*' (any text) to match the date and the author name respectively.
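
The idea behind the Extract converter (match the whole string, keep only the parenthesized part) can be sketched like this. It assumes Python's re module as a stand-in for the Kapow pattern engine; the function name and the sample input string are hypothetical, since the actual blog text is not shown here.

```python
import re
from typing import Optional

# The whole string must match; only the parenthesized subpattern is kept.
# Assumption: Python's re approximates the Extract converter's behavior.
def extract_author(text: str) -> Optional[str]:
    m = re.fullmatch(r".* by (.*)", text, flags=re.DOTALL)
    return m.group(1) if m else None

print(extract_author("12 March 2014 by Jane Doe"))  # Jane Doe
print(extract_author("12 March 2014"))              # None: " by " is missing
```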

I can now close the configuration window and execute the extract step. The author name is now correctly assigned to my variable.

Let me go back to the extract step and quickly demonstrate another converter which uses patterns. I remove Extract and add the Advanced Extract converter instead. Then, I write the same pattern as I used before except that I make subpatterns out of both instances of '.*' (any text). The Test Output is now still the same as the Test Input. This is because Advanced Extract enables me to choose which subpattern I would like to extract by using subpattern references in the Output Expression field.

In expressions, subpattern references are made using the '$' symbol followed by the reference number. Right now, the expression refers to the entire matched pattern but if I change it to '$1' I only get the first subpattern, extracting the date, and if I write '$2' I only get the author name which is matched by the second subpattern.
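
The choice between the whole match, '$1', and '$2' can be sketched as follows. This assumes Python's re module as a stand-in for Kapow patterns, with group(0), group(1), and group(2) playing the role of the references in the Output Expression field. The sample input string is hypothetical.

```python
import re

# Both instances of '.*' are now subpatterns.
# Assumption: Python's re approximates the Advanced Extract converter.
m = re.fullmatch(r"(.*) by (.*)", "12 March 2014 by Jane Doe", flags=re.DOTALL)

entire = m.group(0)  # the entire matched string
date = m.group(1)    # corresponds to '$1'
author = m.group(2)  # corresponds to '$2'

print(entire)  # 12 March 2014 by Jane Doe
print(date)    # 12 March 2014
print(author)  # Jane Doe
```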

Note that it is also possible to add text, combine subpatterns, and do simple string manipulation using the expressions field. For example, I could write an expression which recombines the two substrings but in reverse order. For more information on expressions click the question mark next to the expressions field.

Finally, I would also like to recommend the Replace Pattern data converter, which replaces instances of a specified pattern in a string.

Those were the final words on patterns. Feel free to review any parts of the video you found useful or go to help.kapowsoftware.com to find even more answers.

Answers to Problems

Problem Number    Answer
1                 '[0-9]'
2                 '[ \n\r\t]'
3                 '[^0-9]'
4                 '.*'
5                 'colou?r'
6                 '\d{4}'
7                 '.*\d.*'
8                 '\D*\d\D*'
9                 '(\D*|.*\d.*\d.*)'
10                '(\d)\1{3}'
11                '.*(.).*\1.*'
12                '.*?(\d).*'