A regular expression (regex or regexp for short) is a special text string that describes a search pattern. Scanshare uses regular expressions in several functions to match a pattern against values. Regular expressions can be used in the following places in Scanshare:
• Question: it is possible to define a regexp that the value entered for the question must match;
• Barcode recognition: it is possible to define a regexp to select a particular barcode on the document when multiple barcodes of the same type are present;
• Condition: it is possible to define a regexp to validate a trigger condition value;
• Variables: it is possible to define a regexp to extract part of a workflow variable value;
• Zone OCR: it is possible to define a regexp to validate the text of a particular zone.
Regexp is a very powerful language with a specific syntax that must be respected in order to create a valid regular expression. It is also a widely adopted standard, so plenty of information about it is available elsewhere.
Literal Characters
The most basic regular expression consists of a single literal character, e.g.: a. It will match the first occurrence of that character in the string. If the string is Jack is a boy, it will match the a after the J.
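As a rough illustration of the literal match described above, here is a minimal sketch using Python's standard re module. Python is used here only for demonstration; the regex engine inside Scanshare may differ in details.

```python
import re

# Search for the literal character "a" in the sample sentence.
text = "Jack is a boy"
match = re.search(r"a", text)

# The first "a" sits at index 1, right after the "J".
print(match.group(), match.start())  # a 1
```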
Special Characters
There are 11 characters with special meanings:
• the opening square bracket [,
• the backslash \,
• the caret ^,
• the dollar sign $,
• the period or dot .,
• the vertical bar or pipe symbol |,
• the question mark ?,
• the asterisk or star *,
• the plus sign +,
• the opening round bracket ( and the closing round bracket ).
These special characters are often called “metacharacters”.
If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.
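The effect of escaping can be sketched in Python with the standard re module (an illustration only; Scanshare's own engine is not assumed to behave identically). The call to re.escape is a Python convenience for escaping literal text automatically, not a Scanshare feature.

```python
import re

text = "1+1=2"

# Unescaped, "+" is a quantifier: the pattern means "one or more 1s, then 1=2".
unescaped = re.search(r"1+1=2", text)

# Escaped, "\+" matches a literal plus sign.
escaped = re.search(r"1\+1=2", text)

# re.escape adds the required backslashes to a literal string.
auto_pattern = re.escape(text)
auto_escaped = re.fullmatch(auto_pattern, text)

print(unescaped)        # None
print(escaped.group())  # 1+1=2
```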
Non-printable characters
You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B). Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.
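A small sketch of these sequences in Python's re module (used for illustration only; the sample text is invented). It splits text that mixes Windows and UNIX line endings, then splits each line on tabs.

```python
import re

# Sample text with tabs and both Windows (\r\n) and UNIX (\n) line endings.
text = "name\tvalue\r\nfirst\tline\nlast\tline"

# \r?\n matches a line feed optionally preceded by a carriage return.
lines = re.split(r"\r?\n", text)

# \t matches the tab character that separates the two columns.
rows = [re.split(r"\t", line) for line in lines]

print(rows)  # [['name', 'value'], ['first', 'line'], ['last', 'line']]
```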
Character Classes
With a “character class”, also called “character set”, you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey.
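The gr[ae]y pattern can be tried out with a short Python sketch (Python's re module is an assumption made for demonstration; the sample sentence is made up).

```python
import re

text = "The gray cat and the grey dog; greay and graey are typos."

# [ae] matches exactly one character, either "a" or "e".
matches = re.findall(r"gr[ae]y", text)

print(matches)  # ['gray', 'grey']
```

The misspellings are not matched because a character class consumes exactly one character.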
Character Class Abbreviations
• \d Match any character in the range 0 – 9
• \D Match any character NOT in the range 0 – 9
• \s Match any whitespace character (space, tab, etc.)
• \S Match any character that is NOT whitespace
• \w Match any character in the range 0 – 9, A – Z and a – z, or the underscore _
• \W Match any character NOT in the range 0 – 9, A – Z and a – z, and not the underscore _
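The abbreviations can be compared side by side in a short Python sketch. Two assumptions apply: Python's re module stands in for whatever engine Scanshare uses, and the re.ASCII flag is passed so that \d and \w are limited to ASCII characters (by default Python also accepts Unicode digits and letters). Note that \w also matches the underscore.

```python
import re

text = "Order 66, item_7 shipped"

# \d: single digits.
digits = re.findall(r"\d", text, flags=re.ASCII)

# \w+: runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", text, flags=re.ASCII)

# \S+: runs of anything that is not whitespace, so punctuation is kept.
non_space = re.findall(r"\S+", text)

print(digits)     # ['6', '6', '7']
print(words)      # ['Order', '66', 'item_7', 'shipped']
print(non_space)  # ['Order', '66,', 'item_7', 'shipped']
```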
Dot
In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter.
The dot matches a single character, without caring what that character is. The only exceptions are newline characters.
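A brief Python sketch of the dot's behaviour, including the newline exception. The re.DOTALL flag shown here is Python's way to lift that exception; whether Scanshare offers an equivalent option is not assumed.

```python
import re

# The dot matches any single character except a newline.
default_matches = re.findall(r"c.t", "cat cot c-t c\nt")

# With re.DOTALL the dot also matches newline characters.
dotall_matches = re.findall(r"c.t", "c\nt", flags=re.DOTALL)

print(default_matches)  # ['cat', 'cot', 'c-t']
print(dotall_matches)   # ['c\nt']
```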
Anchors
Thus far, we have explained literal characters and character classes. In both cases, putting one in a regex will cause the regex engine to try to match a single character.
Anchors are a different breed. They do not match any character at all. Instead, they match a position before, after or between characters. They can be used to “anchor” the regex match at a certain position. The caret ^ matches the position before the first character in the string. Applying ^a to abc matches a. ^b will not match abc at all, because the b cannot be matched right after the start of the string, matched by ^.
Similarly, $ matches right after the last character in the string. c$ matches c in abc, while a$ does not match at all.
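The anchor examples above can be reproduced with a minimal Python sketch (Python's re module is used for illustration only). The last two lines show a typical validation use: anchoring both ends forces the whole value to match. The digit-only pattern is an invented example.

```python
import re

text = "abc"

start_a = re.search(r"^a", text)  # matches "a" at the start
start_b = re.search(r"^b", text)  # None: "b" is not at the start
end_c = re.search(r"c$", text)    # matches "c" at the end
end_a = re.search(r"a$", text)    # None: "a" is not at the end

# Anchoring both ends validates the entire value.
all_digits = re.search(r"^\d+$", "12345")  # matches
not_digits = re.search(r"^\d+$", "123a5")  # None

print(start_a.group(), start_b, end_c.group(), end_a)  # a None c None
```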
Word boundaries
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.
There are three different positions that qualify as word boundaries:
• Before the first character in the string, if the first character is a word character.
• After the last character in the string, if the last character is a word character.
• Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words, that is, a character matched by \w. All characters that are not “word characters” are “non-word characters”.
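The difference between a "whole words only" search and a plain search can be sketched in Python (re module used for illustration; the sample sentence is made up).

```python
import re

text = "The cat sat; concatenate cats."

# With word boundaries only the standalone word is found.
whole_words = re.findall(r"\bcat\b", text)

# Without them, "cat" is also found inside "concatenate" and "cats".
anywhere = re.findall(r"cat", text)

print(whole_words)  # ['cat']
print(anywhere)     # ['cat', 'cat', 'cat']
```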
Alternation
You can use alternation to match a single regular expression out of several possible regular expressions.
If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe symbol: cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.
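A minimal Python sketch of the alternation above, using the re module for demonstration and an invented sample sentence.

```python
import re

text = "I have a dog, a cat and a goldfish."

# Each position is tried against the alternatives from left to right.
animals = re.findall(r"cat|dog|mouse|fish", text)

print(animals)  # ['dog', 'cat', 'fish']
```

Note that fish is found inside goldfish; combine the alternation with word boundaries and grouping if only whole words should match.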
Optional items
The question mark makes the preceding token in the regular expression optional. E.g.: colou?r matches both colour and color.
You can make several tokens optional by grouping them together using round brackets, and placing the question mark after the closing bracket. E.g.: Nov(ember)? will match Nov and November.
You can write a regular expression that matches many alternatives by including more than one question mark. Feb(ruary)? 23(rd)? matches February 23rd, February 23, Feb 23rd and Feb 23.
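The four variants accepted by the date pattern can be checked with a short Python sketch. The re module is used for illustration, and the candidate list, including the deliberately wrong "Febr 23", is made up.

```python
import re

pattern = re.compile(r"Feb(ruary)? 23(rd)?")

candidates = ["February 23rd", "February 23", "Feb 23rd", "Feb 23", "Febr 23"]

# fullmatch requires the whole candidate to fit the pattern.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```

Only "Febr 23" is rejected, because the optional group must match "ruary" completely or not at all.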
Repetition
Three characters are used for repetition:
• The question mark tells the engine to attempt to match the preceding token zero times or once, in effect making it optional.
• The asterisk or star tells the engine to attempt to match the preceding token zero or more times.
• The plus tells the engine to attempt to match the preceding token once or more.
<[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes.
It is also possible to specify how many times a token can be repeated. The syntax is {min,max}, where min is zero or a positive integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.
You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries.
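The repetition patterns from this section can be exercised with the following Python sketch (re module used for illustration; the sample strings are invented).

```python
import re

numbers = "Values: 999, 1000, 4711, 9999, 10000, 0123"

# Exactly three digits after a non-zero first digit: 1000 to 9999.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", numbers)

# Two to four digits after a non-zero first digit: 100 to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", numbers)

# The star allows zero or more characters after the first letter of the tag name.
html = "<p>Hello <b>world</b> <a href='x'>"
tags = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", html)

print(four_digits)    # ['1000', '4711', '9999']
print(three_to_five)  # ['999', '1000', '4711', '9999', '10000']
print(tags)           # ['<p>', '<b>']
```

The word boundaries keep 10000 and 0123 out of the first result, since neither contains a standalone four-digit number.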
Grouping
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire group. Round brackets were already used for this purpose in the Optional items section.
Note that only round brackets can be used for grouping. Square brackets define a character class, and curly braces are used by a special repetition operator.
The regex Set(Value)? matches Set or SetValue. In the first case, the first backreference will be empty, because it did not match anything. In the second case, the first backreference will contain Value.
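A minimal Python sketch of the Set(Value)? example. One difference is worth flagging: Python's re module reports a group that did not take part in the match as None rather than as an empty string, and other engines, possibly including Scanshare's, may behave differently.

```python
import re

pattern = re.compile(r"Set(Value)?")

short = pattern.fullmatch("Set")
full = pattern.fullmatch("SetValue")

# Group 1 is the content captured by the first pair of round brackets.
print(short.group(0), short.group(1))  # Set None
print(full.group(0), full.group(1))    # SetValue Value
```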
Examples
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
Matches any IPv4 address, but also matches invalid addresses such as 999.999.999.999, because each group accepts any one to three digits.
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$
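Basic regexp to match an email address. Because the character classes list only uppercase letters, it must be applied with case-insensitive matching to accept lowercase addresses.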
^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$
Matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31, with a choice of four separators (hyphen, space, slash or dot).
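The three example patterns can be tried out with the Python sketch below. Several parts are assumptions made for illustration: Python's re module stands in for Scanshare's engine, the re.IGNORECASE flag is added so the email pattern accepts lowercase input, the sample values are invented, and is_valid_ipv4 is a hypothetical helper that adds the range check the IP pattern cannot express.

```python
import re

IP = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")
EMAIL = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$", re.IGNORECASE)
DATE = re.compile(
    r"^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$"
)


def is_valid_ipv4(text):
    """Hypothetical helper: regex shape check plus a 0-255 range check."""
    if not IP.fullmatch(text):
        return False
    return all(int(part) <= 255 for part in text.split("."))


# The regex alone accepts 999.999.999.999; the helper rejects it.
print(bool(IP.fullmatch("999.999.999.999")))  # True
print(is_valid_ipv4("999.999.999.999"))       # False
print(is_valid_ipv4("192.168.0.1"))           # True

# Lowercase addresses match only because of re.IGNORECASE.
print(bool(EMAIL.search("john.doe@example.com")))  # True

# The date pattern checks the shape and ranges, not the calendar.
print(bool(DATE.search("2024-02-30")))  # True
print(bool(DATE.search("1899-12-31")))  # False
```

As the last lines show, the date pattern still accepts impossible dates such as 30 February, so a calendar check is needed wherever that matters.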