Regular Expressions

A regular expression, regex is, in theoretical computer science and formal language theory, a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings4F Source - Wikipedia.
EzeScan uses regexes for various tasks from very simple find to complex find/replace functions.

There are books and information available on the internet with regards to Regular Expressions, please refer to them for assistance in creating your required regex.

This section is not aimed at teaching you how to use regexes but to provide a bit of an insight on how regexes may be applied in EzeScan.
EzeScan uses both "Find" and "Find/ Replace" regexes.

There are examples of Regex replace examples in the appendices starting on page

Find (Limit) Regex

The Limit Regex option uses a regular expression to match and return specific data from the extracted data.
Find regexes are generally used on the field's "Value tab" (page ) as well as in the Discovery module's "Content Advanced Search" section (page ) and "Skip Content" section (page )

Input Text

This field provides the function of testing what happens when the "find regex" is run.
The operator can enter sample input text into this field in order to test what the output would look like in the Output Text box.
The browse button

allows the import of a test text file to run the test on

Use find regex

Ticking the box  is what will initiate the find regex function.
Clicking on the Editor button will launch a regex editor application; if one is installed on your PC.

It is anticipated that the user has a knowledge of how the associated regex editor works when using this function

Output text

When a test is run on the regex value the results are shown in the Output text field. The test runs against whatever is typed/imported into the Input text field.

Some Simple Find (Limit) Regex Examples

These are a few examples of a Find (limit) regex which will return a value from a text string based on the regex. Each example contains the regex, some test text and the result.

What the regex looks for a 9 digit number in the text string
Regex \b\d{9}\b
Input TextOur ABN number is 123456789.
Result 123456789
Notes:

This has been run on the field's value tab

What An example when using the Infor Pathway (property system) integration together with HPE Content Manager (EDRMS). A string is returned from Pathway which contains a value which has the CM container value in it. We need that container value to save the uploaded document into.
Regex

(?<=(Pathway Container|Pathway Description)\::)[^|]+

Input Text LAP/LAPAPPL/139446

_Pathway Link

Pathway Primary Key::LAP/LAPAPPL/139446

Pathway Container::

Pathway Description::19-COM, 20 Greenhill Road, WAYVILLE SA 5034
Result 19-COM, 20 Greenhill Road, WAYVILLE SA 5034
Notes:

This has been run on the field's value tab
There is also another setting on the value tab which does an extract item #1 using a comma delimiter.
The resulting value applied is 19-COM

What Need to return a value that had {PREFIX} before it and {SUFFIX} after it
Regex (?<={PREFIX})(.*)(?={SUFFIX})
Input Text{Prefix}Our ABN number{Suffix} is 123456789.
Result Our ABN number
Notes:

This has been run on the field's value tab
In this style {PREFIX} and {SUFFIX} are not inclusive meaning we need to match them, but they are not included in the result.
?<= is a positive look behind
?= is a positive look ahead.

Find / Replace Regex

The difference here is that the regex is expected to find values in a text string and replace it with something else.
This option allows to have multiple text entries to be replaced with other words. i.e. a result could have comau, com;au, com.ai. The Replace With feature can fix all these to show com.au

The regex string looks like this -"comau","com.au","com;au","com.au","com,au","com.au","com.ai","com.au"

It is a good way to replace simple things like the letter O with a zero 0 when OCR'ing numbers.

1234O66 will become 1234066
The regex string looks like this - "O","0"

It can also be very complex by locating a value in a block of text, like used in the EzeScan "Discovery" module to locate an invoice number on a scanned document and replacing it with just the invoice number

Invoice 12345 or Inv 12345 or Invoice: 12345 will become 12345
The regex string looks like this…

"(?<=^|\s)((inv(oice?)?|doc(ument)?|tax)(\.)? (n(o|br|umber)|#)?|(tax )?invoice) ?[•.,:; ]{0,4} *",""

A sample list of Find / Replace regexes have been provided on page with a larger set of examples included in the appendices on page .

Input Text	This field provides the function of testing what happens when the "find regex" is run. The operator can enter sample input text into this field in order to test what the output would look like in the Output Text box. The browse button allows the import of a test text file to run the test on
Use input replace regexes	Ticking the box  is what will initiate the find/replace regex function.
*Add your regex*	Click in the Replace field to add your regex "find value" and then in the With column to add your regex "replace value". For example: Replace comau With com.au - adds a full stop into a web address Replace ^([\|]+).*$ With $1 - Keeps the first value where it is delimited with two pipes PO1234	0	1 will become PO1234 If using a text editor to write your regex; use Notepad not MS Word as the double quotes used by MS Word will fail in the regex.MS Word uses ""Notepad uses "" More examples are available in the appendices on page and page
*Clear*	Will clear out all regex values in the Replace / With fields
*Copy*	Will copy the existing regex for use in other KFI field regexes <ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="54145158-9935-4a5e-8551-71e5e170459b"><ac:plain-text-body><![CDATA[* Copies the regex into this format… "-(lu [0-9])",""," ([(][^()]^[)])","","([,]),([^{,]+)","$2$1","} *","" ]]></ac:plain-text-body></ac:structured-macro> Blue text is from the replace column & Red text is from the with column Column values begin and end in double quotes (") and are separated by a comma (,)
*Paste*	Pastes a copied regex into the Requires the format to be like the one shown in the above example When pasted it will appear like the example at right  Pasting a regex requires the formatting to be correct. It must be like this "Replace","With","Replace,"With","Replace","With" etc You must use a start and end quote on each value and separated by commas Blank values must be represented by two quotes "" Always test your pasted regex! Pasting a regex string into an existing regex will "append" the value to the end of the existing string. You may need to consider clearing the existing values first.
Output text	When a test is run on the regex value the results are shown in the Output text field. The test runs against whatever is typed/imported into the Input text field. Running a test on the "pasted" regex above should have this result… Input text - Flintstone, Fred (Mr) -lu 1660 Output text - Fred Flintstone

Some Simple Find / Replace Regex examples

What Keep a numeric value out of a string
Paste this

"^[^0-9](\d+)[^0-9]$",

"$1"
Input Text ONLINE JOB APPLICATION: (2700) Area Coordinator
Output Text2700

What Multiple text entries to be replaced by one value.In this example an OCR job has returned incorrect values for email addresses.We need com.au
Paste this "comau","com.au","com;au","com.au","com,au","com.au","com.ai","com.au"
Input TextWombat@email.comau
Output TextWombat@email.com.au

What Remove the $ symbol from a number (with space between $ and number)
Paste this "\$ *",""
Input Text $_space1.23
Output Text 1.23

What Remove the $ symbol from a number (with no space between $ and number)
Paste this "\$*",""
Input Text $1.23
Output Text 1.23

More Replace Regex Examples

The following Regex codes are rather simple and are commonly used when cleaning up items such as addresses, names etc. There are many websites available which you can visit to learn more about Regex coding and how to apply it to your own EzeScan configurations.

What	Example / outcome	Paste this	Blue = Replace value Red = With Value
Keep the last word in a sentence with a forward slash This regex will find the last word in a string which contains a forward slash. If no forward slash then it returns nothing.	Hello world I am here 1234/789 will become 1234/789; whereas Hello world I am here 1234789 will become nothing (blank)	"^.*?([/ ]/[^/ ])?$","$1"
This regex will remove leading 0's and also the first - e.g. from 0001-23456 - Smith, John	0001-23456 - Smith, John will become 123456 - Smith, John	"^0(\d)-(\d)( - .)","$1$2$3"
Add "$" as a prefix for if there is a value in the field. If there is no value in the field the field will remain blank	1 will become $1	"^(.+)$","$$$1"
Clear out the whole value if it ends with a \|*12345 will become NULL* 1234\5 will not change	"^.*+$","NULL"
Keep the first value where it is delimited with two pipes e.g. PO1234	0	1	*PO1234	0	1* will become PO1234	"^([\|]+).*$","$1"
Remove the first two characters out of a value	ABCDEF will become CDEF	"^..",""
Clear out the value if there is more than one character	ABCDEF will become blank Whilst A will remain the same	"^..+$",""
Remove any words that are in brackets	Smith (MR) will become Smith	" ([(][^()][)])",""
Replace the third and sixth character with a /	12112112 will become 12/12/12	"^(..).(..).","$1/$2/"
Remove multiple commas in a value	1234,5467,,4444 will become 1234,5467,4444	"(\d+)",",$1,","(,\d+,)(?=.*\1)","","^,\|,$","",",,+",","
Clear the value out if it is not numeric	ABC will become blankWhilst ABC123 in field will not change	"^[^0-9]+$",""
Keep the last two words of a value and remove the _ in between them	one_two_three_four will become three four	"^([]){2}","","_"," "
Search for a word (eg batch) & remove it. Will also remove spaces around the word.	a big batch of stuff will become a big of stuff	"\bbatch\b","","\s+"," ","(^\s+\|\s+$)",""
Remove the first word from a string and then add a second word	Taxation 2012 will become 2012 FY	".+ (20[0-9][0-9])","$1 FY"
Remove the second word from a string that is separated by a dash	Fred - was - here will become Fred - here	"[^]+",""
Take the last word in a string and put it in the front	Hello World will become World Hello	"^{(.) ([} ])$","$2 $1"
Clear out a value if it does not start with a date value in this format 99/99/99	Hello World 25/05/15 will become blankWhilst 25/09/15 will remain the same 25/09/15	"^{([0-9][0-9]/[0-9][0-9]/[0-9][0-9].*)$\|}().+$","$1"
Keep the second value in a string separated with a dash	18102594883 - Runners R US Pty Ltd will become Runners R US Pty Ltd	"^([]* - )?([^]+)$","$2"
Remove spaces in a dd/dd/dd type value where there can also be other words in the string (i.e. only removes the space in the date)	12 /12/12 Hello will become 12/12/12 Hello	"(\d\d) ?/(\d\d) ?/(\d\d)","$1/$2/$3"
Convert a HPE Content Manager KFI browse value to just output the first name and last name	Flintstone, Fred (Mr) -lu 1660 will become Fred Flintstone	"-(lu [0-9]+)","", " ([(][^()][)])","", "([^,]),([,])","$2$1","^ *",""
Convert a HPE Content Manager KFI browse value to just output only the last name	Flintstone, Fred (Mr) -lu 1660 will become Flintstone	" -(lu [0-9]+)","" ," ([(][^()][)])","" ,"([^,]),([,])","$1","^ * ",""
Convert a HPE Content Manager KFI browse value to just output only the first name	Flintstone, Fred (Mr) -lu 1660 will become Fred	" -(lu [0-9]+)","" ," ([(][^()][)])","" ,"([^,]),([,])","$2","^ * ",""
Change a numeric value with dashes in a string to remove them.	Hello World 99-999-999 will become Hello World 99999999	"(\d+)-(\d+)-(\d+)","$1$2$3"
Add spaces into a numeric value (eg ABN number formatting)	12345678901 will become 12 345 678 901	"(\d{2})(\d{3})(\d{3})(\d{3})","$1 $2 $3 $4"
Remove any words after the a particular word (eg service)	Car Service 15000 KLMS will become Car Service	"\b(service) .+$","$1"
Keep a numeric value out of a string	ONLINE JOB APPLICATION: (2700) Area Coordinator will become 2700	"^[^0-9](\d+)[^0-9]$","$1"
Extract the first numeric value after the word contact	hello world contact/1234 will become 1234	"^.?contact/(\d+).$","$1"
Clear out a string if it doesn't contain the value "contact" in it	hello world contact 1234 will stay as hello world contact 1234 Whilst hello world 1234 will become blank (NULL)	"^(.contact.)\|().+$","$1"
Add suffix a .00 if the value does not contain one	4 and 4.1 will become 4.00 but 4.10 will stay as 4.10	^(\d+)$,"$1.00","(\.\d)$","${1}0"
Crop the value to only keep the first 20 characters. NOTE - Change the value in {nn} to the required number - includes spaces	Crop the value to only keep will become Crop the value to on	"^(.{20}).*","$1"
Remove the word VIC and any words after it	1 Smith Street Melbourne VIC 3100 will become 1 Smith Street Melbourne	"^(.)\sVIC\s.$","$1"
Remove any of the STATES and any words after it	1 Smith Street Melbourne STATE Postcode etc will become 1 Smith Street Melbourne	"^(.)\sVIC\s.$","$1","^(.)\sNSW\s.$","$1","^(.)\sQLD\s.$","$1","^(.)\sSA\s.$","$1","^(.)\sTAS\s.$","$1","^(.)\sWA\s.$","$1","^(.)\sNT\s.$","$1"
Keep the last five characters from a value NOTE - Change the value in {nn} to the required number - includes spaces	1 Smith Street Melbourne will become ourne	"^.*(.{5})$","$1"
Remove a carriage return out of a value (adds a space)	1 Smith StreetMelbourneVIC3106 will become 1 Smith Street Melbourne VIC 3106	"[\r\n]+"," "
Remove a dash when there is no second number. For example there is start and end street address fields.	1 - 10 Smith St will remain the same 1 - 10 Smith St whilst 1 - Smith St will become 1 Smith St	"(\d+) - ([a-z]{3})","$1 $2"
Keeps anything to the right of a value that contains a string like "AAA/99/9999 - "	DA/123/2104 - Test would become Test	"^{.- ([}-]+)$","$1"
Keep anything after the last dash	one two three - four will become four	"^(.- )+",""
TRIMMING Text - To trim both ends of a string	_spaceBilly Blogs_space would become Billy Blogs	"^ +\| +$",""
TRIMMING Text - To trim just the start	_spaceBilly Blogs would become Billy Blogs	"^ +",""
TRIMMING Text - To trim just the end use (don't forget the space at the front of the pattern)	Billy Blogs_space would become Billy Blogs	" +$",""
Keep the last two words out of a string	one two three four will become three four	"^{.* ([} ]+ [^ ]+)","$1"
Add a dot after the first word in a string of words	one two three will become one.two three	"^([ .]) ([^ ])","$1.$2"
Suffix a 0 on the DD or MM component of a date	1/1/2015 will become 01/01/2015	"/(\d{2})$","/20$1","^(\d)/","0$1/","/(\d)/","/0$1/"
This regex will find the last word in a string which contains a forward slash. If no forward slash then it returns nothing	Hello world I am here 1234/789 will become 1234/789; whereas Hello world I am here 1234789 will become nothing	"^.*?([/ ]/[^/ ])?$","$1"
Remove the $ symbol from a number (with space between $ and number)	$_space1.23 will become 1.23	"\$ *",""
Remove the $ symbol from a number (with no space between $ and number)	$1.23 will become 1.23	"\$*",""
Remove the 1st value when separated by a hyphen	one-two or one - two will become two	"(^[]\s)",""
Clear out the value if it does not end in numeric	Words 123 will remain the sameWhilst Words will become Blank (Nothing)	"^.*[^0-9]$",""
Remove the last 2 digits and hyphen from a 11 number string	11-2222-3333-44 will become 11-2222-3333	"-d{2}$",""
Cleaning up too many hyphens in a word string	The big - - thing will become The big - thing	"\b - - - \b," - ","\b - - \b"," - "

Regex quick reference guide

A quick reference guide for some of the popular and most used regex metadata values.

This is not an exhaustive list but just one which contains some of the most used matadata values which are incorpoarted in regex scripts. It is strongly recommended to research more about regexes on the internet or speak to your local EzeScan representative for assistance.
Metacharacter
Description
.
Matches any single character except new line (\n). For example…

a.c matches "abc", etc.
but

[a.c]

matches only "a", ".", or "c".|
***
Matches the preceding element zero or more times. For example…
ab*c matches "ac", "abc", "abbbc", etc.
[xyz]*

matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on
(ab)* matches "", "ab", "abab", "ababab", and so on|
^
Is used at the start of a string, or start of line in multi-line pattern. For example…

can be used as a replace regex to strip the leading zeros from 000012 to leave 12|

$	Matches the ending position of the string or the position just before a string-ending newline
+	Identifies that there must be one or more of the preceding item
?	Add a ? to a quantifier to make it ungreedy
**	This is an escape character. This is in case you may need to remove a character that is used in regex codes. For example…

a forward slash in a date would need to be represented as 01\/01\/2016

to remove || at the end of a value you can't do ||$ - you need to do ||$|

( )	Defines a group. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group
\n	Matches what the *nth* marked subexpression matched, where n is a digit from 1 to 9
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="e2e3eae3-a2b0-4dfb-815a-99f0def3ed5e"><ac:plain-text-body><![CDATA[	[ ]	A bracket expression. Matches a single character that is contained within the brackets. For example… ]]></ac:plain-text-body></ac:structured-macro>

[abc]
matches "a", "b", or "c"
[a-z]
specifies a range which matches any lowercase letter from "a" to "z"
[A-Z]
specifies a range which matches any uppercase letter from "A" to "Z"
These can be mixed:

[abcx-z]
matches "a", "b", "c", "x", "y", or "z", as does

[a-cx-z]

<span style="color: #0000ff"><ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="5f4f8271-1a11-4b74-a8c2-1dfda5955f7d"><ac:plain-text-body><![CDATA[[0-7]

+matches+ any digit from "0" to "7" \\ \\ The *-* character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets:

[abc-]

,

[-abc]

\\ Note that backslash escapes are +not allowed{+}. \\ The \] character can be included in a bracket expression if it is the first (after the ^) character: \[\]abc\]|]]>

<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="30a70412-9ce0-48ce-98e4-99e2ff27e6ec"><ac:plain-text-body><![CDATA[

[^ ]

Matches a single character that is not contained within the brackets. For example…
]]></ac:plain-text-body></ac:structured-macro>

[^abc]
matches any character other than "a", "b", or "c"
[^a-z]
matches any single character that is not a lowercase letter from "a" to "z"
Likewise, literal characters and ranges can be mixed|
\A
Start of string
\Z
End of string
\<
Start of word
\>
End of word
\b
used to identify the start and end of a word (the word boundary). For example…
\bcar\b would find the word car in the text string "this is my car"|
\B
Not word boundary
\c
Control character
\s
White space (i.e. spaces between words) For example…
^(.)\sVIC\s.$ would remove the "VIC" and any text after it
1 Smith Street Melbourne VIC 3100 will become 1 Smith Street Melbourne|
\S
Not white space
\d
Digit
\D
Not digit
\w
Word
\W
Not word

\A	Start of string
\Z	End of string
\<	Start of word
\>	End of word
\b	used to identify the start and end of a word (the word boundary). For example…

\B	Not word boundary
\c	Control character
\s	White space (i.e. spaces between words) For example…