Regular Expressions (RegEx) are used to identify patterns of text. This article is a basic introduction to RegEx and covers common patterns you might use in Angelfish.
RegEx is used throughout Angelfish to perform tasks like:
- matching a group of log files in a Datasource
- identifying hits to exclude from processing via a Filter
- isolating Pages in a subdirectory
- specifying the Hostname for a Profile
RegEx provides a flexible way to describe what the pattern looks like, using a combination of characters.
QUICK REFERENCE
^ Caret: Match from the beginning of the field
$ Dollar: Match to the end of the field
. Period: Match any single character
| Pipe: OR
* Asterisk: Match zero or more of the previous item
? Question Mark: Match zero or one of the previous item
[] Brackets: Match one item in this list
() Parentheses: Match contents of parentheses as item
+ Plus Sign: Match one or more of the previous item
\ Backslash: Escape symbol for any of the above characters
.* Wildcard - select all
.+ Wildcard - only select non-empty string
ANCHORS: ^ $
Anchors match a specified pattern from the beginning or from the end of a field. The caret and dollar symbols are anchors.
The caret ^ matches a pattern from the beginning of the field. For example:
^/page/ matches the following:
- /page/default.aspx
- /page/comm/2021/files.html
- /page/media/image.png
The following values are not matched:
- /subsite/page/default.aspx (/page/ not at beginning)
- /pages/contact.html (begins with /pages/, not /page/)
The dollar symbol $ matches a pattern to the end of the field. For example:
internal.corp$ matches the following:
- finance.internal.corp
- media.internal.corp
- home.internal.corp
The following values are not matched:
- finance.internal.dev.corp (does not end in internal.corp)
- home.internal.com (does not end in internal.corp)
You can combine anchors in a single pattern - here's an example of a match pattern for a specific Username:
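For instance, a pattern such as ^jsmith$ matches only the exact value jsmith (a hypothetical username chosen for illustration). The sketch below checks this and the earlier anchor examples. It uses Python's re module as a stand-in for the Angelfish RegEx engine, which is an assumption: engines can differ in details.

```python
import re

def matches(pattern, value):
    """Return True if the pattern is found anywhere in the value."""
    return re.search(pattern, value) is not None

# Caret: the field must begin with /page/
starts_with_page = [matches(r"^/page/", v) for v in (
    "/page/default.aspx",          # begins with /page/
    "/subsite/page/default.aspx",  # /page/ appears later in the field
    "/pages/contact.html",         # begins with /pages/, not /page/
)]

# Dollar: the field must end with internal.corp
ends_with_corp = [matches(r"internal.corp$", v) for v in (
    "finance.internal.corp",
    "finance.internal.dev.corp",
    "home.internal.com",
)]

# Both anchors: the whole field must be exactly jsmith (hypothetical username)
exact_user = [matches(r"^jsmith$", v) for v in ("jsmith", "jsmith2", "ajsmith")]

print(starts_with_page)
print(ends_with_corp)
print(exact_user)
```

In each list only the first value matches; the others fail for the reasons given in the comments.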
Common Use Cases for Anchors
- Filters
- Profile Config: Hostname(s), Results Page Stem
- Report Search Field
RANGES: [] ()
RegEx is used to match individual characters, combinations of characters, and ranges of characters.
Brackets [] allow you to specify individual characters that appear in the string. Brackets look at each individual character, not whole strings.
- [agf] matches a or g or f
- [0123] matches 0 or 1 or 2 or 3
Rather than typing individual characters, you can type a range in a bracket. For example:
- [a-z] matches any lowercase letter
- [A-Z] matches any uppercase letter
- [0-9] matches any single digit
- [a-z0-9] matches any lowercase letter or digit
- [a-zA-Z0-9] matches any letter or digit
- [2-4x-z] matches 2 or 3 or 4 or x or y or z
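To see that brackets test one character at a time, the following sketch lists every character that each bracket expression picks out of a sample string. Python's re module is used as an assumed stand-in for the Angelfish engine, and the sample strings are made up.

```python
import re

# [agf] matches exactly one character that is a, g, or f
single_chars = re.findall(r"[agf]", "angelfish")

# [2-4x-z] combines two ranges: the digits 2-4 and the letters x-z
mixed_range = re.findall(r"[2-4x-z]", "x1y2z3w4v5")

# [a-zA-Z0-9] matches any single letter or digit, so punctuation is skipped
alnum = re.findall(r"[a-zA-Z0-9]", "A-1_b!")

print(single_chars)
print(mixed_range)
print(alnum)
```

Each result is a list of single characters, which shows that a bracket expression never matches a whole string in one step.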
Parentheses () allow you to match a string of characters in a specific order, like (blue) and (green).
To match multiple strings, enclose them in parentheses and use a pipe | between each string.
- (blue|green)
- default.(aspx|html)$
- ^/(page|image)/
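The grouped patterns above can be tried against a few sample values. This is a minimal sketch in Python (an assumed stand-in for the Angelfish engine); all paths and file names are hypothetical.

```python
import re

# ^/(page|image)/ : the field must begin with /page/ or /image/
stems = ["/page/index.html", "/image/logo.png", "/video/intro.mp4"]
page_or_image = [s for s in stems if re.search(r"^/(page|image)/", s)]

# default.(aspx|html)$ : the field must end with default.aspx or default.html
files = ["/site/default.aspx", "/site/default.html", "/site/default.php"]
default_docs = [f for f in files if re.search(r"default.(aspx|html)$", f)]

print(page_or_image)
print(default_docs)
```

The pipe only chooses between the strings inside the parentheses; the text outside the group (the slashes, or default.) must still be present.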
Common Use Cases for Ranges
- Datasources
- Filters
- Report Search Field
QUANTIFIERS: ? + *
With RegEx, you can specify the number of times a pattern should occur.
A question mark ? after a character matches zero or one of the previous item, which makes the item optional.
^crawl? matches the following:
- crawl
- craw
- crawfish (the l is optional, making ^craw the match pattern)
(www\.)?website\.com$ matches the following:
- www.website.com
- website.com
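The two optional-item patterns can be checked with a short Python sketch. Python's re module is an assumed stand-in for the Angelfish engine, and the extra values scrawl and www.website.org are added here only as counterexamples.

```python
import re

# ^crawl? : the field must begin with craw, optionally followed by l
words = ["crawl", "craw", "crawfish", "scrawl"]
craw_matches = [w for w in words if re.search(r"^crawl?", w)]

# (www\.)?website\.com$ : the www. prefix is optional as a whole group
hosts = ["www.website.com", "website.com", "www.website.org"]
host_matches = [h for h in hosts if re.search(r"(www\.)?website\.com$", h)]

print(craw_matches)
print(host_matches)
```

Here scrawl is rejected by the caret, and www.website.org is rejected because it does not end in website.com.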
A plus sign + matches one or more of the previous item.
/+ matches one or more consecutive slashes, for example / or // or ///.
.+ is a wildcard that only matches a non-empty string
An asterisk * matches zero or more of the previous item.
.* is a wildcard that matches an empty or non-empty field.
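The difference between + and * shows up most clearly on an empty field. The sketch below uses Python's re module as an assumed stand-in for the Angelfish engine; re.fullmatch is used so that the whole field has to fit the pattern.

```python
import re

# /+ matches each run of one or more slashes
slash_runs = re.findall(r"/+", "/a//b///c")

# .+ needs at least one character; .* also accepts an empty field
plus_on_empty = re.fullmatch(r".+", "") is not None
star_on_empty = re.fullmatch(r".*", "") is not None
plus_on_text = re.fullmatch(r".+", "abc") is not None
star_on_text = re.fullmatch(r".*", "abc") is not None

print(slash_runs)
print(plus_on_empty, star_on_empty, plus_on_text, star_on_text)
```

Only .+ on the empty field fails, which is why .+ is the wildcard to use when a field must contain something.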
Common Use Cases for Quantifiers
- Datasources e.g. /logs/2022/u_ex220[1-6].*
- Filters
- Report Search Field
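As an illustration of the Datasource pattern above, the following Python sketch filters a list of log file paths. The file names are hypothetical and assume a u_exYYMMDD.log naming scheme, under which the pattern would select logs from the first six months of 2022. Python's re module is an assumed stand-in for the Angelfish engine.

```python
import re

pattern = r"/logs/2022/u_ex220[1-6].*"

# Hypothetical log file paths, assuming names of the form u_exYYMMDD.log
log_files = [
    "/logs/2022/u_ex220101.log",  # 2201..: matched by 220[1-6]
    "/logs/2022/u_ex220630.log",  # 2206..: matched by 220[1-6]
    "/logs/2022/u_ex220701.log",  # 2207..: 7 is outside [1-6]
    "/logs/2022/u_ex221201.log",  # 2212..: does not contain 220
]

selected = [f for f in log_files if re.search(pattern, f)]
print(selected)
```

The bracket range restricts a single character position, and the trailing .* accepts whatever follows it.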
ESCAPE SYMBOL: \
Occasionally you'll want to match a character that has a RegEx value. For example:
.com matches the following:
- website.com
- marcom.net (the r is matched by the .)
The backslash \ allows you to escape the value of a RegEx character.
Using the above example, you can escape the RegEx value of the period by adding a backslash, like this:
\.com
This forces a match pattern of ".com" (dot com) instead of "any single character followed by com".
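The effect of the backslash can be confirmed by running both patterns against the two example values. This is a minimal Python sketch, with Python's re module as an assumed stand-in for the Angelfish engine.

```python
import re

values = ["website.com", "marcom.net"]

# Unescaped: the period matches any single character, so "rcom" also matches
unescaped = [v for v in values if re.search(r".com", v)]

# Escaped: the period must be a literal dot
escaped = [v for v in values if re.search(r"\.com", v)]

print(unescaped)
print(escaped)
```

With the escaped pattern, marcom.net no longer matches.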
To match a series of special characters in a row, escape each character individually.
To match a single literal backslash, type two backslashes: the first backslash escapes the RegEx value of the second.
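If you're unsure whether a punctuation character has a RegEx value, it is generally safe to escape it. Avoid escaping letters and digits, because sequences such as \d and \b have their own meaning in many RegEx engines.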
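The escaping rules above can be sketched in Python, used here as an assumed stand-in for the Angelfish engine. The sample strings are made up, and re.escape is a Python helper rather than an Angelfish feature.

```python
import re

# Escape each special character individually: \?\.\* matches the literal text ?.*
literal_run = re.search(r"\?\.\*", "find ?.* here")

# Two backslashes in the pattern match one literal backslash in the value
path = "C:\\logs"  # a Python string holding C:\logs (one backslash)
backslash_hit = re.search(r"\\", path)

# re.escape adds the backslashes needed to treat a string literally
auto_escaped = re.escape("a.b")

print(literal_run.group())
print(backslash_hit.group())
print(auto_escaped)
```

When typing a pattern by hand, the same result comes from placing a backslash before each special character.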