RegEx: Regular Expressions

Creation date: 4/1/2022 2:43 PM    Updated: 5/18/2022 3:26 PM
Regular Expressions (RegEx) are used to identify patterns of text. This article is a basic introduction to RegEx and covers the common patterns you might use in Angelfish.

RegEx is used throughout Angelfish to perform tasks like:
  • matching a group of log files in a Datasource
  • identifying hits to exclude from processing via a Filter
  • isolating Pages in a subdirectory
  • specifying the Hostname for a Profile

RegEx provides a flexible way to describe what the pattern looks like, using a combination of characters. 


QUICK REFERENCE


^   Caret: Match from the beginning of the field
$   Dollar: Match to the end of the field
.   Period: Match any single character
|   Pipe: OR
*   Asterisk: Match zero or more of the previous item
?   Question Mark: Match zero or one of the previous item
[]  Brackets: Match one item in this list
()  Parentheses: Match contents of parentheses as a single item
+   Plus Sign: Match one or more of the previous item
\   Backslash: Escape symbol for any of the above characters
.*  Wildcard - select all
.+  Wildcard - only select non-empty string


ANCHORS: ^ $


Anchors match a specified pattern from the beginning or from the end of a field. The caret and dollar symbols are anchors.

The caret ^ matches a pattern from the beginning of the field. For example:

^/page/ matches the following:
  • /page/default.aspx
  • /page/comm/2021/files.html
  • /page/media/image.png

The following values are not matched:
  • /subsite/page/default.aspx  (/page/ not at beginning)
  • /pages/contact.html  (begins with /pages/, not /page/)
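
If you want to check the caret behavior yourself, here is a minimal sketch. It uses Python's re module as a stand-in, so it assumes Angelfish's RegEx engine treats this pattern the same way; the variable names are only for illustration.

```python
import re

# re.search scans the whole string, so the caret is what pins
# the match to the beginning of the field.
pattern = re.compile(r"^/page/")

fields = [
    "/page/default.aspx",
    "/page/comm/2021/files.html",
    "/page/media/image.png",
    "/subsite/page/default.aspx",
    "/pages/contact.html",
]

caret_results = {field: bool(pattern.search(field)) for field in fields}
for field, matched in caret_results.items():
    print(f"{field}: {'match' if matched else 'no match'}")
```

The first three values print "match" and the last two print "no match".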
 
The dollar symbol $ matches a pattern to the end of the field. For example:

internal.corp$ matches the following:
  • finance.internal.corp
  • media.internal.corp
  • home.internal.corp

The following values are not matched:
  • finance.internal.dev.corp  (does not end in internal.corp)
  • home.internal.com  (does not end in internal.corp)
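
The dollar symbol can be checked the same way. This is a sketch in Python's re module, used here as an assumption-laden stand-in for Angelfish's engine; the pattern is copied as written above, with its periods left unescaped.

```python
import re

# The dollar symbol pins the match to the end of the field.
pattern = re.compile(r"internal.corp$")

hostnames = [
    "finance.internal.corp",
    "media.internal.corp",
    "home.internal.corp",
    "finance.internal.dev.corp",
    "home.internal.com",
]

dollar_results = {host: bool(pattern.search(host)) for host in hostnames}
for host, matched in dollar_results.items():
    print(f"{host}: {'match' if matched else 'no match'}")
```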

You can combine both anchors in a single pattern to require an exact match. For example, this pattern matches only the Username gfitz:
  • ^gfitz$
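
A quick way to see why both anchors matter is to test the pattern against longer values. The sketch below uses Python's re module for illustration; the extra usernames are hypothetical and are not taken from any Angelfish data.

```python
import re

# With both anchors, the whole field must be exactly "gfitz".
exact_user = re.compile(r"^gfitz$")

# Hypothetical usernames for illustration only.
usernames = ["gfitz", "gfitzgerald", "mgfitz"]

exact_results = {name: bool(exact_user.search(name)) for name in usernames}
print(exact_results)
```

Only gfitz matches. Without the dollar symbol, gfitzgerald would also match; without the caret, mgfitz would.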

Common Use Cases for Anchors
  • Filters
  • Profile Config: Hostname(s), Results Page Stem
  • Report Search Field


RANGES: [] ()


RegEx is used to match individual characters, combinations of characters, and ranges of characters.

Brackets [] allow you to specify individual characters that appear in the string. A bracket expression matches exactly one character from its list, not a whole string.

  • [agf] matches a or g or f
  • [0123] matches 0 or 1 or 2 or 3
 
Rather than typing individual characters, you can type a range in a bracket. For example:

  • [a-z] matches any lowercase letter
  • [A-Z] matches any uppercase letter
  • [0-9] matches any single digit
  • [a-z0-9] matches any lowercase letter or digit
  • [a-zA-Z0-9] matches any letter or digit
  • [2-4x-z] matches 2 or 3 or 4 or x or y or z
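
To see that brackets work one character at a time, you can list every character a bracket expression picks out of a string. This is a minimal sketch in Python's re module (an assumption, since Angelfish's engine is not Python); the sample strings are made up.

```python
import re

# findall returns each single character that the bracket expression matches.
list_hits = re.findall(r"[agf]", "angelfish")
range_hits = re.findall(r"[2-4x-z]", "a1b2c3x9z")
alnum_hits = re.findall(r"[a-zA-Z0-9]", "Log-2022!")

print(list_hits)   # ['a', 'g', 'f']
print(range_hits)  # ['2', '3', 'x', 'z']
print(alnum_hits)  # ['L', 'o', 'g', '2', '0', '2', '2']
```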

Parentheses () allow you to match a string of characters in a specific order, like (blue) and (green). 

To match multiple strings, enclose them in parentheses and use a pipe | between each string.
 
  • (blue|green)
  • default.(aspx|html)$
  • ^/(page|image)/
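
The three patterns above can be tried against a few sample values. The sketch below uses Python's re module as a stand-in for Angelfish's engine; the helper name check and all sample values are hypothetical.

```python
import re

def check(pattern, values):
    """Return {value: True/False} for a search of pattern in each value."""
    compiled = re.compile(pattern)
    return {value: bool(compiled.search(value)) for value in values}

# Either string may appear anywhere in the field.
color_results = check(r"(blue|green)", ["light blue", "green", "red"])

# The field must end in default + any character + aspx or html.
page_results = check(
    r"default.(aspx|html)$",
    ["/page/default.aspx", "/default.html", "/default.php", "/default.html?id=1"],
)

# The field must begin with /page/ or /image/.
dir_results = check(
    r"^/(page|image)/",
    ["/page/a.html", "/image/logo.png", "/video/clip.mp4", "/pages/a.html"],
)

for results in (color_results, page_results, dir_results):
    print(results)
```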

Common Use Cases for Ranges
  • Datasources
  • Filters
  • Report Search Field
 

QUANTIFIERS: ? + *


With RegEx, you can specify the number of times a pattern should occur.

A question mark ? matches zero or one of the previous item, which makes that item optional. The previous item can be a single character or a group in parentheses.

^crawl? matches the following:
  • crawl
  • craw
  • crawfish  (the l is optional, making ^craw the match pattern)

(www\.)?website\.com$ matches the following:
  • www.website.com
  • website.com
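
Both question mark patterns can be tested with a short script. This is a sketch using Python's re module as an assumed stand-in for Angelfish's engine; the values crab, website.com.au, and mywebsite.com are hypothetical additions.

```python
import re

# The l is optional, so anything beginning with "craw" matches.
optional_l = re.compile(r"^crawl?")
crawl_results = {
    word: bool(optional_l.search(word))
    for word in ["crawl", "craw", "crawfish", "crab"]
}

# The group (www\.) is optional as a whole.
optional_www = re.compile(r"(www\.)?website\.com$")
host_results = {
    host: bool(optional_www.search(host))
    for host in ["www.website.com", "website.com", "website.com.au", "mywebsite.com"]
}

print(crawl_results)
print(host_results)
```

Note that mywebsite.com matches as well: the pattern has no caret, so any text may come before website.com. Adding ^ to the front of the pattern would exclude it.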

A plus sign + matches one or more of the previous item.

/+ matches the following slash patterns:
  • /
  • //
  • ////////

.+ is a wildcard that only matches a non-empty field.

An asterisk * matches zero or more of the previous item.

.* is a wildcard that matches an empty or non-empty field.
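
The difference between the plus sign and the asterisk is easiest to see on an empty field. The sketch below uses Python's re module for illustration and assumes Angelfish's engine behaves the same way; the anchors are added here so each wildcard has to cover the whole field.

```python
import re

# "/+" matches a run of one or more slashes; group() returns the whole run.
slash_runs = [re.search(r"/+", s).group() for s in ["/", "//", "////////"]]
print(slash_runs)

# On an empty field, .+ fails (it needs at least one character)
# while .* succeeds (zero characters is enough).
plus_on_empty = bool(re.search(r"^.+$", ""))
star_on_empty = bool(re.search(r"^.*$", ""))

# On a non-empty field, both wildcards match.
plus_on_text = bool(re.search(r"^.+$", "angelfish"))
star_on_text = bool(re.search(r"^.*$", "angelfish"))

print(plus_on_empty, star_on_empty, plus_on_text, star_on_text)
```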

Common Use Cases for Quantifiers
  • Datasources  e.g. /logs/2022/u_ex220[1-6].*
  • Filters
  • Report Search Field
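
The Datasource pattern above combines a bracket range with a wildcard. As a rough sketch, here is how it behaves against some hypothetical file names; the names assume IIS-style logs of the form u_exYYMMDD.log, which is an assumption and not something the pattern itself requires. Python's re module stands in for Angelfish's engine.

```python
import re

# u_ex220 followed by one character from 1 to 6, then anything (or nothing).
datasource = re.compile(r"/logs/2022/u_ex220[1-6].*")

# Hypothetical file names for illustration only.
log_files = [
    "/logs/2022/u_ex220115.log",  # 2201..: matches
    "/logs/2022/u_ex220630.log",  # 2206..: matches
    "/logs/2022/u_ex220731.log",  # 2207..: 7 is outside [1-6]
    "/logs/2022/u_ex221001.log",  # 2210..: the 0 after u_ex22 is missing
]

datasource_results = {name: bool(datasource.search(name)) for name in log_files}
for name, matched in datasource_results.items():
    print(f"{name}: {'match' if matched else 'no match'}")
```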


ESCAPE SYMBOL: \


Occasionally you'll want to match a character that has a RegEx value.  For example:

.com matches the following:
  • website.com
  • marcom.net  (the r is matched by the .)

The backslash \ allows you to escape the value of a RegEx character. 

Using the above example, you can escape the RegEx value of the period by adding a backslash, like this:
\.com

This forces a match pattern of ".com" (dot com) instead of "any single character followed by com".
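
The effect of the backslash can be compared side by side. This is a minimal sketch in Python's re module, assumed here to behave like Angelfish's engine for these two patterns.

```python
import re

values = ["website.com", "marcom.net"]

# Unescaped: the period stands for any single character.
unescaped_results = {v: bool(re.search(r".com", v)) for v in values}

# Escaped: the period must be a literal dot.
escaped_results = {v: bool(re.search(r"\.com", v)) for v in values}

print(unescaped_results)  # {'website.com': True, 'marcom.net': True}
print(escaped_results)    # {'website.com': True, 'marcom.net': False}
```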

To match a series of special characters in a row, escape each character individually.
  • $? is matched by \$\?

To match a single literal backslash, type two backslashes: the first backslash escapes the RegEx value of the second.
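
Here is a short sketch of both cases in Python's re module, used as an assumed stand-in for Angelfish's engine. The sample strings are hypothetical. The patterns are written as Python raw strings (r"...") so that Python itself leaves the backslashes alone and passes them to the RegEx engine unchanged.

```python
import re

# Each special character gets its own backslash.
dollar_question = bool(re.search(r"\$\?", "price=$?"))
no_dollar_question = bool(re.search(r"\$\?", "price=$5"))

# Two backslashes in the pattern match one literal backslash in the text.
sample_path = "C:\\logs\\2022"  # the actual text is C:\logs\2022
backslashes_found = re.findall(r"\\", sample_path)

print(dollar_question, no_dollar_question)
print(len(backslashes_found))  # 2
```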

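If you're unsure whether a punctuation character has a RegEx value, you can generally escape it with impunity. Avoid escaping letters or digits, though: in most RegEx engines a backslash gives them a special meaning (for example, \d matches any digit).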