How to be happy using Regex

2020-05-12

In this article, we’ll make your eyes cry blood. With joy.

Regular expressions (regex or regexp) are extremely useful for extracting information from text by searching for one or more matches of a specific search pattern (i.e., a specific sequence of ASCII or Unicode characters).

The fields of application range from validation to parsing/replacing strings, including translating data to other formats and web scraping.

One of the most interesting features is that, once you’ve learned the syntax, you’ll be able to use this tool in (almost) all programming languages (JavaScript, Golang, Kotlin, Java, C#, Python, Perl, and many others) with only minor differences in syntax and supported features.

Let’s start by examining some examples and explanations.

Basic Topics

Anchors — ^ and $

Regex            Result
^Homer           Matches any string that starts with Homer
Marge$           Matches any string that ends with Marge
^Bart Simpson$   Exact string match (starts and ends with Bart Simpson)
Lisa             Matches any string that contains the text Lisa
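
To see the anchors in action, here is a minimal sketch using Python's re module. Python is my choice for the examples, not something the patterns depend on, and the sample strings are made up for illustration.

```python
import re

# Hypothetical sample strings, chosen only to exercise the patterns above.
starts_with_homer = re.search(r"^Homer", "Homer Simpson") is not None
ends_with_marge = re.search(r"Marge$", "Homer and Marge") is not None
exact_bart = re.search(r"^Bart Simpson$", "Bart Simpson") is not None
# Extra text after the name breaks the exact match.
not_exact_bart = re.search(r"^Bart Simpson$", "Bart Simpson Jr.") is not None
# Without anchors the pattern can match anywhere in the string.
contains_lisa = re.search(r"Lisa", "Maggie, Lisa and Bart") is not None
```

Every flag is True except not_exact_bart, because the trailing "Jr." prevents $ from matching right after "Simpson".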

Quantifiers - * + ? and {}

Regex        Result
abc*         Matches a string that has ab followed by zero or more c
abc+         Matches a string that has ab followed by one or more c
abc?         Matches a string that has ab followed by zero or one c
abc{2}       Matches a string that has ab followed by 2 c
abc{2,}      Matches a string that has ab followed by 2 or more c
abc{2,5}     Matches a string that has ab followed by 2 up to 5 c
a(bc)*       Matches a string that has a followed by zero or more copies of the string bc
a(bc){2,5}   Matches a string that has a followed by 2 up to 5 copies of the string bc
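
A quick way to check how a quantifier behaves is to test whole strings against it. The sketch below uses Python's re.fullmatch; the helper name fully_matches is hypothetical, introduced only for this example.

```python
import re


def fully_matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# Each list holds one True/False result per sample string.
zero_or_more = [fully_matches(r"abc*", s) for s in ("ab", "abc", "abccc")]
one_or_more = [fully_matches(r"abc+", s) for s in ("ab", "abc", "abccc")]
optional = [fully_matches(r"abc?", s) for s in ("ab", "abc", "abcc")]
two_to_five = [fully_matches(r"abc{2,5}", s) for s in ("abc", "abcc", "abcccccc")]
# The quantifier applies to the whole group (bc), not just to c.
grouped = [fully_matches(r"a(bc){2,5}", s) for s in ("abc", "abcbc", "abcc")]
```

Note that in abc{2,5} the quantifier only applies to c, while in a(bc){2,5} it applies to the whole group, which is why "abcc" fails the last pattern.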

OR Operator — | or []

Regex     Result
a(b|c)    Matches a string that has a followed by b or c (and captures b or c)
a[bc]     Same as above, but without capturing b or c
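
The difference between the two forms only shows up when you look at what was captured. A small sketch in Python (the sample string "xacx" is arbitrary):

```python
import re

with_group = re.search(r"a(b|c)", "xacx")
with_class = re.search(r"a[bc]", "xacx")

# Both patterns match the same text...
matched_text = (with_group.group(0), with_class.group(0))
# ...but only the parenthesized version records what followed the a.
captured = with_group.groups()
not_captured = with_class.groups()  # no capturing group, so an empty tuple
```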

Character Classes — \d \w \s and .

Regex   Result
\d      Matches a single character that is a digit
\w      Matches a word character (alphanumeric character plus underscore)
\s      Matches a whitespace character (includes tabs and line breaks)
.       Matches any character (except a line break, by default)

Negated Characters

Use . with care: a character class or a negated character class (covered below) is often faster and more precise.

\d, \w and \s also have their negations: \D, \W and \S respectively.

For example, \D performs the inverse match of \d: it matches a single non-digit character.
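
Here is a short sketch of the character classes and one negation, assuming Python's re module and a made-up sample string:

```python
import re

sample = "Room 42\tis_open"  # hypothetical sample text

digits = re.findall(r"\d", sample)       # every single digit
whitespace = re.findall(r"\s", sample)   # the space and the tab
words = re.findall(r"\w+", sample)       # runs of word characters
# \D is the negation of \d, so removing it leaves only the digits.
only_digits = re.sub(r"\D", "", sample)
# By default the dot does not match a line break.
dot_matches = re.findall(r".", "a\nb")
```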

Escaping Literals

To match the characters ^ . [ $ ( ) | * + ? { \ literally, you must escape them with a backslash \, as they have special meaning. For example, \$\d matches a $ followed by a digit.

You can also find non-printable characters, such as tabs \t, new lines \n, and carriage returns \r.
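
The effect of escaping is easy to verify. In this Python sketch the sample text is invented; re.escape is the standard-library helper that escapes special characters for you.

```python
import re

price_text = "Total: $5 or 5$"  # hypothetical sample text

# Escaped: a literal dollar sign followed by a digit.
escaped = re.findall(r"\$\d", price_text)
# Unescaped: $ is the end-of-string anchor, so a digit can never follow it.
unescaped = re.findall(r"$\d", price_text)

# re.escape builds a pattern that matches the given text literally.
literal_found = re.search(re.escape("1+1=2"), "so 1+1=2 holds") is not None

# Non-printable characters can be matched with their escape sequences.
tabs = re.findall(r"\t", "a\tb")
```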

Flags

A regex is usually written in the form /abc/, where the search pattern is delimited by two slash characters /. At the end, we can specify one or more of the following flags (they can be combined):

  • g (global): does not return after the first match, restarting subsequent searches from the end of the previous match
  • m (multi-line): when enabled, ^ and $ will match the start and end of a line, instead of the whole string
  • i (insensitive): makes the entire expression case-insensitive (for example, /aBc/i would match AbC)
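
The /abc/ form with trailing flags is not available in every language. As an illustration, Python (used for the examples here by my own choice) has no regex literal and no g flag: re.search stops at the first match, re.findall plays the role of g, and re.IGNORECASE and re.MULTILINE correspond to i and m. A minimal sketch with a made-up input:

```python
import re

text = "abc\nABC\nxabc"  # hypothetical three-line input

# Without a "global" search only the first match is returned.
first_only = re.search(r"abc", text).group(0)
# findall keeps searching after each match, like the g flag.
all_matches = re.findall(r"abc", text)
# Case-insensitive, like the i flag.
ignore_case = re.findall(r"abc", text, re.IGNORECASE)
# Multi-line: ^ now matches at the start of every line, like the m flag.
line_starts = re.findall(r"^abc", text, re.IGNORECASE | re.MULTILINE)
# Without it, ^ only matches at the start of the whole string.
string_start = re.findall(r"^abc", text, re.IGNORECASE)
```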

Intermediate Topics

Grouping and Capturing — ()

Regex         Result
a(bc)         Parentheses create a capturing group with the value bc
a(?:bc)*      Using ?: we disable the capturing group
a(?<foo>bc)   Using ?<foo> we give the group a name

This operator is very useful when we need to extract information from strings or data using our preferred programming language. The occurrences captured by the groups are exposed as a classic array: we access their values by specifying an index in the match result.

If we choose to name the groups (using (?<foo>...)), we can retrieve the group values by treating the match result as a dictionary/map whose keys are the group names.
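
Here is how index and name access might look in Python. One caveat: Python's re module spells named groups (?P<name>...) rather than (?<name>...), so the pattern below is adapted accordingly. The group names year and month are hypothetical, and the date is simply the one from this post.

```python
import re

text = "posted on 2020-05-12"

# Capturing groups are read by index (index 0 is the whole match).
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
by_index = (date.group(1), date.group(2), date.group(3))

# Named groups are read like a dictionary. Python uses (?P<name>...).
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", text)
by_name = named.groupdict()

# A non-capturing group still groups, but exposes nothing to read back.
non_capturing = re.search(r"a(?:bc)*", "abcbcd")
non_capturing_text = non_capturing.group(0)
non_capturing_groups = non_capturing.groups()
```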

Bracket Expressions — []

Regex          Result
[abc]          Matches a string that has a or b or c (same as a|b|c)
[a-c]          Same as above
[a-fA-F0-9]    Matches a single hexadecimal digit, case-insensitively
[0-9]%         Matches a string that has a character from 0 to 9 before a % sign
[^a-zA-Z]      Matches a character that is not a letter from a to z or from A to Z. In this case, ^ negates the expression

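Remember that inside bracket expressions most special characters (such as . * + ? ( ) | $) lose their special powers, so they don’t need to be escaped. The exceptions in most flavors are ], \, ^ (at the start) and - (between two characters); in POSIX bracket expressions even the backslash \ is taken literally.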
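
The bracket expressions above can be tried out with a few lines of Python. The sample strings are invented for the example.

```python
import re

# A single hexadecimal digit, upper or lower case.
hex_digits = re.findall(r"[a-fA-F0-9]", "0x1F g")
# A digit immediately followed by a percent sign.
percentages = re.findall(r"[0-9]%", "50% of 7%")
# Negated class: anything that is not an ASCII letter.
non_letters = re.findall(r"[^a-zA-Z]", "a1 b!")
# Inside brackets . + and * are plain characters, no escaping needed.
literal_specials = re.findall(r"[.+*]", "1.5+2*3")
```

Note that [0-9]% only consumes one digit, so "50%" yields "0%" rather than the whole number.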

Greedy and Lazy Matching

The quantifiers * + {} are greedy operators, so they expand the match as far as they can through the provided text. For example, <.+> matches <div>simple div</div> in "This is a test of <div>simple div</div>". To match only the div tag, we can add a ? to make the quantifier lazy:

<.+?> matches any character one or more times between < and >, expanding only as far as needed

Note that a better solution is to avoid . in favor of a stricter regex:

<[^<>]+> matches any character except < or >, one or more times, between < and >
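
The three patterns can be compared side by side on the sample sentence. A minimal Python sketch:

```python
import re

text = "This is a test of <div>simple div</div>."

# Greedy: runs from the first < to the last > it can reach.
greedy = re.findall(r"<.+>", text)
# Lazy: stops at the first > that completes a match.
lazy = re.findall(r"<.+?>", text)
# Stricter class: never crosses another < or >.
strict = re.findall(r"<[^<>]+>", text)
```

The lazy and the strict versions find the same two tags here; the strict one simply rules out crossing a bracket by construction instead of relying on backtracking.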


Advanced Topics

Boundaries - \b and \B

\babc\b performs a "whole words only" search

\b represents an anchor, similar to ^ and $. It matches positions where one side is a word character (like \w) and the other side is not a word character (for example, the beginning of the string or a space character).

It has a negation, \B, which matches every position where \b does not. This is useful when we want to find a search pattern fully surrounded by word characters: \Babc\B matches abc only when it has a word character on both sides.
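
To make the two boundaries concrete, this Python sketch collects the start index of every match in a made-up string:

```python
import re

text = "abc abcd xabcx"  # hypothetical sample text

# \b on both sides: only the standalone word at index 0 qualifies.
whole_word = [m.start() for m in re.finditer(r"\babc\b", text)]
# \B on both sides: only the abc buried inside "xabcx" qualifies.
inside_word = [m.start() for m in re.finditer(r"\Babc\B", text)]
```

The abc inside "abcd" matches neither pattern: it starts at a word boundary but does not end at one.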

Back-references — \1

RegexResult
([abc])\1              Using \1, it matches the same text that was matched by the first capturing group
([abc])([de])\2\1      We can use \2 (\3, \4, etc.) to identify the same text that was matched by the second (third, fourth, etc.) capturing group
(?<foo>[abc])\k<foo>   We put the name foo on the group and reference it later (\k<foo>). The result is the same as the first regex
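
Back-references are easiest to understand on small inputs. The sketch below uses Python, where numbered back-references work as shown above but the named form is written (?P<foo>...) with (?P=foo) instead of (?<foo>...) with \k<foo>. The sample strings are arbitrary.

```python
import re

# \1 must repeat exactly what group 1 matched, so "bc" is skipped.
doubled = [m.group(0) for m in re.finditer(r"([abc])\1", "aa bc cc")]

# \2 then \1 mirrors the two captured characters.
mirrored = re.search(r"([abc])([de])\2\1", "xbeebx").group(0)

# Named back-reference, in Python's own spelling.
named = re.search(r"(?P<foo>[abc])(?P=foo)", "xccx").group(0)
```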

Look-ahead and Look-behind — (?=) and (?<=)

Regex     Result
e(?=r)    Matches e only if followed by r, but r will not be part of the overall match of the regular expression
(?<=r)i   Matches i only if preceded by r, but r will not be part of the overall match of the regular expression

You can also use the negation operator:

Regex     Result
e(?!r)    Matches e only if not followed by r, but r will not be part of the overall match of the regular expression
(?<!c)i   Matches i only if not preceded by c, but c will not be part of the overall match of the regular expression
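
The four look-around forms can be checked by recording where they match. A minimal Python sketch with invented sample strings:

```python
import re

ahead_text = "her pet"    # hypothetical sample text
behind_text = "ri ci ti"  # hypothetical sample text

# Look-ahead: the e in "her" is followed by r, the e in "pet" is not.
followed_by_r = [m.start() for m in re.finditer(r"e(?=r)", ahead_text)]
not_followed_by_r = [m.start() for m in re.finditer(r"e(?!r)", ahead_text)]

# Look-behind: check the character just before each i.
preceded_by_r = [m.start() for m in re.finditer(r"(?<=r)i", behind_text)]
not_preceded_by_c = [m.start() for m in re.finditer(r"(?<!c)i", behind_text)]

# The looked-at character is never part of the matched text.
matched_text = re.search(r"e(?=r)", ahead_text).group(0)
```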

Useful Tools

  • Regex101 is a website where you can visualize your regexes in a didactic way and export code snippets for your favorite language.

  • Making tables in Markdown (which is how this site is written) is complicated, but Markdown Tables makes the task easier.


Conclusion

The power of regex should not be underestimated, young grasshopper. Regex has many fields of application, and I’m sure you’ve come across at least one of these tasks during your developer career:

  • data validation: for example, check if a date or an email is valid
  • data scraping: especially web scraping, find all pages that contain a certain set of words, possibly in a specific order
  • data wrangling: transform “raw” data into another format
  • string parsing: for example, grab all URL GET parameters, capture text inside a set of parentheses
  • string replacement: replace , with ;, convert to lowercase, etc.
  • syntax highlighting, file renaming and many other things involving strings

UPDATE: I wrote a new article with the most used regexes, which you can check out here: Useful Regex for your daily life