2020-05-12


In this article, we’ll make your eyes cry blood. With joy.
Regular expressions (regex or regexp) are extremely useful for extracting information from any text by searching for one or more matches of a specific search pattern (i.e., a specific sequence of ASCII or Unicode characters).
The fields of application range from validation to parsing/replacing strings, including translating data to other formats and web scraping.
One of the most interesting features is that, once you’ve learned the syntax, you can use this tool in (almost) all programming languages (JavaScript, Golang, Kotlin, Java, C#, Python, Perl, and many others) with only minor differences in supported features and syntax.
Let’s start by examining some examples and explanations.
| Regex | Result |
|---|---|
| ^Homer | Matches any string that starts with Homer |
| Marge$ | Matches a string that ends with Marge |
| ^Bart Simpson$ | Exact string match (starts and ends with Bart Simpson) |
| Lisa | Matches any string that contains the text Lisa |
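
A minimal sketch of the four anchor patterns above, using Python's `re` module. The language choice and the sample names are assumptions made for illustration; only the patterns come from the table. `re.search` scans the whole string, so the anchors alone decide where a match may start or end.

```python
import re

# Hypothetical sample strings, chosen only to exercise each pattern.
names = ["Homer Simpson", "Marge", "Bart Simpson", "Lisa and Bart Simpson"]

def select(pattern):
    """Return the sample strings in which the pattern is found."""
    return [name for name in names if re.search(pattern, name)]

starts_with_homer = select(r"^Homer")       # ['Homer Simpson']
ends_with_marge = select(r"Marge$")         # ['Marge']
exactly_bart = select(r"^Bart Simpson$")    # ['Bart Simpson']
contains_lisa = select(r"Lisa")             # ['Lisa and Bart Simpson']

print(starts_with_homer, ends_with_marge, exactly_bart, contains_lisa)
```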

| Regex | Result |
|---|---|
| abc* | Matches a string that has ab followed by zero or more c |
| abc+ | Matches a string that has ab followed by one or more c |
| abc? | Matches a string that has ab followed by zero or one c |
| abc{2} | Matches a string that has ab followed by 2 c |
| abc{2,} | Matches a string that has ab followed by 2 or more c |
| abc{2,5} | Matches a string that has ab followed by 2 up to 5 c |
| a(bc)* | Matches a string that has a followed by zero or more copies of the string bc |
| a(bc){2,5} | Matches a string that has a followed by 2 up to 5 copies of the string bc |
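
The quantifiers above can be checked with a short Python sketch. It assumes `re.fullmatch`, which requires the entire string to match and is therefore stricter than the "has" wording in the table; the helper name `whole_match` and the test strings are made up for this example.

```python
import re

def whole_match(pattern, text):
    """True when the pattern matches the entire text, not just a part of it."""
    return re.fullmatch(pattern, text) is not None

zero_or_more = whole_match(r"abc*", "ab")            # True: zero c is allowed
one_or_more = whole_match(r"abc+", "ab")             # False: at least one c is required
optional = whole_match(r"abc?", "abcc")              # False: at most one c
exactly_two = whole_match(r"abc{2}", "abcc")         # True
two_or_more = whole_match(r"abc{2,}", "abccccc")     # True: five c
two_to_five = whole_match(r"abc{2,5}", "abcccccc")   # False: six c is too many
group_repeat = whole_match(r"a(bc)*", "abcbc")       # True: bc repeated twice
group_range = whole_match(r"a(bc){2,5}", "abc")      # False: bc appears only once

print(zero_or_more, one_or_more, optional, exactly_two)
print(two_or_more, two_to_five, group_repeat, group_range)
```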

| Regex | Result |
|---|---|
| a(b\|c) | Matches a string that has a followed by b or c (and captures b or c) |
| a[bc] | Same as above, but without capturing b or c |
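
To make the capturing difference visible, here is a small sketch in Python (an assumed language choice; the input string is invented). Both patterns match the same text, but only the parenthesized version exposes a captured group.

```python
import re

with_group = re.search(r"a(b|c)", "xacx")
with_class = re.search(r"a[bc]", "xacx")

group_match = with_group.group(0)       # 'ac'
group_captures = with_group.groups()    # ('c',): the alternative that matched
class_match = with_class.group(0)       # 'ac'
class_captures = with_class.groups()    # (): nothing is captured

print(group_match, group_captures, class_match, class_captures)
```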

| Regex | Result |
|---|---|
| \d | Matches a single character that is a digit |
| \w | Matches a word character (alphanumeric character plus underscore) |
| \s | Matches a whitespace character (includes tabs and line breaks) |
| . | Matches any character (by default, except line breaks) |
Use the . operator with care, since a character class or negated character class (which we’ll cover next) is often faster and more precise.
\d, \w and \s also have their negations: \D, \W and \S, respectively.
For example, \D performs the inverse match of \d (it matches a single non-digit character).
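
A quick sketch of these classes and one negation in Python (the language and the sample text are assumptions for illustration). `re.findall` returns every non-overlapping match, which makes it easy to see what each class picks up.

```python
import re

text = "Room 42\tis_open"   # hypothetical sample containing a space and a tab

digits = re.findall(r"\d", text)        # ['4', '2']
words = re.findall(r"\w+", text)        # ['Room', '42', 'is_open']
spaces = re.findall(r"\s", text)        # [' ', '\t']
non_digits = re.findall(r"\D+", text)   # ['Room ', '\tis_open']

# By default the dot skips line breaks.
dots = re.findall(r".", "a\nb")         # ['a', 'b']

print(digits, words, spaces, non_digits, dots)
```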

To be taken literally, the characters ^ . [ $ ( ) | * + ? { \ must be escaped with a backslash \, as they have special meaning. For example, \$\d finds a $ before a digit.
You can also find non-printable characters, such as tabs \t, new lines \n, and carriage returns \r.
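
The escaping rule can be sketched as follows in Python. The sample strings are invented, and `re.escape` is a Python-specific convenience that the article does not mention; other languages may offer a similar helper under a different name.

```python
import re

# \$ matches a literal dollar sign; an unescaped $ would be the end-of-string anchor.
price = re.search(r"\$\d", "Total: $5, not 7$")
price_text = price.group(0)                      # '$5'

# \t matches a tab character.
tabs = re.findall(r"\t", "name\tage")            # ['\t']

# re.escape builds an escaped pattern from arbitrary literal text.
literal = "1+1=2?"
found_escaped = re.search(re.escape(literal), "Is 1+1=2? Yes.") is not None   # True

# Without escaping, + and ? act as quantifiers, so the literal text is not found.
found_unescaped = re.search(literal, "Is 1+1=2? Yes.") is not None            # False

print(price_text, tabs, found_escaped, found_unescaped)
```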
A regex is usually written in the form /abc/, where the search pattern is delimited by two slash characters /. After the closing slash, we can specify flags (which can be combined). For example, the i flag makes the whole expression case-insensitive (/aBc/i would match AbC).

| Regex | Result |
|---|---|
| a(bc) | Parentheses create a capturing group with the value bc |
| a(?:bc)* | Using ?: we disable the capturing group |
| a(?<foo>bc) | Using ?<foo> we put a name to the group |

This operator is very useful when we need to extract information from strings or data using our preferred programming language. The occurrences captured by the groups are exposed as a classic array: we access their values by specifying an index in the match result.
If we choose to name the groups (like (?<foo>...)), we can retrieve the group values using the match result as a dictionary/map whose keys are the group names.
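
As a sketch of index-based and name-based access, here is a Python version (an assumed language choice). One flavor difference matters here: Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`. The date string and the group names `year` and `month` are made up for the example.

```python
import re

text = "Published on 2020-05-12."

# Numbered groups: values are read by index, starting at 1.
numbered = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
all_groups = numbered.groups()          # ('2020', '05', '12')
first_group = numbered.group(1)         # '2020'

# Named groups: values are read like a dictionary.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", text)
by_name = named.groupdict()             # {'year': '2020', 'month': '05'}
year = named.group("year")              # '2020'

# Non-capturing group: it still groups for the quantifier but captures nothing.
non_capturing = re.search(r"a(?:bc)*", "abcbc")
nc_match = non_capturing.group(0)       # 'abcbc'
nc_groups = non_capturing.groups()      # ()

print(all_groups, first_group, by_name, year, nc_match, nc_groups)
```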
| Regex | Result |
|---|---|
| [abc] | Matches a string that has a or b or c (same as a\|b\|c) |
| [a-c] | Same as above |
| [a-fA-F0-9] | A string that represents a single hexadecimal digit, case-insensitive |
| [0-9]% | A string that has a character from 0 to 9 before a % sign |
| [^a-zA-Z] | A string that has a character that is not a letter from a to z or from A to Z. In this case, ^ is used as negation of the expression |
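Remember that inside bracket expressions most special characters lose their special powers, so the escape rule does not apply to them. In most flavors the exceptions are the backslash \, the closing bracket ], the caret ^ (when it comes first), and the hyphen - (between two characters).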
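
The bracket expressions from the table, sketched in Python with invented sample strings. The last line illustrates that characters such as . + * $ are matched literally inside brackets in Python's flavor.

```python
import re

hex_digits = re.findall(r"[a-fA-F0-9]", "0xBEEF!")       # ['0', 'B', 'E', 'E', 'F']
percentages = re.findall(r"[0-9]%", "50% off, 7% tax")   # ['0%', '7%']
non_letters = re.findall(r"[^a-zA-Z]", "a1 b2")          # ['1', ' ', '2']

# Inside brackets, these characters need no backslash.
literal_symbols = re.findall(r"[.+*$]", "1.5+2*3$")      # ['.', '+', '*', '$']

print(hex_digits, percentages, non_letters, literal_symbols)
```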

The quantifiers * + {} are greedy operators, so they expand the match as far as possible through the provided text.
For example, <.+> matches <div>simple div</div> in This is a test of <div>simple div</div>. To capture only the div tag, we can use a ? to make it lazy:
<.+?> matches any character one or more times between < and >, expanding only as needed
Note that a better solution avoids . in favor of a stricter regex:
<[^<>]+> matches any character except < or >, one or more times, between < and >
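
The three patterns can be compared side by side on the sentence from the example. This is a sketch in Python (an assumed language choice); `re.findall` lists every match each pattern produces.

```python
import re

text = "This is a test of <div>simple div</div>."

greedy = re.findall(r"<.+>", text)        # ['<div>simple div</div>']
lazy = re.findall(r"<.+?>", text)         # ['<div>', '</div>']
strict = re.findall(r"<[^<>]+>", text)    # ['<div>', '</div>']

print(greedy, lazy, strict)
```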
\babc\b performs a "whole words only" search
\b represents an anchor, like the caret and the dollar sign (^ and $). It matches positions where one side is a word character (like \w) and the other side is not a word character (for example, the beginning of the string or a space character).
It has a negation, \B, which matches every position where \b does not match. It is useful when we want to find a search pattern fully surrounded by word characters.
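
A short Python sketch of both boundaries, with made-up sample strings. Only the standalone occurrences satisfy \b on both sides, and only the embedded occurrence satisfies \B on both sides.

```python
import re

# Standalone abc (at the start and before the final dot) matches; abcd and xabc do not.
whole_words = re.findall(r"\babc\b", "abc abcd xabc abc.")   # ['abc', 'abc']

# Only the abc inside xabcx has word characters on both sides.
inside_words = re.findall(r"\Babc\B", "abc xabcx")           # ['abc']

print(whole_words, inside_words)
```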
| Regex | Result |
|---|---|
| ([abc])\1 | Using \1, it matches the same text that was matched by the first capturing group |
| ([abc])([de])\2\1 | We can use \2 (\3, \4, etc.) to identify the same text that was matched by the second (third, fourth, etc.) capturing group |
| (?<foo>[abc])\k<foo> | We put the name foo on the group and reference it later (\k<foo>). The result is the same as the first regex |
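
A sketch of back-references in Python (an assumed language choice, with invented inputs). Numbered references work as in the table; for the named form, Python's `re` module uses `(?P<foo>...)` and `(?P=foo)` instead of `(?<foo>...)` and `\k<foo>`.

```python
import re

doubled = re.search(r"([abc])\1", "xbbx").group(0)             # 'bb'
mirrored = re.search(r"([abc])([de])\2\1", "adda").group(0)    # 'adda'
named = re.search(r"(?P<foo>[abc])(?P=foo)", "xccx").group(0)  # 'cc'

# No character from the class is repeated here, so nothing matches.
no_repeat = re.search(r"([abc])\1", "abc")                     # None

print(doubled, mirrored, named, no_repeat)
```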

| Regex | Result |
|---|---|
| e(?=r) | Matches e only if followed by r, but r will not be part of the overall match of the regular expression |
| (?<=r)i | Matches i only if preceded by r, but r will not be part of the overall match of the regular expression |

And when using the negation operator:

| Regex | Result |
|---|---|
| e(?!r) | Matches e only if not followed by r, but r will not be part of the overall match of the regular expression |
| (?<!c)i | Matches i only if not preceded by c, but c will not be part of the overall match of the regular expression |
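
To show that the look-around part is checked but not consumed, this Python sketch (an assumed language choice, with invented inputs) replaces each match with its uppercase form. Only the matched letter changes; the neighboring r or c stays untouched.

```python
import re

lookahead = re.sub(r"e(?=r)", "E", "her pen")        # 'hEr pen'
lookbehind = re.sub(r"(?<=r)i", "I", "rib tin")      # 'rIb tin'
neg_lookahead = re.sub(r"e(?!r)", "E", "her pen")    # 'her pEn'
neg_lookbehind = re.sub(r"(?<!c)i", "I", "cit rib")  # 'cit rIb'

print(lookahead, lookbehind, neg_lookahead, neg_lookbehind)
```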

Regex101 is a website where you can visualize your regexes in a didactic way and export code snippets for your favorite language.
Making tables in Markdown (which is what this site is written in) is complicated, but Markdown Tables makes the task easier.
The power of regex should not be underestimated, young grasshopper. Regex has many fields of application, and I’m sure you’ve come across at least one of them during your developer career.
UPDATE: I wrote a new article with the most used regexes, which you can check here: Useful Regex for your daily life