What are regular expressions and how do you use them?
A regular expression, or regex for short, is a pattern that describes a set of text strings. For example, if you want to make sure that a person filling in a form submits text in exactly the format you require, you can use a regular expression to validate the input against that pattern.
As an example, let's say that you want to ask for a UK National Insurance Number (NIN). It has a pattern of 2 letters, followed by 6 digits, followed by another letter. A regular expression that corresponds to this pattern would be:
[a-zA-Z]{2}[0-9]{6}[a-zA-Z]
Here [a-z] means "any lower case letter", [A-Z] means "any upper case letter", so [a-zA-Z] means "any letter"; {2} means "repeat twice", and so on. Note that inside square brackets | is a literal pipe character rather than "or", so it should not be written between the two ranges.
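As a rough illustration of the form-validation idea above, here is a minimal Python sketch. The function name is_valid_nin is made up for this example, and the pattern only checks the simplified "two letters, six digits, one letter" shape described in the text, not the full official rules for National Insurance Numbers.

```python
import re

# Two letters, six digits, one letter (the simplified NIN shape from the text).
# Inside [...] a "|" would be a literal pipe, so the ranges are written side by side.
NIN_PATTERN = re.compile(r"[a-zA-Z]{2}[0-9]{6}[a-zA-Z]")


def is_valid_nin(text: str) -> bool:
    """Return True if the whole string has the simplified NIN shape."""
    # fullmatch requires the entire input to match, not just a prefix.
    return NIN_PATTERN.fullmatch(text) is not None


print(is_valid_nin("AB123456C"))   # True
print(is_valid_nin("AB12345C"))    # False: only five digits
print(is_valid_nin("A|123456C"))   # False: "|" is not a letter
```

Using fullmatch rather than search matters for validation: search would accept any longer string that merely contains a valid-looking substring.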
Some governments provide regular expressions for commonly used patterns. For example, see this Wikipedia article about UK postal codes.
For a quick cheat sheet on regular expressions, search for a 'Regular Expressions Guide' or read on.
Character | Legend | Example | Sample Match |
\d | Most engines: one digit | file_\d\d | file_25 |
\d | .NET, Python 3: one Unicode digit in any script | file_\d\d | file_9੩ |
\w | Most engines: "word character": ASCII letter, digit or underscore | \w-\w\w\w | A-b_1 |
\w | Python 3: "word character": Unicode letter, ideogram, digit, or underscore | \w-\w\w\w | 字-ま_۳ |
\w | .NET: "word character": Unicode letter, ideogram, digit, or connector | \w-\w\w\w | 字-ま‿۳ |
\s | Most engines: "whitespace character": space, tab, newline, carriage return, vertical tab | a\sb\sc | a b c |
\s | .NET, Python 3, JavaScript: "whitespace character": any Unicode separator | a\sb\sc | a b c |
\D | One character that is not a digit as defined by your engine's \d | \D\D\D | ABC |
\W | One character that is not a word character as defined by your engine's \w | \W\W\W\W\W | *-+=) |
\S | One character that is not a whitespace character as defined by your engine's \s | \S\S\S\S | Yoyo |
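The engine differences in the table above can be checked directly. The sketch below uses Python 3's re module, where \d and \w are Unicode-aware by default for text patterns and the re.ASCII flag switches them to the ASCII-only meaning. The helper name matches is just for this example, and the non-ASCII sample characters are written as escapes to avoid encoding problems.

```python
import re


def matches(pattern: str, text: str, flags: int = 0) -> bool:
    """Return True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text, flags) is not None


gurmukhi = "file_9\u0a69"                 # "9" followed by GURMUKHI DIGIT THREE
mixed = "\u5b57-\u307e_\u06f3"            # ideograph, hyphen, hiragana, underscore, Arabic-Indic digit

print(matches(r"file_\d\d", "file_25"))            # True
print(matches(r"file_\d\d", gurmukhi))             # True: \d is Unicode-aware by default
print(matches(r"file_\d\d", gurmukhi, re.ASCII))   # False: ASCII digits only
print(matches(r"\w-\w\w\w", "A-b_1"))              # True
print(matches(r"\w-\w\w\w", mixed))                # True
print(matches(r"\w-\w\w\w", mixed, re.ASCII))      # False

# \D, \W and \S are the complements of \d, \w and \s.
print(matches(r"\D\D\D", "ABC"))                   # True
print(matches(r"\W\W\W\W\W", "*-+=)"))             # True
print(matches(r"\S\S\S\S", "Yoyo"))                # True
```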
Quantifiers
Quantifier | Legend | Example | Sample Match |
+ | One or more | Version \w-\w+ | Version A-b1_1 |
{3} | Exactly three times | \D{3} | ABC |
{2,4} | Two to four times | \d{2,4} | 156 |
{3,} | Three or more times | \w{3,} | regex_tutorial |
* | Zero or more times | A*B*C* | AAACC |
? | Once or none | plurals? | plural |
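To see the quantifiers from this table in action, here is a small Python sketch that tests each example against its sample match. The helper name matches is an assumption for illustration; it uses re.fullmatch so that the quantifier has to account for the whole string.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"Version \w-\w+", "Version A-b1_1"))  # True: + means one or more
print(matches(r"\D{3}", "ABC"))                      # True: exactly three non-digits
print(matches(r"\d{2,4}", "156"))                    # True: between two and four digits
print(matches(r"\d{2,4}", "1"))                      # False: too few
print(matches(r"\d{2,4}", "12345"))                  # False: too many
print(matches(r"\w{3,}", "regex_tutorial"))          # True: three or more
print(matches(r"A*B*C*", "AAACC"))                   # True: B* matches zero times
print(matches(r"plurals?", "plural"))                # True: the final s is optional
print(matches(r"plurals?", "plurals"))               # True
```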
More Characters
Character | Legend | Example | Sample Match |
. | Any character except line break | a.c | abc |
. | Any character except line break | .* | whatever, man. |
\. | A period (special character: needs to be escaped by a \) | a\.c | a.c |
\ | Escapes a special character | \.\*\+\? \$\^\/\\ | .*+? $^/\ |
\ | Escapes a special character | \[\{\(\)\}\] | [{()}] |
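The difference between the dot and an escaped dot is easy to get wrong, so here is a short Python sketch of both. It also shows re.escape, which is one way to escape every special character in a literal string automatically; the variable names are made up for this example.

```python
import re

# The dot matches any character except a line break (by default).
dot_abc = re.fullmatch(r"a.c", "abc") is not None
dot_newline = re.fullmatch(r"a.c", "a\nc") is not None
dot_newline_dotall = re.fullmatch(r"a.c", "a\nc", re.DOTALL) is not None
print(dot_abc, dot_newline, dot_newline_dotall)   # True False True

# An escaped dot matches only a literal period.
escaped_abc = re.fullmatch(r"a\.c", "abc") is not None
escaped_period = re.fullmatch(r"a\.c", "a.c") is not None
print(escaped_abc, escaped_period)                # False True

# re.escape builds a pattern that matches a string literally.
literal = "1+1=2?"
unescaped_ok = re.fullmatch(literal, literal) is not None
escaped_ok = re.fullmatch(re.escape(literal), literal) is not None
print(unescaped_ok, escaped_ok)                   # False True
```

Without escaping, the + and ? in "1+1=2?" act as quantifiers, which is why the unescaped pattern does not match its own text.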
Logic
Logic | Legend | Example | Sample Match |
| | Alternation / OR operator | 22|33 | 33 |
( … ) | Capturing group | A(nt|pple) | Apple (captures "pple") |
\1 | Contents of Group 1 | r(\w)g\1x | regex |
\2 | Contents of Group 2 | (\d\d)\+(\d\d)=\2\+\1 | 12+65=65+12 |
(?: … ) | Non-capturing group | A(?:nt|pple) | Apple |
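The grouping and backreference rows above can be tried out with a few lines of Python. This is only a sketch using the table's own examples; the variable names are made up for illustration.

```python
import re

# Alternation: either branch may match.
alt = re.fullmatch(r"22|33", "33")
print(alt is not None)            # True

# A capturing group remembers the text it matched.
capturing = re.fullmatch(r"A(nt|pple)", "Apple")
print(capturing.group(1))         # pple

# \1 must repeat exactly what group 1 captured.
backref = re.fullmatch(r"r(\w)g\1x", "regex")
print(backref.group(1))           # e

# \2 and \1 refer to the second and first groups.
swapped = re.fullmatch(r"(\d\d)\+(\d\d)=\2\+\1", "12+65=65+12")
print(swapped.groups())           # ('12', '65')

# A non-capturing group groups without remembering anything.
non_capturing = re.fullmatch(r"A(?:nt|pple)", "Apple")
print(non_capturing.groups())     # ()
```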
More White-Space
Character | Legend | Example | Sample Match |
\t | Tab | T\t\w{2} | T ab |
\r | Carriage return character | see below | |
\n | Line feed character | see below | |
\r\n | Line separator on Windows | AB\r\nCD | AB followed by CD on the next line |
\N | Perl, PCRE (C, PHP, R…): one character that is not a line break | \N+ | ABC |
\h | Perl, PCRE (C, PHP, R…), Java: one horizontal whitespace character: tab or Unicode space separator | | |
\H | One character that is not a horizontal whitespace | | |
\v | .NET, JavaScript, Python, Ruby: vertical tab | | |
\v | Perl, PCRE (C, PHP, R…), Java: one vertical whitespace character: line feed, carriage return, vertical tab, form feed, paragraph or line separator | | |
\V | Perl, PCRE (C, PHP, R…), Java: any character that is not a vertical whitespace | | |
\R | Perl, PCRE (C, PHP, R…), Java: one line break (carriage return + line feed pair, and all the characters matched by \v) | | |
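Python's re module does not provide \h or \R, so the sketch below sticks to the escapes it does support (\t, \r, \n and \v as a plain vertical tab). The LINE_BREAK alternation is an assumption of this example: a simple stand-in for \R that covers only the three common line endings, not every character \R matches in Perl or PCRE.

```python
import re

# T, a tab, then two word characters.
tab_match = re.fullmatch(r"T\t\w{2}", "T\tab") is not None
print(tab_match)                                   # True

# A Windows line separator is the two-character sequence \r\n.
windows_match = re.fullmatch(r"AB\r\nCD", "AB\r\nCD") is not None
print(windows_match)                               # True

# In Python, \v matches only the vertical tab, not a line feed.
vtab_match = re.fullmatch(r"\v", "\x0b") is not None
vtab_newline = re.fullmatch(r"\v", "\n") is not None
print(vtab_match, vtab_newline)                    # True False

# Simplified stand-in for \R: try \r\n first so the pair stays together.
LINE_BREAK = re.compile(r"\r\n|\r|\n")
lines = LINE_BREAK.split("one\r\ntwo\nthree\rfour")
print(lines)                                       # ['one', 'two', 'three', 'four']
```

Putting \r\n first in the alternation matters: with \r first, a Windows line ending would be split into two separate breaks and produce an empty line.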
More Quantifiers
Quantifier | Legend | Example | Sample Match |
+ | The + (one or more) is "greedy" | \d+ | 12345 |
? | Makes quantifiers "lazy" | \d+? | 1 in 12345 |
* | The * (zero or more) is "greedy" | A* | AAA |
? | Makes quantifiers "lazy" | A*? | empty in AAA |
{2,4} | Two to four times, "greedy" | \w{2,4} | abcd |
? | Makes quantifiers "lazy" | \w{2,4}? | ab in abcd |
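The greedy and lazy rows above describe what a quantifier grabs when it has a choice. The following Python sketch reproduces the table's examples with re.search and adds one tag-matching case, where the difference is most visible in practice. The helper name first_match and the sample markup are made up for illustration.

```python
import re


def first_match(pattern: str, text: str) -> str:
    """Return the text of the first match of pattern in text."""
    return re.search(pattern, text).group()


print(repr(first_match(r"\d+", "12345")))       # '12345' (greedy)
print(repr(first_match(r"\d+?", "12345")))      # '1' (lazy)
print(repr(first_match(r"A*", "AAA")))          # 'AAA'
print(repr(first_match(r"A*?", "AAA")))         # '' (lazy * is happy with nothing)
print(repr(first_match(r"\w{2,4}", "abcd")))    # 'abcd'
print(repr(first_match(r"\w{2,4}?", "abcd")))   # 'ab'

# Greedy .+ runs to the last ">", lazy .+? stops at the first one.
greedy_tags = re.findall(r"<.+>", "<b>x</b>")
lazy_tags = re.findall(r"<.+?>", "<b>x</b>")
print(greedy_tags)                              # ['<b>x</b>']
print(lazy_tags)                                # ['<b>', '</b>']
```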
Character Classes
Character | Legend | Example | Sample Match |
[ … ] | One of the characters in the brackets | [AEIOU] | One uppercase vowel |
[ … ] | One of the characters in the brackets | T[ao]p | Tap or Top |
- | Range indicator | [a-z] | One lowercase letter |
[x-y] | One of the characters in the range from x to y | [A-Z]+ | GREAT |
[ … ] | One of the characters in the brackets | [AB1-5w-z] | One of either: A,B,1,2,3,4,5,w,x,y,z |
[x-y] | One of the characters in the range from x to y | [ -~]+ | Characters in the printable section of the ASCII table. |
[^x] | One character that is not x | [^a-z]{3} | A1! |
[^x-y] | One of the characters not in the range from x to y | [^ -~]+ | Characters that are not in the printable section of the ASCII table. |
[\d\D] | One character that is a digit or a non-digit | [\d\D]+ | Any characters, including line breaks, which the regular dot does not match |
[\x41] | Matches the character at hexadecimal position 41 in the ASCII table, i.e. A | [\x41-\x45]{3} | ABE |
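Here is a brief Python sketch that exercises the character classes from this table. The helper name matches and the sample strings other than the table's own are assumptions for illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"T[ao]p", "Tap"), matches(r"T[ao]p", "Tip"))   # True False
print(matches(r"[A-Z]+", "GREAT"))                            # True

# Several single characters and ranges can share one class.
allowed = [ch for ch in "AB12345wxyzC6" if matches(r"[AB1-5w-z]", ch)]
print("".join(allowed))                                       # AB12345wxyz

# Space through tilde covers the printable ASCII characters.
print(matches(r"[ -~]+", "Hello, world!"))                    # True
print(matches(r"[^ -~]+", "\u00e9\n"))                        # True: neither is printable ASCII
print(matches(r"[^a-z]{3}", "A1!"))                           # True

# [\d\D] matches line breaks too, unlike the default dot.
print(matches(r"[\d\D]+", "line1\nline2"))                    # True
print(matches(r".+", "line1\nline2"))                         # False

# \x41 to \x45 is the range A to E.
print(matches(r"[\x41-\x45]{3}", "ABE"))                      # True
```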
Anchors and Boundaries
Anchor | Legend | Example | Sample Match |
^ | Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means "not") | ^abc .* | abc (line start) |
$ | End of string or end of line depending on multiline mode. Many engine-dependent subtleties. | .*? the end$ | this is the end |
\A | Beginning of string | \Aabc[\d\D]* | abc (string... |
\z | Very end of the string | the end\z | this is...\n...the end |
\Z | End of string or (except Python) before final line break | the end\Z | this is...\n...the end\n |
\G | Beginning of string or end of previous match | | |
\b | Word boundary | Bob.*\bcat\b | Bob ate the cat |
\b | Word boundary | Bob.*\bкошка\b | Bob ate the кошка |
\B | Not a word boundary | c.*\Bcat\B.* | copycats |
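Anchors are where engines differ most, so the sketch below is specific to Python 3's re module: ^ and $ change meaning with re.MULTILINE, and Python's \Z matches only at the very end of the string, which is the exception noted in the table. The sample text and variable names are made up for illustration.

```python
import re

text = "first line\nabc and more\nthis is the end"

# Without MULTILINE, ^ matches only at the start of the string.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
print(starts_default)      # ['first']
print(starts_multiline)    # ['first', 'abc', 'this']

# $ also matches just before a final line break; Python's \Z does not.
dollar_ok = re.search(r"the end$", "this is the end\n") is not None
big_z_ok = re.search(r"the end\Z", "this is the end\n") is not None
print(dollar_ok, big_z_ok)  # True False

# \b requires a word boundary, \B requires its absence.
cat_word = re.search(r"Bob.*\bcat\b", "Bob ate the cat") is not None
cats_word = re.search(r"Bob.*\bcat\b", "Bob ate the cats") is not None
print(cat_word, cats_word)  # True False

inside = re.search(r"c.*\Bcat\B.*", "copycats").group()
alone = re.search(r"\Bcat\B", "cat")
print(inside, alone)        # copycats None
```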