RegExp
A regular expression is a special string that describes a pattern to be used for matching or searching within other strings. Regular expressions are also known as regex or regexp, and in JavaScript we refer to RegExp when we mean the built-in Object type for creating and working with regular expressions.
You can think of regular expressions as a kind of mini programming language separate from JavaScript. They are not unique to JavaScript, and learning how to write and use them will be helpful in many other programming languages.
Even if you're not familiar with regular expression syntax (it takes some time to master), you've probably encountered similar ideas with wildcards. Consider the following Unix command:
ls *.txt
Here we ask for a listing of all files whose filename ends with the extension .txt. The * has a special meaning: any character, and any number of characters. Both a.txt and file123.txt would be matched against this pattern, since both end with .txt.
Regular expressions take the idea of defining patterns using characters like *, and extend it into a more powerful pattern matching language. Here's an example of a regular expression that could be used to match both common spellings of the word "colour" and "color":
colou?r
The ? means that the preceding character u is optional (it may or may not be there).
Here's another example regular expression that could be used to match a string that starts with id- followed by 1, 2, or 3 digits (id-1, id-12, or id-999):
id-\d{1,3}
The \d means a digit (0-9) and the {1,3} portion means at least one, and at most three. Together we get at least one digit, and at most three digits.
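As a quick check, here is a minimal sketch that tries both patterns in JavaScript using RegExp.test(), which is covered later in this section. The sample strings are made up for illustration. Note that these patterns have no positional anchors, so they only need to match somewhere inside the string, which is why 'id-1234' still counts as a match.

```javascript
// Illustrative sample strings only; they are not taken from any real data set.
const spelling = /colou?r/;
const id = /id-\d{1,3}/;

// 'colr' fails because the second "o" is required; only the "u" is optional.
const spellingResults = ['color', 'colour', 'colr'].map((s) => spelling.test(s));

// 'id-' fails (no digit); 'id-1234' passes because it contains 'id-123'.
const idResults = ['id-1', 'id-12', 'id-999', 'id-', 'id-1234'].map((s) => id.test(s));

console.log(spellingResults); // [ true, true, false ]
console.log(idResults); // [ true, true, true, false, true ]
```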
There are many special characters to learn with regular expressions, which we'll slowly introduce.
Declaring JavaScript RegExp
Like String or Array, we can declare a RegExp using either a literal or the RegExp constructor:
let regex = /colou?r/; // regex literal uses /.../
let regex2 = new RegExp('colou?r');
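One practical difference between the two forms is worth a short sketch: when the pattern is passed to the constructor as a string, every backslash has to be doubled, because the string literal consumes one level of escaping before RegExp sees it. The constructor is mostly useful when part of the pattern is only known at run time; the prefix variable below is a hypothetical example of that.

```javascript
// The same pattern written both ways.
const literal = /id-\d{1,3}/;
const built = new RegExp('id-\\d{1,3}'); // '\\d' in the string becomes \d in the pattern

console.log(literal.source === built.source); // true

// Hypothetical run-time value used to build a pattern dynamically.
const prefix = 'id';
const dynamic = new RegExp(prefix + '-\\d{1,3}');

console.log(dynamic.test('id-42')); // true
```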
Regular expressions can also have search flags, which indicate how the search is supposed to be performed. These flags include g (globally match all occurrences vs. only matching once), i (ignore case when matching), and m (multi-line matching, where ^ and $ match at the start and end of each line instead of only at the start and end of the whole string), among others.
let regex = /pattern/gi; // find all matches (global) and ignore case
let regex2 = new RegExp('pattern', 'gi'); // same thing using the constructor instead
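To see what each flag changes, the sketch below runs the same search with different flag combinations, using String.match() (described later in this section). The three-line sample text is invented for illustration.

```javascript
// Invented sample text containing three lines.
const text = 'Cat\ncat\nCAT';

// No flags: case-sensitive, and only the first match is reported.
const firstOnly = text.match(/cat/);

// g + i: every occurrence, regardless of case.
const allIgnoreCase = text.match(/cat/gi);

// Without m, ^ only matches at the very start of the whole string.
const withoutM = text.match(/^cat/gi);

// With m, ^ also matches at the start of each line.
const withM = text.match(/^cat/gim);

console.log(firstOnly[0]); // 'cat'
console.log(allIgnoreCase); // [ 'Cat', 'cat', 'CAT' ]
console.log(withoutM); // [ 'Cat' ]
console.log(withM); // [ 'Cat', 'cat', 'CAT' ]
```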
Understanding Regular Expression Patterns
Regular expressions are dense, and often easier to write than to read. It helps to use tools as you experiment with patterns and try to understand and debug your own regular expressions.
Matching Specific Characters
- The characters \ ^ $ . * + ? ( ) [ ] { } | all have special meaning, and if you need to match them, you have to escape them with a leading \. For example: \$ to match a $.
- Any other character will match itself. abc is a valid regular expression and means match the letters abc.
- The . means any character. For example a. would match ab, a3, or a". If you need to match the . itself, make sure you escape it: .\. means any character followed by a period.
- We specify a set of possible characters using []. For example, if we wanted to match any vowel, we might do [aeiou]. This says match any of the letters a, e, i, o, or u and would match a but not t. We can also do the opposite, and define a negated set: [^aeiou] would match anything that is not a vowel. With regular expressions, it can often be easier to define your patterns in terms of what they are not instead of what they are, since so many things are valid vs. a limited set of things that are not. We can also specify a range: [a-d] would match any of a, b, c, d but not f, g or h.
- Some sets are so common that we have shorthand notation. Consider the set of single digit numbers, [0123456789]. We can instead use \d which means the same thing. The inverse is \D (capital D), and means [^0123456789] (i.e., not one of the digits). If we wanted to match a number with three digits, we could use \d\d\d, which would match 123 or 678 or 000.
- Another commonly needed pattern is any letter or number and is available with \w, meaning [A-Za-z0-9_] (all upper- and lower-case letters, digits 0 to 9, and the underscore). The inverse is available as \W and means [^A-Za-z0-9_] (everything not in the set of letters, numbers and underscore).
- Often we need to match blank whitespace (spaces, tabs, newlines, etc.). We can do that with \s, and the inverse \S (anything not a whitespace). For example, suppose we wanted to allow users to enter an id number with or without a space: \d\d\d\s?\d\d\d would match both 123456 and 123 456.
- There are lots of other examples of pre-defined common patterns, such as \n (newline), \r (carriage return), \t (tab). Consult the MDN documentation for character classes to look up others.
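The character-matching rules above can be tried out directly in JavaScript. The following is a small sketch using RegExp.test() and String.match() with the g flag; all of the input strings are made up for illustration.

```javascript
// An escaped special character matches itself literally.
const hasDollar = /\$/.test('Total: $5');

// . matches any single character; \. matches only a period.
const anyAfterA = ['ab', 'a3', 'a'].map((s) => /a./.test(s));
const literalDot = ['a.b', 'ab'].map((s) => /a\.b/.test(s));

// Sets, negated sets, and ranges.
const vowels = 'regex'.match(/[aeiou]/g);
const nonVowels = 'regex'.match(/[^aeiou]/g);
const aToD = 'abcdefgh'.match(/[a-d]/g);

// Shorthand classes: \d (digit), \W (non-word), \s (whitespace).
const digits = 'id-42'.match(/\d/g);
const nonWord = 'a_1!'.match(/\W/g);
const idOk = ['123456', '123 456', '123-456'].map((s) => /\d\d\d\s?\d\d\d/.test(s));

console.log(hasDollar); // true
console.log(anyAfterA); // [ true, true, false ]
console.log(literalDot); // [ true, false ]
console.log(vowels, nonVowels); // [ 'e', 'e' ] [ 'r', 'g', 'x' ]
console.log(aToD); // [ 'a', 'b', 'c', 'd' ]
console.log(digits, nonWord); // [ '4', '2' ] [ '!' ]
console.log(idOk); // [ true, true, false ]
```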
Define Character Matching Repetition
In addition to matching a single character or character class, we can also match sequences of them, and define how many times a pattern or match can/must occur. We do this by adding extra information after our match pattern.
- ? is used to indicate that we want to match something once or none. For example, if we want to match the word dog without an s, but also to allow dogs (with an s), we can do dogs?. The ? follows the pattern (i.e., s) that it modifies, and indicates that it is optional.
- * is used when we want to match zero or more of something. number \d* would match "number " (no digits), "number 1" (one digit), and "number 1234534123451334466600".
- + is similar to * but means one or more. vroo+m would match "vroom" but also "vroooooooom" and "vroooooooooooooooooooooooooooooooom".
- We can limit the number of matches to an exact number using {n}, which means match exactly n times. vroo{3}m would only match "vroooom". We can further specify that we want a match to happen n or more times using {n,}, or use {n,m} to indicate we want to match at least n times and no more than m times: \w{8,16} would match 8 to 16 word characters, such as "ABCD1234" or "zA5YncUI24T_3GHO".
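Here is a short sketch of each repetition operator from the list above, again using RegExp.test() and String.match(). The test strings are illustrative, and a few deliberately fail so that the boundaries of each operator are visible.

```javascript
// ? : zero or one of the preceding item.
const optionalS = ['dog', 'dogs', 'dig'].map((s) => /dogs?/.test(s));

// * : zero or more. 'number' fails only because the trailing space is missing.
const zeroOrMore = ['number ', 'number 1', 'number'].map((s) => /number \d*/.test(s));

// + : one or more. 'vrom' has only one "o", but vroo+m needs at least two.
const oneOrMore = ['vrom', 'vroom', 'vroooooooom'].map((s) => /vroo+m/.test(s));

// {n} : exactly n. vroo{3}m is "vro" followed by exactly three more "o", then "m".
const exactly = ['vroom', 'vroooom', 'vrooooom'].map((s) => /vroo{3}m/.test(s));

// {n,} : n or more.
const atLeast = 'a aa aaa'.match(/a{2,}/g);

// {n,m} : between n and m.
const ranged = ['ABCD1234', 'zA5YncUI24T_3GHO', 'short'].map((s) => /\w{8,16}/.test(s));

console.log(optionalS); // [ true, true, false ]
console.log(zeroOrMore); // [ true, true, false ]
console.log(oneOrMore); // [ false, true, true ]
console.log(exactly); // [ false, true, false ]
console.log(atLeast); // [ 'aa', 'aaa' ]
console.log(ranged); // [ true, true, false ]
```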
Define Positional Match Parameters or Alternatives
Normally the patterns we define are used to look anywhere within a string. However, sometimes it's important to specify where in the string a match is located. For example, we might care that an id number begins with some sequence of letters, or that a name doesn't end with some set of characters.
- ^ means start looking for the match at the beginning of the input string. We could test to see that a string begins with a capital letter like so: ^[A-Z].
- Similarly $ means make sure that the match ends the string. If we wanted to test that a string was a filename that ended with a period and a three letter extension, we could use: \.\w{3}$ (an escaped period, followed by exactly 3 word characters, followed by the end of the string). This would match "filename.txt" but not "filename.txt is a path".
- Sometimes we need to specify one of a number of possible alternatives. We do this with |, as in red|green|blue which would match any of the strings "red", "green", or "blue".
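The sketch below exercises the positional markers and alternation described above. The last example combines ^ and $ around the earlier id pattern to show one way of requiring that the entire string match, not just part of it; that combination is an illustration added here, and the sample inputs are invented.

```javascript
// ^ : the match must begin at the start of the string.
const startsWithCapital = ['Hello', 'hello'].map((s) => /^[A-Z]/.test(s));

// $ : the match must end at the end of the string.
const hasExtension = ['filename.txt', 'filename.txt is a path'].map((s) => /\.\w{3}$/.test(s));

// | : any one of several alternatives.
const isColour = ['red', 'green', 'blue', 'pink'].map((s) => /red|green|blue/.test(s));

// ^ and $ together: nothing is allowed before or after the pattern.
const wholeId = ['id-123', 'id-1234', 'my id-12'].map((s) => /^id-\d{1,3}$/.test(s));

console.log(startsWithCapital); // [ true, false ]
console.log(hasExtension); // [ true, false ]
console.log(isColour); // [ true, true, true, false ]
console.log(wholeId); // [ true, false, false ]
```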
Using RegExp with Strings
So far we've discussed how to declare a RegExp, and also some of the basics of defining search patterns.
Now we need to look at the different ways to use our regular expression objects to perform matches.
- RegExp.test(string) - used to test whether or not the given string matches the pattern described by the regular expression. If a match is made, returns true, otherwise false. /id-\d\d\d/.test('id-123') returns true, /id-\d\d\d/.test('id-13b') returns false.
- String.match(regexp) - used to find matches of the given RegExp in the source String. With the g flag, all matches are returned as an Array of Strings. For example, 'This sentence has 2 numbers in it, including the number 567'.match(/\d+/g) will return the Array ['2', '567'] (notice the use of the g flag to find all matches globally).
- String.replace(regexp, replacement) - used to find matches for the given RegExp (all of them when the g flag is used, otherwise only the first), and returns a new String with those matches replaced by the replacement String provided. For example, '50 , 60,75.'.replace(/\s*,\s*/g, ', ') would return '50, 60, 75.' with all whitespace normalized around the commas.
- String.split(RegExp) - used to break the given String into an Array of sub-strings, dividing them on the RegExp pattern. For example, 'one-two--three---four----five-----six'.split(/-+/) would return ['one', 'two', 'three', 'four', 'five', 'six'], with elements split on any number of dashes.
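The four methods above can be run together as follows. This sketch reuses the examples from the list and adds two extra cases for illustration: what match() returns when nothing matches, and what replace() does without the g flag.

```javascript
// RegExp.test(): true or false.
const isId = /id-\d\d\d/.test('id-123');
const isNotId = /id-\d\d\d/.test('id-13b');

// String.match() with g: an Array of all matches, or null when there are none.
const numbers = 'This sentence has 2 numbers in it, including the number 567'.match(/\d+/g);
const noNumbers = 'no digits here'.match(/\d+/g);

// String.replace(): with g every match is replaced, without it only the first.
const normalized = '50 , 60,75.'.replace(/\s*,\s*/g, ', ');
const firstOnly = '50 , 60,75.'.replace(/\s*,\s*/, ', ');

// String.split(): divide the string wherever the pattern matches.
const parts = 'one-two--three---four----five-----six'.split(/-+/);

console.log(isId, isNotId); // true false
console.log(numbers); // [ '2', '567' ]
console.log(noNumbers); // null
console.log(normalized); // '50, 60, 75.'
console.log(firstOnly); // '50, 60,75.'
console.log(parts); // [ 'one', 'two', 'three', 'four', 'five', 'six' ]
```

Because match() returns null when there is no match, it is safer to check the result before reading its length or indexing into it.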
There are other methods you can call, and more advanced ways to extract data using RegExp, and you are encouraged to dig deeper into these concepts over time. Thinking about matching in terms of regular expressions takes practice, and often involves inverting your logic to narrow a set of possibilities into something you can define in code.