Understanding Regex 101
Learn Regex and its basic methods while implementing brackets, flags, and quantifiers
Oct 16, 2019
Intro to Regular Expressions
As a software developer, you’ve probably encountered regular expressions several times and were confused when seeing this daunting set of characters grouped together like this:
[code class=”php”]/^\w+([.-]?\w)+@\w+([.]?\w)+(\.[a-zA-Z]{2,3})+$/[/code]
And you may have wondered what this means…
Regular expressions (Regex or RegExp) are extremely useful in stepping up your algorithm game and will make you a better problem solver. The structure of regular expressions can be intimidating at first, but it is very rewarding once you grasp the patterns and implement them in your work properly.
What is Regex and why is it important?
A Regex, or regular expression, is a type of object that is used to help you extract information from any string data by searching through text to find what you need. Whether it’s numbers, letters, punctuation, or even white space, Regex allows you to check and match any character combination in strings.
For example, let’s say you needed to match the format of a social security number or email address. You can utilize Regex to check for patterns in the text strings and use it to replace or validate another substring. Think of Regex as your own search bar — it gives you the freedom to define your own search criteria for a pattern that fits your needs and assists you in finding what you were looking for.
Two Ways To Create A Regular Expression
1. Regular Expression Literal: To create a regular expression literal, enclose the Regex pattern between forward slashes (/).
[code class=”php”]const regexLiteral = /helloworld/;[/code]
Syntax: /pattern/flags
2. Regular Expression Constructor: Calling the RegExp constructor with a string builds the expression at runtime.
[code class=”php”]
const greeting = 'hello';
const regexConstr = new RegExp(greeting);
[/code]
Syntax: new RegExp(pattern[, flags])
Rule of thumb: If your regular expression is constant and does not change its value, use the Regex literal, which is compiled when the script loads and so performs better. If the pattern is dynamic, for example built from a variable or user input, use the Regex constructor, which compiles the pattern at runtime (see examples above).
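As a minimal sketch of this rule of thumb, the snippet below builds the same pattern both ways. The variable names and sample strings are illustrative assumptions, not part of the original examples.

```javascript
// Hypothetical example: the search term is only known at runtime.
const searchTerm = "hello";

// Literal: the pattern is fixed in the source code.
const fixedRegex = /hello/i;

// Constructor: the pattern is built from a string; flags go in the second argument.
const dynamicRegex = new RegExp(searchTerm, "i");

console.log(fixedRegex.test("Hello World"));   // true
console.log(dynamicRegex.test("Hello World")); // true
console.log(dynamicRegex.source);              // "hello"
```

Both expressions behave the same here; the constructor only earns its keep when the pattern cannot be written out ahead of time.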
Regular Expression Methods
There are three common Regex methods that you should be familiar with: test, match, and replace.
RegExp.prototype.test()
The .test method returns a boolean: true if the string contains a match for the search pattern, and false if it does not.
[code class=”php”]
const str1 = "i love regex";
const str2 = "it is cool";
const hasRegex = /regex/;
hasRegex.test(str1);
// expected output: true
hasRegex.test(str2);
// expected output: false
[/code]
String.prototype.match()
The test method only tells you whether the pattern matched. When you need the matched text itself, use the match method instead. Without flags, it returns an array whose first element is the whole matched string, or null if nothing matches.
Here is a very basic example below. Later on you will see that when combined with flags, match becomes a powerful tool.
[code class=”php”]
const str = "I love JavaScript";
const result = str.match(/JavaScript/);
console.log(result)
// expected output: ['JavaScript'] (plus index, input, and groups properties)
[/code]
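The following sketch, reusing the sentence from the example above, shows two details that are easy to miss: the result array records where the match was found, and a failed match gives null rather than an empty array. The second pattern (/Python/) is an assumption added for illustration.

```javascript
const str = "I love JavaScript";

// Without the g flag, the result array also carries the position of the match.
const result = str.match(/JavaScript/);
console.log(result[0]);    // "JavaScript"
console.log(result.index); // 7

// When nothing matches, match returns null rather than an empty array.
const noMatch = str.match(/Python/);
console.log(noMatch);      // null
```

Because of the null case, it is usually safer to check the result before indexing into it.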
String.prototype.replace()
The .replace method searches a string for a specified value (or regular expression) and returns a new string where the specified value is replaced.
[code class=”php”]
const sentence = 'I love dogs more than cats.';
const regex = /dogs/;
console.log(sentence.replace(regex, 'bunnies'));
// expected output: "I love bunnies more than cats."
[/code]
NOTE: When the first argument to .replace is a plain string, only the first occurrence is replaced. To replace every occurrence, use a Regex with the g flag.
[code class=”php”]
const str = "Hello World World!";
const replacement = str.replace("World", "Planet");
console.log(replacement)
// expected output: "Hello Planet World!"
[/code]
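For comparison, here is a small sketch of the same replacement done with a Regex and the g (global) flag, which is covered in the Flags section. The variable names are illustrative.

```javascript
const str = "Hello World World!";

// A string pattern replaces only the first occurrence.
const firstOnly = str.replace("World", "Planet");
console.log(firstOnly); // "Hello Planet World!"

// A regex with the g flag replaces every occurrence.
const everyMatch = str.replace(/World/g, "Planet");
console.log(everyMatch); // "Hello Planet Planet!"
```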
Bracket Expressions
A bracket expression (also called a character set) lists the characters that are allowed at a single position in the match. You can list characters individually or as ranges.
For example, [code class=”php”]const regex=/[A-Z]/[/code]. Notice that A-Z is inside the square brackets, so this will match any single uppercase letter.
[code class=”php”]
[a-z] matches any single lowercase letter
[A-Z] matches any single uppercase letter
[abcd] matches any one of a, b, c, or d
[a-d] exactly the same as the previous example, so you can either list each character or give a range
[a-gA-C0-7] matches any one lowercase letter a-g, uppercase letter A-C, or digit 0-7
[^a-zA-Z] matches any single character that is NOT a lowercase or uppercase letter. (At the start of a character set, the ^ character negates the set.)
[/code]
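A quick way to check these sets is to run them through test. The sample strings below are assumptions chosen for illustration.

```javascript
// Each set matches ONE character; test() reports whether any position in the string matches.
const hasAToD = /[a-d]/;
const notALetter = /[^a-zA-Z]/;

console.log(hasAToD.test("xyz"));       // false: no a, b, c, or d
console.log(hasAToD.test("xyzc"));      // true: contains "c"
console.log(notALetter.test("abcDEF")); // false: only letters
console.log(notALetter.test("abc123")); // true: "1" is not a letter
```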
Flags
Flags go after the closing slash, and you can use a single flag or combine several. They change how the search is performed, for example whether it stops at the first match or ignores case.
[code class=”php”]
const sentence = 'The Cat in the Hat is not a cat.';
const regex = /[A-Z]/;
const found = sentence.match(regex);
console.log(found);
// expected output: ['T']
[/code]
Before we go into the specific flags, keep in mind that flags are optional. Without flags, match stops at the first part of the string that fits the pattern between the slashes and returns it in an array. So in this case, our code will return [code class=”php”]['T'][/code] because 'T' is the first uppercase letter in the sentence.
- The g flag stands for global: instead of stopping after the first match, the search returns ALL the occurrences that match. Using the example above, if we add the g flag after the closing slash, [code class=”php”]const regex = /[A-Z]/g[/code], match will check the sentence 'The Cat in the Hat is not a cat.' and return its three uppercase letters as [code class=”php”]['T', 'C', 'H'][/code].
- The i flag stands for case-insensitive search, which makes the entire expression ignore case. For instance, [code class=”php”]const regex = /[TheCatInTheHat]/ig[/code] combines the global and case-insensitive flags. The set contains the letters t, h, e, c, a, i, and n, so match returns every occurrence of those letters in either case (skipping 's' and 'o', which are not in the set): [code class=”php”]['T', 'h', 'e', 'C', 'a', 't', 'i', 'n', 't', 'h', 'e', 'H', 'a', 't', 'i', 'n', 't', 'a', 'c', 'a', 't'][/code].
- The m flag stands for multi-line, which allows ^ and $ to match the start and end of each line instead of only the start and end of the whole string. It has no effect on a pattern without those anchors: [code class=”php”]const regex = /[A-Z]/m[/code] still returns [code class=”php”]['T'][/code], the first uppercase letter, and [code class=”php”]const regex = /[a-z]/m[/code] still returns [code class=”php”]['h'][/code], the first lowercase letter.
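To see the m flag actually change a result, the pattern needs an anchor and the string needs more than one line. The two-line string below is an assumption made up for this sketch; the + used in the last pattern is covered in the Quantifiers section.

```javascript
const poem = "roses are red\nViolets are blue";

// Without m, ^ matches only at the very start of the string.
const noFlag = poem.match(/^[A-Z]/);
console.log(noFlag); // null, because the string starts with a lowercase "r"

// With m, ^ also matches at the start of the second line.
const withM = poem.match(/^[A-Z]/m);
console.log(withM[0]); // "V"

// g, i, and m combined: the first word of every line, in any case.
const firstWords = poem.match(/^[a-z]+/gim);
console.log(firstWords); // ["roses", "Violets"]
```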
As an additional side note, there are three shorthand character classes that can stand in for common character sets.
[code class=”php”]
const sentence = 'There are 350 dogs and 17 cats in the house.';
const regex = /\d/;
const found = sentence.match(regex);
console.log(found);
// expected output: ['3']
[/code]
- The \d character class matches any single digit. \d is the same as [0-9], so in the example above it finds the first digit in the sentence and returns an array of [code class=”php”]['3'][/code].
- The \w character class matches any single alphanumeric character or underscore. \w is the same as [a-zA-Z0-9_], so if the example above were changed to [code class=”php”]const regex = /\w/[/code], it would return [code class=”php”]['T'][/code], the first alphanumeric character.
- The \s character class matches a single whitespace character, including line breaks and tabs. To see it in action, change the example to [code class=”php”]const regex = /\w\s/[/code]: it returns [code class=”php”]['e '][/code], the first alphanumeric character that is followed by a whitespace character, together with that whitespace.
The negations of \d, \w, and \s are \D, \W, and \S:
- \D matches any non-digit character (same as [^0-9])
- \W matches any non-word character (same as [^a-zA-Z0-9_])
- \S matches any non-whitespace character
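The classes and their negations can be tried side by side on the same sentence. This is only a sketch; the variable names are illustrative.

```javascript
const sentence = "There are 350 dogs and 17 cats in the house.";

const digits = sentence.match(/\d/g);          // every digit
const wordThenSpace = sentence.match(/\w\s/);  // word character followed by whitespace
const firstNonDigit = sentence.match(/\D/);    // first character that is not a digit
const firstNonWord = sentence.match(/\W/);     // first character that is not a word character
const spaceThenNonSpace = sentence.match(/\s\S/);

console.log(digits);               // ["3", "5", "0", "1", "7"]
console.log(wordThenSpace[0]);     // "e " (the letter plus the space after it)
console.log(firstNonDigit[0]);     // "T"
console.log(firstNonWord[0]);      // " "
console.log(spaceThenNonSpace[0]); // " a"
```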
Quantifiers
Quantifiers are symbols that specify how many times the preceding item must match. The list below also includes the anchors ^ and $ and the wildcard ., which are not quantifiers but appear alongside them in most patterns.
- * matches previous item zero or more times
- + matches previous item one or more times
- ? matches previous item zero or one time; makes preceding item optional
- ^ matches the beginning of the string
- $ matches the end of the string
- . matches any single character (except line breaks)
- {m,n} matches previous item at least m and at most n times, where m is 0 or a positive integer and n is an integer equal to or greater than m. Write it without a space after the comma; in JavaScript, {m, n} with a space is not treated as a quantifier.
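Each symbol in the list above can be checked with a one-line test. The patterns and sample strings here are assumptions chosen to keep the sketch short.

```javascript
const zeroOrMore = /^ab*c$/.test("ac");      // * allows zero "b"
const oneOrMore = /^ab+c$/.test("ac");       // + needs at least one "b"
const optional = /^ab?c$/.test("abbc");      // ? allows at most one "b"
const anyChar = /^a.c$/.test("a-c");         // . matches any single character
const tooMany = /^\d{2,3}$/.test("1234");    // {2,3} allows only 2 or 3 digits
const inRange = /^\d{2,3}$/.test("123");

console.log(zeroOrMore); // true
console.log(oneOrMore);  // false
console.log(optional);   // false
console.log(anyChar);    // true
console.log(tooMany);    // false
console.log(inRange);    // true
```

The ^ and $ anchors matter in these patterns: without them, test would succeed as soon as any part of the string matched.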
Let’s go through this example to demonstrate our understanding of Regex.
[code class=”php”]
const str = 'for__if__rof__fi';
const regex = /[a-z]+/g;
const found = str.match(regex);
console.log(found);
[/code]
The regular expression checks for lowercase letters from a-z, and the + quantifier extends each match to a run of one or more consecutive letters. So when you console log found, it will return [code class=”php”]['for', 'if', 'rof', 'fi'][/code].
Let’s say that + symbol was not there and the Regex was only:
[code class=”php”]const regex = /[a-z]/g;[/code]
Then it will return [code class=”php”]['f', 'o', 'r', 'i', 'f', 'r', 'o', 'f', 'f', 'i'][/code].
Putting it all together
Remember the long string of characters we saw at the beginning of this article? It is a common pattern for checking the format of an email address. Now that we have learned the basic methods and terminology used in Regex, let’s break it down one step at a time.
[code class=”php”]
const email = 'student-id@alumni.school.edu';
const regex = /^\w+([.-]?\w)+@\w+([.]?\w)+(\.[a-zA-Z]{2,3})+$/;
const found = regex.test(email);
console.log(found);
// expected output: true
[/code]
- First, let’s take a look at this Regex piece by piece. At the beginning, we have [code class=”php”]^\w+[/code]. The ^ character anchors the match to the beginning of the string, and the \w character class checks for an alphanumeric or underscore character. The + quantifier repeats \w one or more times. From our example, this first piece covers the 'student' characters from the email: student-id@alumni.school.edu
- Next, the second piece of the Regex is [code class=”php”]([.-]?\w)+[/code]. The parentheses form the first capturing group. Inside it, a character set looks for either a "." or a "-" in our email. The ? quantifier matches the preceding item zero or one time, so at most one "-" or "." may appear before each \w character, and the + after the group repeats the whole group one or more times. Two of those characters cannot appear consecutively in a valid email. So this second piece represents the '-id' characters from the email example. If it was 'student--id@alumni.school.edu' with two hyphens, this would come out to be an invalid email.
- The third piece is [code class=”php”]@\w+[/code], which checks for the @ character followed by one or more \w characters. This covers the '@alumni' piece of the email.
- The following piece, [code class=”php”]([.]?\w)+[/code], is the same search pattern as our second piece except that it only allows the "." character before each alphanumeric character, excluding our "-" symbol. This represents '.school' in the email.
- The next piece, [code class=”php”](\.[a-zA-Z]{2,3})+[/code], is a crucial piece in checking an email format. This section is for the top-level domain (TLD) of an email address. It’s the part of a domain that comes after the dot, for example com, org, or net. This Regex matches a literal "." (escaped with a backslash) followed by a character set for any lowercase or uppercase letter. The {2,3} quantifier requires at least 2 and at most 3 of those letters, so each label can only be 2 or 3 letters long. The + after the group allows more than one such label in a row. In this case, it is '.edu'.
- Finally, we have the [code class=”php”]$[/code] character, which anchors the match to the end of the string so that nothing may follow the TLD.
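The breakdown above can be exercised with a few sample addresses. This is a sketch only: the helper name looksLikeEmail and every address other than the article's own example are assumptions made for illustration.

```javascript
// The pattern is the one broken down above; the helper name is hypothetical.
const emailRegex = /^\w+([.-]?\w)+@\w+([.]?\w)+(\.[a-zA-Z]{2,3})+$/;

function looksLikeEmail(candidate) {
  return emailRegex.test(candidate);
}

console.log(looksLikeEmail("student-id@alumni.school.edu"));  // true
console.log(looksLikeEmail("student--id@alumni.school.edu")); // false: two hyphens in a row
console.log(looksLikeEmail("student-id.alumni.school.edu"));  // false: no @ character
console.log(looksLikeEmail("user@example.info"));             // false: TLD longer than 3 letters
```

The last case shows a limit of this basic pattern: because of {2,3}, it rejects addresses whose top-level domain has more than three letters.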
And that’s it! Now we know how to use Regex for a basic email validation. Additionally, you can implement brackets, flags, and/or quantifiers in your Regex to accommodate for other edge cases not considered in our Regex string.
Conclusion
It can be very beneficial for developers to gain knowledge in Regex. As seen above, Regex is commonly used where input needs to be validated. It can also be used when developers need to match URLs or parse through some text and extract certain information, such as a date in yyyy-mm-dd format. Regex is everywhere!
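As one possible sketch of the date use case, the capturing groups introduced earlier can pull the year, month, and day out of a longer string. The sample text and variable names are made up for illustration, and the pattern checks only the shape of the date, not whether it is a real calendar date.

```javascript
// Hypothetical input text containing a yyyy-mm-dd date.
const logLine = "Report generated on 2019-10-16 by the nightly job.";

// Three capturing groups: four digits, two digits, two digits.
const dateRegex = /(\d{4})-(\d{2})-(\d{2})/;
const dateMatch = logLine.match(dateRegex);

// match returns null when nothing is found, so fall back to an empty array.
const [whole, year, month, day] = dateMatch || [];

console.log(whole); // "2019-10-16"
console.log(year);  // "2019"
console.log(month); // "10"
console.log(day);   // "16"
```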
People can easily excuse themselves from knowing Regex because it seems difficult to understand. But it doesn’t have to be. You can see it as a gradual curve and start from the basics today.
Helpful Resources
- Regex Cheatsheet
- Regular Expression Editor
- Regex Crossword Puzzle
- MDN Documentation for Regex
- RegEx Golf
Thanks for reading and I hope you all feel more comfortable using Regex in your algorithms!