Regex Tutorial with Examples

This tutorial covers the core concepts of regular expressions (regex) with hands-on examples. It also shows how to use regex in R and Python.

Introduction

Regex is short for 'regular expression'. It is mainly used to extract a substring from a string by searching for a specific pattern. The search pattern is defined by the regular expression.

The search pattern can be a single letter, a fixed string or a complex pattern made up of digits, punctuation and letters.
Regular expressions can be used to search and replace text.


Uses of Regular expression

There are several real-world use cases for regular expressions. Some of them are as follows -
  1. Fetch email addresses mentioned in the long paragraph
  2. Validate 10-digit phone number, Social Security Number and email address
  3. Extract text from HTML or XML code
  4. Rename multiple files at a single run
  5. Remove punctuation specified in the text
  6. Web scraping : Searching specific content from all the web pages that contain a specific string
  7. Replace complex pattern with blank or specific character


Let's start with the basics

1. Anchors and Word Boundaries

Symbol    Description
^         Beginning of line
$         End of line
\b        Word boundary (used on both sides of a word to match it as a whole word)

Examples

1. ^abc matches abc at the beginning of the text 'abcd'

2. ^the matches the at the start of the text 'the beginning'

3. done$ matches done at the end of the text 'I am done'

4. \ban\b matches the whole word an in the text 'Elephant an animal'
\ban\b does not match the an inside Elephant or animal because it only matches an as a whole word.
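
The anchor and word boundary examples above can be tried in Python with the built-in re module. This is a minimal sketch; the variable names are only illustrative.

```python
import re

# ^ anchors the match at the start of the string.
starts_with_abc = re.search(r"^abc", "abcd")

# $ anchors the match at the end of the string.
ends_with_done = re.search(r"done$", "I am done")

# \b marks a word boundary, so only the standalone word "an" is found,
# not the "an" inside "Elephant" or "animal".
whole_words = re.findall(r"\ban\b", "Elephant an animal")

print(starts_with_abc.group())  # abc
print(ends_with_done.group())   # done
print(whole_words)              # ['an']
```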

2. OR Condition

An OR condition can be defined with the symbol | inside a group ( ) or with a character class [ ]. See the examples below.

1. the(m|n) matches them or then in the text 'them then there theme' (including the them inside theme)

2. the[mn] is equivalent to the(m|n). Note that | has no special meaning inside [ ], so the[m|n] would also match the literal text the|

3. \bthe[mn]\b matches only the complete words them or then in the text 'them then there theme'
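
Here is a short Python sketch of the OR condition. It assumes the built-in re module and uses a non-capturing group (?:...) so that re.findall returns the whole match rather than only the group. The last line shows that | inside square brackets is treated as a literal character.

```python
import re

text = "them then there theme"

# A character class [mn] and a group (m|n) both mean "m or n".
with_class = re.findall(r"the[mn]", text)
with_group = re.findall(r"the(?:m|n)", text)

# Adding \b on both sides keeps only the complete words.
whole_words = re.findall(r"\bthe[mn]\b", text)

# Inside [ ], the | is an ordinary character, so "the|" also matches.
pipe_in_class = re.findall(r"the[m|n]", "the| them")

print(with_class)     # ['them', 'then', 'them']
print(with_group)     # ['them', 'then', 'them']
print(whole_words)    # ['them', 'then']
print(pipe_in_class)  # ['the|', 'them']
```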

3. Case Insensitive

The search patterns in all of the above examples are case-sensitive. To make a pattern case-insensitive, put the expression (?i) at the start of it.

1. (?i)abc matches both abc and ABC in the text 'abc ABC'

2. (?i)a[bd]a performs a case-insensitive match of a followed by either b or d and then a. In the text 'abc ABA Ada' it matches ABA and Ada
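
In Python the same effect can be had either with the inline (?i) expression or with the re.IGNORECASE flag. A minimal sketch, with illustrative variable names:

```python
import re

# (?i) at the start of the pattern turns on case-insensitive matching.
inline_abc = re.findall(r"(?i)abc", "abc ABC")
inline_aba = re.findall(r"(?i)a[bd]a", "abc ABA Ada")

# re.IGNORECASE is the flag form of the same option.
flag_abc = re.findall(r"abc", "abc ABC", flags=re.IGNORECASE)

print(inline_abc)  # ['abc', 'ABC']
print(inline_aba)  # ['ABA', 'Ada']
print(flag_abc)    # ['abc', 'ABC']
```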

4. Quantifiers

Quantifiers specify quantity. In simple words, they control how many times a particular regex element can occur.

Expression    Description
*             Item occurs zero or more times
+             Item occurs one or more times
?             Item occurs zero or one time
{A}           Item occurs exactly A times
{A,}          Item occurs A or more times
{A,B}         Item occurs between A and B times
.             Any single character (except a line break, by default)
.*            Zero or more of any character

1. def* matches de followed by f zero or more times. Example - de def deff defff

2. def+ matches de followed by f at least one time. Example - def deff defff

3. \bdef?\b matches the whole word de followed by f zero or one time. Example - de def

4. \bdef{2}\b matches the whole word de followed by f exactly two times. Example - deff

5. \bdef{2,}\b matches the whole word de followed by f two or more times. Example - deff defff

6. \bdef{3,4}\b matches the whole word de followed by f either 3 or 4 times. Example - defff deffff

7. a.* matches a and all characters after it up to the end of the line
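
The quantifier examples can be checked side by side in Python. In this sketch, whole_word_matches is a hypothetical helper (not part of the re module) that wraps each pattern in \b...\b and lists every whole-word match in a sample text.

```python
import re

text = "de def deff defff deffff"

def whole_word_matches(pattern):
    # Hypothetical helper: wrap the pattern in word boundaries and
    # return every match found in the sample text.
    return re.findall(r"\b" + pattern + r"\b", text)

print(whole_word_matches(r"def*"))      # ['de', 'def', 'deff', 'defff', 'deffff']
print(whole_word_matches(r"def+"))      # ['def', 'deff', 'defff', 'deffff']
print(whole_word_matches(r"def?"))      # ['de', 'def']
print(whole_word_matches(r"def{2}"))    # ['deff']
print(whole_word_matches(r"def{2,}"))   # ['deff', 'defff', 'deffff']
print(whole_word_matches(r"def{3,4}"))  # ['defff', 'deffff']
```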

5. Create Grouping

By putting part of a regular expression inside ( ), you create a group. A group lets you apply an OR condition to only a portion of the regex, or apply a quantifier to the entire group.

It also helps to extract a portion of information from strings.

ab(cd|de)* matches ab followed by either cd or de zero or more times. Example - ab abcd abde abcdde
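
A small Python sketch of both uses of a group: repeating an OR condition, and extracting part of a match. The sample string 'pages 10-25' is made up for illustration.

```python
import re

pattern = re.compile(r"ab(cd|de)*")

# fullmatch succeeds only if the whole string fits the pattern.
only_ab = bool(pattern.fullmatch("ab"))       # group repeats zero times
cd_then_de = bool(pattern.fullmatch("abcdde"))  # "cd" followed by "de"
bad_tail = bool(pattern.fullmatch("abce"))    # "ce" is neither "cd" nor "de"

print(only_ab, cd_then_de, bad_tail)  # True True False

# A group also captures text so that it can be pulled out afterwards.
match = re.search(r"([0-9]+)-([0-9]+)", "pages 10-25")
print(match.group(1), match.group(2))  # 10 25
```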

6. Back Reference

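A back reference such as \1 matches the same text that the first group ( ) matched earlier in the pattern.

(name)\1 matches the text 'namename', that is name followed by a repeat of what the group matched.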
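
A back reference can be sketched in Python as follows. The doubled-word search is an extra illustration, not part of the original example, and its sample sentence is made up.

```python
import re

# \1 must repeat exactly what group 1 matched.
repeated = bool(re.search(r"(name)\1", "namename"))
single = bool(re.search(r"(name)\1", "name"))
print(repeated, single)  # True False

# The same idea finds accidentally doubled words.
# findall returns the text captured by the group.
doubled = re.findall(r"\b([a-z]+) \1\b", "this is is a test test")
print(doubled)  # ['is', 'test']
```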

Replace (Substitution) using Back-reference

(ab|cd)e(fg|hi) matches either ab or cd, followed by e, followed by either fg or hi
Enter \1\2 in the substitution and each match is replaced with the values of the first and second groups. Example - abefg becomes abfg
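
In Python, substitution with back references can be sketched with re.sub. The input strings here are illustrative.

```python
import re

pattern = r"(ab|cd)e(fg|hi)"

# \1 and \2 in the replacement stand for the first and second groups,
# so the "e" between them is dropped.
joined = re.sub(pattern, r"\1\2", "abefg cdehi")
print(joined)  # abfg cdhi

# The groups can also be written back in a different order.
swapped = re.sub(pattern, r"\2e\1", "abefg")
print(swapped)  # fgeab
```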

7. Lazy Quantifier

A lazy (or non-greedy) quantifier matches a regex element as few times as possible, whereas a greedy quantifier matches it as many times as possible.
You can convert a greedy quantifier into a lazy quantifier by simply adding a ? after it.

<.*?> matches < followed by as few characters as possible up to the next >. In the text '<b>bold</b>' it matches <b> and </b> separately, while the greedy <.*> matches the whole text.
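
The difference between greedy and lazy matching is easiest to see on a string with several tags. A minimal Python sketch with a made-up HTML snippet:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last > in the string.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first > it can reach.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```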


8. How to match a literal dot or asterisk

Putting a backslash \ before a dot or an asterisk escapes it. In other words, it makes the regex treat the character literally instead of as a special symbol.
abc\* matches the literal text abc*, whereas abc* matches ab followed by zero or more c (ab, abc, abcc)
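
A short Python sketch of escaping. It also uses re.escape, a function in Python's re module that adds the backslashes for you. The sample strings are made up.

```python
import re

text = "abc* abcc ab"

# Unescaped: * is a quantifier, so this is "ab" followed by zero or more "c".
unescaped_star = re.findall(r"abc*", text)

# Escaped: \* is a literal asterisk.
escaped_star = re.findall(r"abc\*", text)

print(unescaped_star)  # ['abc', 'abcc', 'ab']
print(escaped_star)    # ['abc*']

# An unescaped dot matches any character; re.escape makes it literal.
loose_dot = re.findall(r"1.5", "1.5 115")
literal_dot = re.findall(re.escape("1.5"), "1.5 115")

print(loose_dot)    # ['1.5', '115']
print(literal_dot)  # ['1.5']
```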

9. POSIX Regular Expressions

POSIX character classes are written inside square brackets. Like the other expressions above, they match letters, digits, punctuation and more.

POSIX        Description                                        ASCII
[:digit:]    Digits                                             [0-9]
[:lower:]    Lowercase letters                                  [a-z]
[:upper:]    Uppercase letters                                  [A-Z]
[:alpha:]    Lower and uppercase letters                        [a-zA-Z]
[:alnum:]    Lower and uppercase letters and digits             [a-zA-Z0-9]
[:blank:]    Space and tab                                      [ \t]
[:space:]    All whitespace characters, including line breaks   [ \t\r\n\v\f]
[:punct:]    Punctuation                                        [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]

Select strings in which a letter is followed by one or more digits
[[:alpha:]][[:digit:]]+
  1. [[:alpha:]] means any letter
  2. [[:digit:]] means any digit
  3. + means the previous item occurs one or more times
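
Python's re module does not recognise POSIX classes such as [[:alpha:]], so a Python version has to use the ASCII equivalents from the table. The sketch below does that, and uses string.punctuation from the standard library as a stand-in for [:punct:]. The sample strings are illustrative.

```python
import re
import string

text = "sample text B2 testing B52"

# [a-zA-Z] stands in for [[:alpha:]] and [0-9] for [[:digit:]].
letter_then_digits = re.findall(r"[a-zA-Z][0-9]+", text)
print(letter_then_digits)  # ['B2', 'B52']

# string.punctuation holds the ASCII punctuation characters;
# re.escape makes them safe to place inside a character class.
punct_class = "[" + re.escape(string.punctuation) + "]"
no_punct = re.sub(punct_class, "", "Hello, world!")
print(no_punct)  # Hello world
```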

How to use regex with R and Python

R

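1. grep(pattern, x)
Search for a particular pattern in each element of a vector x and return the positions of the matching elements

2. gsub(pattern, replacement, x)
Replace every occurrence of a particular pattern in each element of a vector x
x = "sample text B2 testing B52"
# Remove every letter that is followed by one or more digits (B2 and B52)
gsub('[[:alpha:]][[:digit:]]+', '',x)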

Python

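The built-in module re can be used for regular expressions in Python.

1. re.search(pattern, x)
Search the string x for the first location where the pattern matches. It returns a match object, or None if there is no match

2. re.sub(pattern, replacement, x)
Replace every occurrence of a particular pattern in the string x
import re
x = 'Welcome to Python3.6'
# [0-9.] is a character class of digits and the dot; | is not needed inside [ ]
re.sub(r'[a-zA-Z]+[0-9.]+', '', x)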
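
The two functions can be seen together in a slightly fuller sketch. It uses the same sample string and shows what re.search returns when the pattern is and is not found.

```python
import re

x = "Welcome to Python3.6"
pattern = r"[a-zA-Z]+[0-9.]+"

# re.search returns a match object for the first match, or None.
match = re.search(pattern, x)
if match:
    print(match.group())  # Python3.6

missing = re.search(r"xyz", x)
print(missing)  # None

# re.sub replaces every match; here each match is replaced with nothing.
cleaned = re.sub(pattern, "", x)
print(repr(cleaned))  # 'Welcome to '
```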

Exercises : Regular Expression

1. Replace the abbreviation of thousand (K) with 000

x = "K 25K 2K"
Desired Output : K 25000 2000

Solution
gsub('([0-9])K', '\\1000',x)

In an R string a backslash must itself be escaped, so the back reference \1 is written as \\1.
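
The same exercise can be sketched in Python. One difference to be aware of: in a Python replacement string, \1000 would not be read as group 1 followed by 000, so the unambiguous form \g<1> is used instead.

```python
import re

x = "K 25K 2K"

# ([0-9])K finds a K that directly follows a digit and captures the digit.
# \g<1> writes the captured digit back, followed by the literal 000.
result = re.sub(r"([0-9])K", r"\g<1>000", x)

print(result)  # K 25000 2000
```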

2. Remove extra characters

x = "var1_avg_a1 var1_a_avg_7"
Desired Output : var1 var1_a

Solution
gsub('_avg_.*?[0-9]', '',x)

The ? makes the quantifier .* non-greedy (lazy), so each match stops at the first digit after _avg_.
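
A Python sketch of the same exercise, with the greedy version alongside to show why the lazy quantifier matters here:

```python
import re

x = "var1_avg_a1 var1_a_avg_7"

# Lazy: each match ends at the first digit after _avg_.
lazy_result = re.sub(r"_avg_.*?[0-9]", "", x)
print(lazy_result)  # var1 var1_a

# Greedy: .* runs on to the last digit in the string,
# so everything after the first var1 is removed.
greedy_result = re.sub(r"_avg_.*[0-9]", "", x)
print(greedy_result)  # var1
```
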
About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like banking, Telecom, HR and Health Insurance.

1 Response to "Regex Tutorial with Examples"
  1. Interesting article.
    In addition, the caret '^' symbol comes under Negated character class. When caret is used after opening square bracket '[', it results in matching any character except the characters inside square brackets.
    For Example: /^a[^b]a$/ will match any 3 characters starting and ending with 'a' except 'aba'.
    Matches : aca, a a, a_a, a1a and so on..
