Regex Tutorial with Examples

Deepanshu Bhalla
This tutorial covers the main concepts of regular expressions (regex) with hands-on examples. It also shows how to use regex in tools such as R and Python.

Introduction

Regex is short for 'regular expression'. It is mainly used to extract a substring from a string by searching for a specific pattern. The search pattern is defined by the regular expression.

The search pattern can be a single letter, a fixed string, or a complex pattern made up of digits, punctuation and letters.
Regular expressions can also be used to search and replace text.


Uses of Regular Expressions

Regular expressions have many real-world use cases. Some of them are as follows -
  1. Fetch email addresses mentioned in a long paragraph
  2. Validate a 10-digit phone number, Social Security Number or email address
  3. Extract text from HTML or XML code
  4. Rename multiple files in a single run
  5. Remove specified punctuation from text
  6. Web scraping : Search all web pages for content that contains a specific string
  7. Replace a complex pattern with a blank or a specific character
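
As a minimal sketch of the first two use cases, the Python snippet below pulls email addresses out of a paragraph and checks for a 10-digit phone number. The sample text, the helper name is_ten_digit_phone and the simplified email pattern are assumptions made for illustration. Real email addresses can be more complex than this pattern allows.

```python
import re

# Hypothetical sample text for illustration.
text = "Contact alice@example.com or bob.smith@mail.example.org for details."

# Simplified email pattern (an assumption, not a full address validator).
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
emails = re.findall(EMAIL, text)
print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']


def is_ten_digit_phone(s):
    """Return True if s consists of exactly 10 digits and nothing else."""
    return re.fullmatch(r"[0-9]{10}", s) is not None


print(is_ten_digit_phone("9876543210"))  # True
print(is_ten_digit_phone("98765"))       # False
```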


Let's start with the basics

1. Anchors and Word Boundaries

Symbol   Description
^        Beginning of line
$        End of line
\b       Word boundary (use on both sides to match a whole word)

Examples

1. ^abc matches abc at the beginning of the text 'abcd'

2. ^the matches the at the start of the text 'the beginning'

3. done$ matches done at the end of the text 'I am done'

4. \ban\b matches the whole word an in the text 'Elephant an animal'
\ban\b does not match the an inside Elephant or animal because it only performs whole-word matching.
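
The anchor and word-boundary examples can be tried in Python with the built-in re module. This is a small sketch. The variable names and the extra non-matching text 'at the beginning' are chosen here for illustration.

```python
import re

# ^ anchors the match at the beginning, $ at the end.
starts = re.search(r"^abc", "abcd")
ends = re.search(r"done$", "I am done")
no_match = re.search(r"^the", "at the beginning")  # 'the' is not at the start

print(starts.group())  # abc
print(ends.group())    # done
print(no_match)        # None

# \b restricts the match to the standalone word 'an'.
text = "Elephant an animal"
whole_word = [m.span() for m in re.finditer(r"\ban\b", text)]
anywhere = re.findall(r"an", text)

print(whole_word)     # [(9, 11)]
print(len(anywhere))  # 3
```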

2. OR Condition

An OR condition can be defined with the symbol | (alternation, usually inside a group) or with [ ] (a character class). See the examples below.

1. the(m|n) matches them or then in the text 'them then there theme' (including the them inside theme)

2. the[mn] is equivalent to the(m|n). Inside square brackets, | is a literal character, so the[m|n] would also match the|

3. \bthe[mn]\b matches only the complete words them or then in the text 'them then there theme'
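
The difference between alternation and a character class can be checked with a short Python sketch. The variable names and the test string 'the| them' are made up for illustration.

```python
import re

text = "them then there theme"

# Alternation in a (non-capturing) group and a character class give the same matches.
alternation = re.findall(r"the(?:m|n)", text)
char_class = re.findall(r"the[mn]", text)
whole_words = re.findall(r"\bthe[mn]\b", text)

print(alternation)  # ['them', 'then', 'them']
print(char_class)   # ['them', 'then', 'them']
print(whole_words)  # ['them', 'then']

# Inside [ ], the | is a literal character, so 'the|' matches too.
literal_pipe = re.findall(r"the[m|n]", "the| them")
print(literal_pipe)  # ['the|', 'them']
```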

3. Case Insensitive

The search patterns in all of the examples above are case-sensitive. To make a pattern case-insensitive, prefix it with the expression (?i)

1. (?i)abc matches both abc and ABC in the text 'abc ABC'

2. (?i)a[bd]a performs a case-insensitive match of a followed by either b or d and then a. In the text 'abc ABA Ada' it matches ABA and Ada
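
In Python, the same effect can be had either with the inline (?i) prefix or with the re.IGNORECASE flag. A minimal sketch, with variable names chosen for illustration:

```python
import re

# Inline flag at the start of the pattern.
inline = re.findall(r"(?i)abc", "abc ABC")

# Equivalent flag passed as an argument.
flagged = re.findall(r"a[bd]a", "abc ABA Ada", flags=re.IGNORECASE)

print(inline)   # ['abc', 'ABC']
print(flagged)  # ['ABA', 'Ada']
```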

4. Quantifiers

Quantifiers specify quantity. In simple words, they define how many times a particular regex element can occur.
Expression   Description
*            Item occurs zero or more times
+            Item occurs one or more times
?            Item occurs zero or one time
{A}          Item occurs exactly A times
{A,}         Item occurs A or more times
{A,B}        Item occurs between A and B times
.            Any single character
.*           Matches zero or more of any character

1. def* matches de followed by f zero or more times. Example - de def deff defff

2. def+ matches de followed by f at least once. Example - def deff defff

3. \bdef?\b matches the whole word de followed by f zero or one time. Example - de def

4. \bdef{2}\b matches the whole word de followed by f exactly two times. Example - deff

5. \bdef{2,}\b matches the whole word de followed by f two or more times. Example - deff defff

6. \bdef{3,4}\b matches the whole word de followed by f either 3 or 4 times. Example - defff deffff

7. a.* matches the first a and all characters after it up to the end of the line
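
The quantifier examples can be compared side by side in Python. This is a sketch. The test string and the sample 'banana split' are assumptions, and \b is added to every pattern so that only whole words are reported.

```python
import re

text = "de def deff defff deffff"

patterns = [
    r"\bdef*\b",      # f zero or more times
    r"\bdef+\b",      # f one or more times
    r"\bdef?\b",      # f zero or one time
    r"\bdef{2}\b",    # f exactly two times
    r"\bdef{2,}\b",   # f two or more times
    r"\bdef{3,4}\b",  # f three or four times
]

results = {p: re.findall(p, text) for p in patterns}
for p, found in results.items():
    print(p, found)

# a.* starts at the first 'a' and runs to the end of the line.
rest = re.search(r"a.*", "banana split").group()
print(rest)  # anana split
```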

5. Create Grouping

By placing part of a regular expression inside ( ), you create a group. A group lets you apply an OR condition to just a portion of the regex, or apply a quantifier to the entire group.

Groups also help to extract a portion of information from strings.

ab(cd|de)* matches ab followed by either cd or de, zero or more times.
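
A short Python sketch of both roles of a group: applying a quantifier to the whole group, and extracting the grouped portion. The sample strings are made up for illustration.

```python
import re

pattern = r"ab(cd|de)*"

# re.fullmatch requires the entire string to match the pattern.
samples = ["ab", "abcd", "abcdde", "abdecd", "abce"]
full = {s: re.fullmatch(pattern, s) is not None for s in samples}
print(full)

# A group also captures the text it matched, which can be read back.
m = re.search(r"ab(cd|de)", "xxabdeyy")
captured = m.group(1)
print(captured)  # de
```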

6. Back Reference

(name)\1 matches the text 'namename'. Here \1 refers back to whatever the first group matched, so the same text must appear again.

Replace (Substitution) using Back-reference

(ab|cd)e(fg|hi) matches either ab or cd, followed by e, then either fg or hi
Enter \1\2 as the substitution to return the values of the first and second groups, dropping the e between them.
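
Both uses of a back-reference, inside the pattern and inside the replacement, can be sketched in Python as follows. The sample strings are assumptions. In Python the replacement is written as a raw string so that \1 and \2 reach the regex engine unchanged.

```python
import re

# Back-reference inside the pattern: the group's text must repeat.
repeated = re.search(r"(name)\1", "namename") is not None
not_repeated = re.search(r"(name)\1", "name only") is not None
print(repeated, not_repeated)  # True False

# Back-references in the replacement: keep groups 1 and 2, drop the 'e'.
replaced = re.sub(r"(ab|cd)e(fg|hi)", r"\1\2", "abefg cdehi")
print(replaced)  # abfg cdhi
```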

7. Lazy Quantifier

A lazy (or non-greedy) quantifier matches a regex element as few times as possible, whereas a greedy quantifier matches it as many times as possible.
You can convert a greedy quantifier into a lazy quantifier by simply adding a ? after it.

<.*?> matches a < followed by as few characters as possible up to the next >, so each tag is matched separately.
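
The difference is easiest to see on a line with several tags. A minimal Python sketch, using a made-up HTML fragment:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last '>' on the line.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first '>' after each '<'.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```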


8. How to match a literal dot or asterisk

A backslash \ escapes the asterisk and the dot. In other words, it makes regex treat the character literally instead of as a special symbol.
abc\* matches the literal text abc*, not abcc
In the R programming language, you need to add one more backslash, abc\\*, because a backslash must itself be escaped inside an R string.
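
In Python, writing the pattern as a raw string (r"...") keeps a single backslash sufficient. The sketch below also uses re.escape, which escapes special characters in a literal string automatically. The sample strings are assumptions for illustration.

```python
import re

text = "abc* abcc"

# Escaped: \* is a literal asterisk.
literal = re.findall(r"abc\*", text)

# Unescaped: c* means 'c zero or more times'.
quantified = re.findall(r"abc*", text)

print(literal)     # ['abc*']
print(quantified)  # ['abc', 'abcc']

# re.escape builds a pattern that matches the given text literally.
escaped = re.escape("3.14*2")
print(re.fullmatch(escaped, "3.14*2") is not None)  # True
print(re.fullmatch(escaped, "3x14*2") is not None)  # False
```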

9. POSIX Regular Expressions

POSIX character classes are written inside square brackets. Like the other expressions above, they match letters, digits, punctuation and more.
POSIX       Description                                        ASCII
[:digit:]   Digits                                             [0-9]
[:lower:]   Lowercase letters                                  [a-z]
[:upper:]   Uppercase letters                                  [A-Z]
[:alpha:]   Lower and uppercase letters                        [a-zA-Z]
[:alnum:]   Lower and uppercase letters and digits             [a-zA-Z0-9]
[:blank:]   Space and tab                                      [ \t]
[:space:]   All whitespace characters, including line breaks   [ \t\r\n\v\f]
[:punct:]   Punctuation                                        [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]

Select a string that has a letter followed by one or more digits
[[:alpha:]][[:digit:]]+
  1. [[:alpha:]] means any letter
  2. [[:digit:]] means any digit
  3. + means the previous item occurs one or more times
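
Python's built-in re module does not support POSIX classes such as [:alpha:], so the ASCII equivalents from the table are used instead. The sketch below is one possible way to translate a pattern. The lookup table POSIX_TO_ASCII and the helper posix_to_ascii are hypothetical names introduced here, and only a few classes are covered.

```python
import re

# Hypothetical lookup table: POSIX class -> ASCII range usable inside [ ].
POSIX_TO_ASCII = {
    "[:digit:]": "0-9",
    "[:lower:]": "a-z",
    "[:upper:]": "A-Z",
    "[:alpha:]": "a-zA-Z",
    "[:alnum:]": "a-zA-Z0-9",
}


def posix_to_ascii(pattern):
    """Replace the POSIX classes listed above with their ASCII ranges."""
    for name, ascii_range in POSIX_TO_ASCII.items():
        pattern = pattern.replace(name, ascii_range)
    return pattern


pattern = posix_to_ascii("[[:alpha:]][[:digit:]]+")
print(pattern)  # [a-zA-Z][0-9]+

found = re.findall(pattern, "sample text B2 testing B52")
print(found)  # ['B2', 'B52']
```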

Find the first match of a character
Suppose you have the text x = "Hello How are you doing? are you okay? I am fine" and you need to extract everything up to the first question mark. The examples use str_extract() from the stringr package in R.
  1. str_extract(x, "Hello.*\\?")
    returns "Hello How are you doing? are you okay?" Here \\? matches a literal question mark (not the regex quantifier ?). Because .* is greedy, the match runs to the last question mark.
  2. str_extract(x, "Hello.*?\\?")
    returns "Hello How are you doing?" Here the ? in .*? makes it non-greedy, so it stops at the first question mark.
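
The same two extractions can be sketched in Python with re.search, which returns the first match. In a Python raw string a single backslash before ? is enough.

```python
import re

x = "Hello How are you doing? are you okay? I am fine"

# Greedy: runs to the last question mark.
greedy = re.search(r"Hello.*\?", x).group()

# Lazy: stops at the first question mark.
lazy = re.search(r"Hello.*?\?", x).group()

print(greedy)  # Hello How are you doing? are you okay?
print(lazy)    # Hello How are you doing?
```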
How to use regex with R and Python
R

1. grep(pattern, x)
Searches for a pattern in each element of a vector x and returns the indices of the matching elements

2. gsub(pattern, replacement, x)
Replaces every match of a pattern in each element of a vector x
x = "sample text B2 testing B52"
gsub('[[:alpha:]][[:digit:]]+', '', x)

Python

The built-in module re provides regular expressions in Python.

1. re.search(pattern, x)
Searches the string x for the first location where the pattern matches and returns a match object, or None if there is no match

2. re.sub(pattern, replacement, x)
Replaces every match of the pattern in the string x
import re
x = 'Welcome to Python3.6'
# Remove a run of letters followed by digits and dots
re.sub(r'[a-zA-Z]+[0-9.]+', '', x)
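
Since re.search returns a match object rather than the matched text, a short sketch of reading the result may help. The pattern and variable names are chosen here for illustration.

```python
import re

x = "Welcome to Python3.6"

# First run of digits and dots in the string.
m = re.search(r"[0-9.]+", x)
if m is not None:
    matched_text = m.group()  # the matched substring
    position = m.span()       # (start, end) indices of the match
    print(matched_text, position)  # 3.6 (17, 20)

# When nothing matches, re.search returns None.
missing = re.search(r"[0-9]+", "no digits here")
print(missing)  # None
```

Checking for None before calling .group() avoids an AttributeError when the pattern does not match.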

Exercises : Regular Expression

1. Replace the abbreviation of thousand (K) with 000

x = "K 25K 2K"
Desired Output : K 25000 2000

Solution
gsub('([0-9])K', '\\1000', x)

The backslash is doubled because a single backslash is not allowed here: R string literals treat \ as an escape character.

2. Remove extra characters

x = "var1_avg_a1 var1_a_avg_7"
Desired Output : var1 var1_a

Solution
gsub('_avg_.*?[0-9]', '', x)

The ? makes the quantifier .* non-greedy (lazy), so each match stops at the first digit after _avg_
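
For comparison, here is a sketch of both exercises in Python. One difference from the R solution is an assumption worth noting: in a Python replacement string, \1 followed directly by 000 would not be read as group 1 followed by three zeros, so the unambiguous form \g<1> is used.

```python
import re

# Exercise 1: replace K after a digit with 000.
x1 = "K 25K 2K"
out1 = re.sub(r"([0-9])K", r"\g<1>000", x1)
print(out1)  # K 25000 2000

# Exercise 2: remove _avg_ and everything up to the next digit.
x2 = "var1_avg_a1 var1_a_avg_7"
out2 = re.sub(r"_avg_.*?[0-9]", "", x2)
print(out2)  # var1 var1_a
```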
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

1 Response to "Regex Tutorial with Examples"
  1. Interesting article.
    In addition, the caret '^' symbol is also used for negated character classes. When a caret appears right after the opening square bracket '[', the class matches any character except the characters inside the square brackets.
    For example: /^a[^b]a$/ will match any 3 characters starting and ending with 'a' except 'aba'.
    Matches: aca, a a, a_a, a1a and so on.
