This tutorial covers the core concepts of regular expressions (regex) with hands-on examples. It also shows how to use regex in tools such as R and Python.
Introduction
regex is short for 'regular expression'. It is mainly used to extract a substring from a string by searching for a specific pattern, and that search pattern is defined by the regular expression. The search pattern can be a single letter, a fixed string, or a complex pattern made up of digits, punctuation and letters.
Regular expressions can be used to search and replace text.
Uses of Regular expression
Regular expressions have several real-world use cases. Some of them are as follows:
- Fetch email addresses mentioned in a long paragraph
- Validate a 10-digit phone number, Social Security Number or email address
- Extract text from HTML or XML code
- Rename multiple files in a single run
- Remove punctuation from text
- Web scraping: search all web pages that contain a specific string for specific content
- Replace a complex pattern with a blank or a specific character
Let's start with the basics.
1. Anchor and Word Boundaries
Symbol | Description |
---|---|
^ | Beginning of line |
$ | End of line |
\b | Word boundary (used to match whole words) |
Examples
1. ^abc matches abc at the beginning of the text 'abcd'.
2. ^the matches the at the start of the text 'the beginning'.
3. done$ matches done at the end of the text 'I am done'.
4. \ban\b matches the whole word an in the text 'Elephant an animal'. It does not match the an inside Elephant or animal, because \b only allows a match at a word boundary.
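The four anchor examples above can be tried in Python with the standard re module. This is a minimal sketch; the variable names are only illustrative.

```python
import re

# ^ anchors the match at the start of the string.
starts_abc = re.findall(r"^abc", "abcd")
starts_the = re.findall(r"^the", "the beginning")

# $ anchors the match at the end of the string.
ends_done = re.findall(r"done$", "I am done")

# \b marks a word boundary, so only the standalone word "an" matches.
whole_an = re.findall(r"\ban\b", "Elephant an animal")

# Without \b, "an" is also found inside "Elephant" and "animal".
any_an = re.findall(r"an", "Elephant an animal")

print(starts_abc, starts_the, ends_done)  # ['abc'] ['the'] ['done']
print(len(whole_an), len(any_an))         # 1 3
```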
2. OR Condition
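An OR condition can be expressed with the alternation symbol | inside a group ( ), or with a character class [ ] when choosing between single characters. See the examples below.
1. the(m|n) matches them or then in the text 'them then there theme' (including the them at the start of theme).
2. the[mn] is equivalent to the(m|n). Inside square brackets | is a literal character, so the[m|n] would also match the|.
3. \bthe[mn]\b matches only the complete words them or then in the text 'them then there theme'.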
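A short Python sketch of the OR examples, using the standard re module. One detail worth checking for yourself: inside square brackets the | is treated as a literal character, so a class written as [m|n] also accepts a pipe. The sample string "the| them" is made up for that check.

```python
import re

text = "them then there theme"

# Alternation inside a (non-capturing) group: "the" followed by m or n.
with_group = re.findall(r"the(?:m|n)", text)

# A character class expresses the same choice between single characters.
with_class = re.findall(r"the[mn]", text)

# Word boundaries restrict the match to the complete words.
whole_words = re.findall(r"\bthe[mn]\b", text)

# Inside [ ], the pipe is literal, so "the|" matches too.
pipe_literal = re.findall(r"the[m|n]", "the| them")

print(with_group)    # ['them', 'then', 'them']
print(whole_words)   # ['them', 'then']
print(pipe_literal)  # ['the|', 'them']
```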
3. Case Insensitive
The search patterns in all of the above examples are case-sensitive. To make a pattern case-insensitive, prefix it with (?i).
1. (?i)abc matches both abc and ABC in the text 'abc ABC'.
2. (?i)a[bd]a performs a case-insensitive match of a followed by either b or d and then a. In the text 'abc ABA Ada' it matches ABA and Ada.
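In Python, the inline (?i) prefix works as described, and the re.IGNORECASE flag is an alternative way to get the same behaviour. A minimal sketch:

```python
import re

# Inline flag at the start of the pattern.
inline = re.findall(r"(?i)abc", "abc ABC")

# The same search using the re.IGNORECASE flag instead of (?i).
flagged = re.findall(r"abc", "abc ABC", flags=re.IGNORECASE)

# Case-insensitive: a, then b or d, then a.
mixed = re.findall(r"(?i)a[bd]a", "abc ABA Ada")

print(inline)   # ['abc', 'ABC']
print(flagged)  # ['abc', 'ABC']
print(mixed)    # ['ABA', 'Ada']
```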
4. Quantifiers
Quantifiers specify how many times a particular regex element can occur.
Expression | Description |
---|---|
* | Item occurs zero or more times |
+ | Item occurs one or more times |
? | Item occurs zero or one time |
{A} | Item occurs exactly A times |
{A,B} | Item occurs between A and B times |
. | Any single character (except a newline, by default) |
.* | Matches zero or more of any character |
1. def* matches de followed by f zero or more times. Example: de def deff defff
2. def+ matches de followed by f at least once. Example: def deff defff
3. \bdef?\b matches the whole word de followed by f zero or one time. Example: de def
4. \bdef{2}\b matches the whole word de followed by f exactly two times. Example: deff
5. \bdef{2,}\b matches the whole word de followed by f two or more times. Example: deff defff
6. \bdef{3,4}\b matches the whole word de followed by f either 3 or 4 times. Example: defff deffff
7. a.* matches the first a and all characters after it, up to the end of the line.
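The quantifier examples can be checked against one sample string in Python. This is a sketch; the strings `text` and "banana split" are made up for illustration.

```python
import re

text = "de def deff defff deffff"

zero_or_more = re.findall(r"def*", text)          # f zero or more times
one_or_more = re.findall(r"def+", text)           # f at least once
zero_or_one = re.findall(r"\bdef?\b", text)       # whole word, f optional
exactly_two = re.findall(r"\bdef{2}\b", text)     # whole word, exactly ff
two_or_more = re.findall(r"\bdef{2,}\b", text)    # whole word, ff or longer
three_to_four = re.findall(r"\bdef{3,4}\b", text) # whole word, fff or ffff

# a.* starts at the first "a" and runs to the end of the line.
after_a = re.findall(r"a.*", "banana split")

print(zero_or_one)    # ['de', 'def']
print(exactly_two)    # ['deff']
print(three_to_four)  # ['defff', 'deffff']
print(after_a)        # ['anana split']
```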
5. Create Grouping
By placing part of a regular expression inside ( ), you create a group. A group lets you apply an OR condition to a portion of the regex, or apply a quantifier to the entire group. It also helps to extract a portion of information from strings.
ab(cd|de)* matches ab followed by either cd or de, zero or more times.
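A minimal Python sketch of both uses of a group: applying a quantifier to an alternation, and extracting part of a match. The candidate strings and the "12 items" sentence are invented for the example.

```python
import re

# Quantifier applied to the whole group: ab, then cd or de repeated.
pattern = re.compile(r"ab(cd|de)*")
candidates = ["ab", "abcd", "abde", "abcdde", "abdc"]
full_matches = [s for s in candidates if pattern.fullmatch(s)]

# A group also captures part of the match so it can be extracted.
found = re.search(r"(\d+) items", "Order has 12 items")
count = found.group(1)

print(full_matches)  # ['ab', 'abcd', 'abde', 'abcdde']
print(count)         # 12
```

Here fullmatch is used so that a string only counts when the whole string fits the pattern; with search, "abdc" would still match on its leading ab.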
6. Back Reference
A back-reference such as \1 matches the same text that was captured by the first group.
(name)\1 matches name followed by the same text again, i.e. namename.
Replace (Substitution) using Back-reference
(ab|cd)e(fg|hi) matches either ab or cd, followed by e, then either fg or hi.
Entering \1\2 as the substitution returns the values of the first and second groups and drops the e between them. For example, abefg becomes abfg.
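Both back-reference uses can be sketched in Python with re.search and re.sub. The sample strings are made up for illustration.

```python
import re

# \1 must repeat whatever group 1 captured, so "name" has to appear twice.
doubled = re.search(r"(name)\1", "my namename tag")
single = re.search(r"(name)\1", "name only")

# In the replacement, \1 and \2 stand for the captured groups;
# the "e" between them is not captured, so it is dropped.
replaced = re.sub(r"(ab|cd)e(fg|hi)", r"\1\2", "abefg cdehi")

print(doubled.group(0))  # namename
print(single)            # None
print(replaced)          # abfg cdhi
```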
7. Lazy Quantifier
A lazy (or non-greedy) quantifier matches a regex element as few times as possible, whereas a greedy quantifier matches it as many times as possible.
You can convert a greedy quantifier into a lazy quantifier by simply adding a ? after it.
<.*?> matches the shortest string that starts with <, is followed by any characters, and ends with >.
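The difference between greedy and lazy matching is easiest to see side by side. A small Python sketch, with a made-up HTML fragment as input:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, up to the last ">".
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first ">" it can, so each tag matches separately.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```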
8. How to match a literal dot or asterisk
By putting a backslash \ before an asterisk or a dot, you escape it. In other words, it makes the regex engine treat the character literally.
abc\* matches the literal text abc*, not abcc.
In the R programming language, you need to add one more backslash, abc\\*, to make R understand the literal meaning of the asterisk.
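In Python, a raw string (r"...") lets you write the single backslash directly. The sketch below compares escaped and unescaped patterns; re.escape is a standard-library helper that adds the backslashes for you. The sample strings are invented.

```python
import re

sample = "abc* abcc abc"

# Escaped asterisk: only the literal text "abc*" matches.
literal_star = re.findall(r"abc\*", sample)

# Unescaped asterisk: "ab" followed by zero or more "c".
quantified = re.findall(r"abc*", sample)

# Escaped dot matches only a real dot; an unescaped dot matches any character.
literal_dot = re.findall(r"3\.6", "3.6 316")
any_char = re.findall(r"3.6", "3.6 316")

# re.escape builds the escaped pattern from plain text.
escaped = re.escape("abc*")

print(literal_star)  # ['abc*']
print(quantified)    # ['abc', 'abcc', 'abc']
print(literal_dot)   # ['3.6']
print(any_char)      # ['3.6', '316']
```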
9. POSIX Regular Expressions
POSIX character classes are written inside square brackets. Like other regex character classes, they match letters, digits, punctuation and more.
POSIX | Description | ASCII |
---|---|---|
[:digit:] | Digits | [0-9] |
[:lower:] | Lowercase letters | [a-z] |
[:upper:] | Uppercase letters | [A-Z] |
[:alpha:] | Lower and uppercase letters | [a-zA-Z] |
[:alnum:] | Lower and uppercase letters and digits | [a-zA-Z0-9] |
[:blank:] | Space and tab | [ \t] |
[:space:] | All whitespace characters, including line breaks | [ \t\r\n\v\f] |
[:punct:] | Punctuation | [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~] |
Select strings having a letter followed by one or more digits
[[:alpha:]][[:digit:]]+
- [[:alpha:]] means any letter
- [[:digit:]] means any digit
- + means the previous item occurs one or more times
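Python's built-in re module does not recognise POSIX classes such as [[:alpha:]], so the ASCII equivalents from the table are used there instead. The sketch below is one way to write the same search in Python; using string.punctuation as a stand-in for [:punct:] is an assumption of this example, not something from the tutorial.

```python
import re
import string

x = "sample text B2 testing B52"

# ASCII equivalent of [[:alpha:]][[:digit:]]+
letter_digits = re.findall(r"[a-zA-Z][0-9]+", x)

# string.punctuation lists the ASCII punctuation characters;
# re.escape makes them safe to place inside a character class.
punct_class = "[" + re.escape(string.punctuation) + "]"
no_punct = re.sub(punct_class, "", "Hello, world!")

print(letter_digits)  # ['B2', 'B52']
print(no_punct)       # Hello world
```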
Find the first match of a character
Suppose you have the text x = "Hello How are you doing? are you okay? I am fine"
You need to extract the text up to the first question mark (?). The examples use str_extract from the stringr package in R.
- str_extract(x, "Hello.*\\?") returns "Hello How are you doing? are you okay?". Here \\? matches a literal question mark (not the regex quantifier ?), and the greedy .* runs up to the last question mark.
- str_extract(x, "Hello.*?\\?") returns "Hello How are you doing?". Here the ? in .*? makes it non-greedy, so it stops after the first question mark.
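The same extraction can be sketched in Python with re.search. In a raw string, a single backslash is enough to escape the question mark.

```python
import re

x = "Hello How are you doing? are you okay? I am fine"

# Greedy: .* runs to the last literal question mark.
greedy = re.search(r"Hello.*\?", x).group(0)

# Lazy: .*? stops at the first literal question mark.
lazy = re.search(r"Hello.*?\?", x).group(0)

print(greedy)  # Hello How are you doing? are you okay?
print(lazy)    # Hello How are you doing?
```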
How to use regex with R and Python
R
1. grep(pattern, x)
Search for a particular pattern in each element of a vector x
2. gsub(pattern, replacement, x)
Replace a particular pattern in each element of a vector x
x = "sample text B2 testing B52"
gsub('[[:alpha:]][[:digit:]]+', '',x)
Python
The re module can be used for regular expressions in Python.
1. re.search(pattern, x)
Search the string x for the first location where the pattern matches
2. re.sub(pattern, replacement, x)
Replace every occurrence of the pattern in the string x
import re
x = 'Welcome to Python3.6'
# Remove letters followed by digits or dots; | is not needed inside [ ]
re.sub(r'[a-zA-Z]+[0-9.]+', '', x)
# returns 'Welcome to '
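Only re.sub is demonstrated above, so here is a small sketch of re.search on the same string. It returns a match object, or None when nothing matches; the pattern is a slightly tidied variant of the one used above.

```python
import re

x = "Welcome to Python3.6"

# Letters followed by digits or dots.
found = re.search(r"[a-zA-Z]+[0-9.]+", x)
matched_text = found.group(0)   # the text that matched
position = found.span()         # (start, end) indexes of the match

# When the pattern is absent, re.search returns None.
missing = re.search(r"[0-9]{4}", x)

print(matched_text)  # Python3.6
print(position)      # (11, 20)
print(missing)       # None
```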
Exercises : Regular Expression
1. Replace the abbreviation for thousand (K) with 000
x = "K 25K 2K"
Desired Output : K 25000 2000
Solution
gsub('([0-9])K', '\\1000',x)
Two backslashes are needed because a single backslash is not allowed in an R string.
2. Remove extra characters
x = "var1_avg_a1 var1_a_avg_7"
Desired Output : var1 var1_a
Solution
gsub('_avg_.*?[0-9]', '',x)
The ? makes the quantifier non-greedy (lazy).
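For readers working in Python, the two exercises can be sketched with re.sub. One difference from the R solution: in a Python replacement string, \1 followed directly by digits is ambiguous, so the explicit form \g<1> is used to refer to the first group.

```python
import re

# Exercise 1: a digit followed by K becomes that digit followed by 000.
x1 = "K 25K 2K"
thousands = re.sub(r"([0-9])K", r"\g<1>000", x1)

# Exercise 2: remove "_avg_" and everything up to the next digit (lazy match).
x2 = "var1_avg_a1 var1_a_avg_7"
trimmed = re.sub(r"_avg_.*?[0-9]", "", x2)

print(thousands)  # K 25000 2000
print(trimmed)    # var1 var1_a
```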
Interesting article.
In addition, the caret '^' symbol is also used for negated character classes. When the caret is placed right after the opening square bracket '[', the class matches any character except the characters inside the square brackets.
For example, /^a[^b]a$/ matches any 3-character string that starts and ends with 'a', except 'aba'.
Matches: aca, a a, a_a, a1a and so on.
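The negated character class from this comment can be checked with a short Python sketch. The list of candidate strings is made up for the test.

```python
import re

# ^a[^b]a$ : exactly three characters, a, then anything except b, then a.
pattern = re.compile(r"^a[^b]a$")

candidates = ["aca", "a a", "a_a", "a1a", "aba", "abca"]
accepted = [s for s in candidates if pattern.search(s)]

print(accepted)  # ['aca', 'a a', 'a_a', 'a1a']
```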