This tutorial explains various string (character) functions used in Python, with examples. To manipulate strings and character values, python has several built-in functions.
List of Frequently Used String Functions in Python
The table below shows many common string functions in Python along with their descriptions and their equivalent functions in MS Excel.
Function | Description | MS EXCEL FUNCTION |
---|---|---|
mystring[:N] | Extract N number of characters from start of string. | LEFT( ) |
mystring[-N:] | Extract N number of characters from end of string | RIGHT( ) |
mystring[X:Y] | Extract characters from middle of string, starting from X position and ends with Y | MID( ) |
str.split(sep=' ') | Split Strings | - |
str.replace(old_substring, new_substring) | Replace a part of text with different sub-string | REPLACE( ) |
str.lower() | Convert characters to lowercase | LOWER( ) |
str.upper() | Convert characters to uppercase | UPPER( ) |
str.contains('pattern', case=False) | Check if pattern matches (Pandas Function) | SQL LIKE Operator |
str.extract(regular_expression) | Return matched values (Pandas Function) | - |
str.count('sub_string') | Count occurence of pattern in string | - |
str.find( ) | Return position of sub-string or pattern | FIND( ) |
str.isalnum() | Check whether string consists of only alphanumeric characters | - |
str.islower() | Check whether characters are all lower case | - |
str.isupper() | Check whether characters are all upper case | - |
str.isnumeric() | Check whether string consists of only numeric characters | - |
str.isspace() | Check whether string consists of only whitespace characters | - |
len( ) | Calculate length of string | LEN( ) |
cat( ) | Concatenate Strings (Pandas Function) | CONCATENATE( ) |
separator.join(str) | Concatenate Strings | CONCATENATE( ) |
LEFT, RIGHT and MID Functions
If you are intermediate MS Excel users, you must have used LEFT, RIGHT and MID Functions. These functions are used to extract N number of characters or letters from string.
mystring = "Hey buddy, wassup?" mystring[:2]
'He'
string[start:stop:step]
means item start from 0 (default) through (stop-1), step by 1 (default).mystring[:2]
is equivalent tomystring[0:2]
mystring[:2]
tells Python to pull first 2 characters frommystring
string object.- Indexing starts from zero so it includes first, second element and excluding third.
mystring[-2:]
The above command returns p?
.The -2 starts the range from second last position through maximum length of string.
mystring[1:3]
'ey'
mystring[1:3]
returns second and third characters. 1 refers to second character as index begins with 0.
mystring[::-1]
'?pussaw ,yddub yeH'
-1 tells Python to start it from end and increment it by 1 from right to left.
Let's create a fake data frame for illustration. In the code below, we are creating a dataframe named df
containing only 1 variable called var1
import pandas as pd df = pd.DataFrame({"var1": ["A_2", "B_1", "C_2", "A_2"]})
var1 0 A_2 1 B_1 2 C_2 3 A_2
To deal text data in Python Pandas Dataframe, we can use str
attribute. It can be used for slicing character values.
df['var1'].str[0]
In this case, we are fetching first character from var1
variable. See the output shown below.
0 A 1 B 2 C 3 A
Extract Words from String
Suppose you need to take out word(s) instead of characters from string. Generally we consider one blank space as delimiter to find words from string.
mystring.split()[0]
'Hey'
split()
function breaks string using space as a default separatormystring.split()
returns['Hey', 'buddy,', 'wassup?']
0
returns first item or wordHey
mystring.split(',')[0]
Out[1]: 'Hey buddy'
mystring.split()[-1]
Out[1]: 'wassup?'
Let's build a dummy data frame consisting of customer names and call it variable custname
mydf = pd.DataFrame({"custname": ["Priya_Sehgal", "David_Stevart", "Kasia_Woja", "Sandy_Dave"]})
custname 0 Priya_Sehgal 1 David_Stevart 2 Kasia_Woja 3 Sandy_Dave
#First Word mydf['fname'] = mydf['custname'].str.split('_').str[0] #Last Word mydf['lname'] = mydf['custname'].str.split('_').str[1]
custname fname lname 0 Priya_Sehgal Priya Sehgal 1 David_Stevart David Stevart 2 Kasia_Woja Kasia Woja 3 Sandy_Dave Sandy Dave
str.split( )
is similar tosplit( )
. It is used to activate split function in pandas data frame in Python.- In the code above, we created two new columns named
fname
andlname
storing first and last name.
SQL LIKE Operator in Pandas DataFrame
In SQL, LIKE Statement is used to find out if a character string matches or contains a pattern. We can implement similar functionality in python using str.contains( )
function.
df2 = pd.DataFrame({"var1": ["AA_2", "B_1", "C_2", "a_2"], "var2": ["X_2", "Y_1", "Z_2", "X2"]})
var1 var2 0 AA_2 X_2 1 B_1 Y_1 2 C_2 Z_2 3 a_2 X2
The following code tells Python to find rows containing either 'A' or 'B' in variable 'var1'.
df2['var1'].str.contains('A|B')
str.contains(pattern)
is used to match pattern in Pandas Dataframe.
0 True 1 True 2 False 3 False
The above command returns FALSE against fourth row as the function is case-sensitive. To ignore case-sensitivity, we can use case=False
parameter. See the working example below.
df2['var1'].str.contains('A|B', case=False)
In the following program, we are asking Python to subset data with condition - contain character values either A or B. It is equivalent to WHERE keyword in SQL.
df2[df2['var1'].str.contains('A|B', case=False)]
var1 var2 0 AA_2 X_2 1 B_1 Y_1 3 a_2 X2
Suppose you want only those values that have alphabet followed by '_'
df2[df2['var1'].str.contains('^[A-Z]_', case=False)]
^
is a token of regular expression which means begin with a particular item.
var1 var2 1 B_1 Y_1 2 C_2 Z_2 3 a_2 X2
Find position of a particular character or keyword
str.find(pattern)
is used to find position of sub-string. In this case, sub-string is '_'.
df2['var1'].str.find('_')
0 2 1 1 2 1 3 1
Replace substring
str.replace(old_text,new_text,case=False)
is used to replace a particular character(s) or pattern with some new value or pattern. In the code below, we are replacing _ with -- in variable var1.
df2['var1'].str.replace('_', '--', case=False)
0 AA--2 1 B--1 2 C--2 3 A--2
We can also complex patterns like the following program. +
means item occurs one or more times. In this case, alphabet occurring 1 or more times.
df2['var1'].str.replace('[A-Z]+_', 'X', case=False)
0 X2 1 X1 2 X2 3 X2
Find length of string
len(string)
is used to calculate length of string. In pandas data frame, you can apply str.len()
for the same.
df2['var1'].str.len()
0 4 1 3 2 3 3 3
To find count of occurrence of a particular character (let's say, how many time 'A' appears in each row), you can use str.count(pattern)
function.
df2['var1'].str.count('A')
Convert to lowercase and uppercase
str.lower()
and str.upper()
functions are used to convert string to lower and uppercase values.
#Convert to lower case mydf['custname'].str.lower() #Convert to upper case mydf['custname'].str.upper()
Remove Leading and Trailing Spaces
str.strip()
removes both leading and trailing spaces.str.lstrip()
removes leading spaces (at beginning).str.rstrip()
removes trailing spaces (at end).
df1 = pd.DataFrame({'y1': [' jack', 'jill ', ' jesse ', 'frank ']}) df1['both']=df1['y1'].str.strip() df1['left']=df1['y1'].str.lstrip() df1['right']=df1['y1'].str.rstrip()
y1 both left right 0 jack jack jack jack 1 jill jill jill jill 2 jesse jesse jesse jesse 3 frank frank frank frank
Convert Numeric to String
With the use of str( )
function, you can convert numeric value to string.
myvariable = 4 mystr = str(myvariable)
Concatenate or Join Strings
By simply using +
, you can join two string values.
x = "Deepanshu" y ="Bhalla" x+y
DeepanshuBhalla
In case you want to add a space between two strings, you can use this - x+' '+y
returns Deepanshu Bhalla
Suppose you have a list containing multiple string values and you want to combine them. You can use join( ) function.
string0 = ['Ram', 'Kumar', 'Singh'] ' '.join(string0)
'Ram Kumar Singh'
Suppose you want to combine or concatenate two columns of pandas dataframe.
mydf['fullname'] = mydf['fname'] + ' ' + mydf['lname']OR
mydf['fullname'] = mydf[['fname', 'lname']].apply(lambda x: ' '.join(x), axis=1)
custname fname lname fullname 0 Priya_Sehgal Priya Sehgal Priya Sehgal 1 David_Stevart David Stevart David Stevart 2 Kasia_Woja Kasia Woja Kasia Woja 3 Sandy_Dave Sandy Dave Sandy Dave
SQL IN Operator in Pandas
We can use isin(list)
function to include multiple values in our filtering or subsetting criteria.
mydata = pd.DataFrame({'product': ['A', 'B', 'B', 'C','C','D','A']}) mydata[mydata['product'].isin(['A', 'B'])]
product 0 A 1 B 2 B 6 A
We can use sign ~
to tell python to negate the condition.
mydata[~mydata['product'].isin(['A', 'B'])]
Extract a particular pattern from string
str.extract(r'regex-pattern')
is used for this task.
df2['var1'].str.extract(r'(^[A-Z]_)')
r'(^[A-Z]_)'
means starts with A-Z and then followed by '_'
0 NaN 1 B_ 2 C_ 3 NaN
To remove missing values, we can use dropna( )
function.
df2['var1'].str.extract(r'(^[A-Z]_)').dropna()
Thanks to Listendata for an excellent article. I love Listendata.
ReplyDeleteNice article, it was very helpful to me. Thanks!
ReplyDelete