This tutorial explains various string (character) functions used in Python, with examples. To manipulate strings and character values, python has several built-in functions.
List of Frequently Used String Functions in Python
The table below shows many common string functions in Python along with their descriptions and their equivalent functions in MS Excel.
| Function | Description | MS EXCEL FUNCTION |
|---|---|---|
| mystring[:N] | Extract N number of characters from start of string. | LEFT( ) |
| mystring[-N:] | Extract N number of characters from end of string | RIGHT( ) |
| mystring[X:Y] | Extract characters from middle of string, starting from X position and ends with Y | MID( ) |
| str.split(sep=' ') | Split Strings | - |
| str.replace(old_substring, new_substring) | Replace a part of text with different sub-string | REPLACE( ) |
| str.lower() | Convert characters to lowercase | LOWER( ) |
| str.upper() | Convert characters to uppercase | UPPER( ) |
| str.contains('pattern', case=False) | Check if pattern matches (Pandas Function) | SQL LIKE Operator |
| str.extract(regular_expression) | Return matched values (Pandas Function) | - |
| str.count('sub_string') | Count occurence of pattern in string | - |
| str.find( ) | Return position of sub-string or pattern | FIND( ) |
| str.isalnum() | Check whether string consists of only alphanumeric characters | - |
| str.islower() | Check whether characters are all lower case | - |
| str.isupper() | Check whether characters are all upper case | - |
| str.isnumeric() | Check whether string consists of only numeric characters | - |
| str.isspace() | Check whether string consists of only whitespace characters | - |
| len( ) | Calculate length of string | LEN( ) |
| cat( ) | Concatenate Strings (Pandas Function) | CONCATENATE( ) |
| separator.join(str) | Concatenate Strings | CONCATENATE( ) |
LEFT, RIGHT and MID Functions
If you are intermediate MS Excel users, you must have used LEFT, RIGHT and MID Functions. These functions are used to extract N number of characters or letters from string.
mystring = "Hey buddy, wassup?" mystring[:2]
'He'
string[start:stop:step]means item start from 0 (default) through (stop-1), step by 1 (default).mystring[:2]is equivalent tomystring[0:2]mystring[:2]tells Python to pull first 2 characters frommystringstring object.- Indexing starts from zero so it includes first, second element and excluding third.
mystring[-2:]
The above command returns p?.The -2 starts the range from second last position through maximum length of string.
mystring[1:3]
'ey'
mystring[1:3] returns second and third characters. 1 refers to second character as index begins with 0.
mystring[::-1]
'?pussaw ,yddub yeH'
-1 tells Python to start it from end and increment it by 1 from right to left.
Let's create a fake data frame for illustration. In the code below, we are creating a dataframe named df containing only 1 variable called var1
import pandas as pd
df = pd.DataFrame({"var1": ["A_2", "B_1", "C_2", "A_2"]})
var1 0 A_2 1 B_1 2 C_2 3 A_2
To deal text data in Python Pandas Dataframe, we can use str attribute. It can be used for slicing character values.
df['var1'].str[0]
In this case, we are fetching first character from var1 variable. See the output shown below.
0 A 1 B 2 C 3 A
Extract Words from String
Suppose you need to take out word(s) instead of characters from string. Generally we consider one blank space as delimiter to find words from string.
mystring.split()[0]
'Hey'
split()function breaks string using space as a default separatormystring.split()returns['Hey', 'buddy,', 'wassup?']0returns first item or wordHey
mystring.split(',')[0]
Out[1]: 'Hey buddy'
mystring.split()[-1]
Out[1]: 'wassup?'
Let's build a dummy data frame consisting of customer names and call it variable custname
mydf = pd.DataFrame({"custname": ["Priya_Sehgal", "David_Stevart", "Kasia_Woja", "Sandy_Dave"]})
custname
0 Priya_Sehgal
1 David_Stevart
2 Kasia_Woja
3 Sandy_Dave
#First Word
mydf['fname'] = mydf['custname'].str.split('_').str[0]
#Last Word
mydf['lname'] = mydf['custname'].str.split('_').str[1]
custname fname lname
0 Priya_Sehgal Priya Sehgal
1 David_Stevart David Stevart
2 Kasia_Woja Kasia Woja
3 Sandy_Dave Sandy Dave
str.split( )is similar tosplit( ). It is used to activate split function in pandas data frame in Python.- In the code above, we created two new columns named
fnameandlnamestoring first and last name.
SQL LIKE Operator in Pandas DataFrame
In SQL, LIKE Statement is used to find out if a character string matches or contains a pattern. We can implement similar functionality in python using str.contains( ) function.
df2 = pd.DataFrame({"var1": ["AA_2", "B_1", "C_2", "a_2"],
"var2": ["X_2", "Y_1", "Z_2", "X2"]})
var1 var2 0 AA_2 X_2 1 B_1 Y_1 2 C_2 Z_2 3 a_2 X2
The following code tells Python to find rows containing either 'A' or 'B' in variable 'var1'.
df2['var1'].str.contains('A|B')
str.contains(pattern) is used to match pattern in Pandas Dataframe.
0 True 1 True 2 False 3 False
The above command returns FALSE against fourth row as the function is case-sensitive. To ignore case-sensitivity, we can use case=False parameter. See the working example below.
df2['var1'].str.contains('A|B', case=False)
In the following program, we are asking Python to subset data with condition - contain character values either A or B. It is equivalent to WHERE keyword in SQL.
df2[df2['var1'].str.contains('A|B', case=False)]
var1 var2 0 AA_2 X_2 1 B_1 Y_1 3 a_2 X2
Suppose you want only those values that have alphabet followed by '_'
df2[df2['var1'].str.contains('^[A-Z]_', case=False)]
^ is a token of regular expression which means begin with a particular item.
var1 var2 1 B_1 Y_1 2 C_2 Z_2 3 a_2 X2
Find position of a particular character or keyword
str.find(pattern) is used to find position of sub-string. In this case, sub-string is '_'.
df2['var1'].str.find('_')
0 2 1 1 2 1 3 1
Replace substring
str.replace(old_text,new_text,case=False) is used to replace a particular character(s) or pattern with some new value or pattern. In the code below, we are replacing _ with -- in variable var1.
df2['var1'].str.replace('_', '--', case=False)
0 AA--2 1 B--1 2 C--2 3 A--2
We can also complex patterns like the following program. + means item occurs one or more times. In this case, alphabet occurring 1 or more times.
df2['var1'].str.replace('[A-Z]+_', 'X', case=False)
0 X2 1 X1 2 X2 3 X2
Find length of string
len(string) is used to calculate length of string. In pandas data frame, you can apply str.len() for the same.
df2['var1'].str.len()
0 4 1 3 2 3 3 3
To find count of occurrence of a particular character (let's say, how many time 'A' appears in each row), you can use str.count(pattern) function.
df2['var1'].str.count('A')
Convert to lowercase and uppercase
str.lower() and str.upper() functions are used to convert string to lower and uppercase values.
#Convert to lower case mydf['custname'].str.lower() #Convert to upper case mydf['custname'].str.upper()
Remove Leading and Trailing Spaces
str.strip()removes both leading and trailing spaces.str.lstrip()removes leading spaces (at beginning).str.rstrip()removes trailing spaces (at end).
df1 = pd.DataFrame({'y1': [' jack', 'jill ', ' jesse ', 'frank ']})
df1['both']=df1['y1'].str.strip()
df1['left']=df1['y1'].str.lstrip()
df1['right']=df1['y1'].str.rstrip()
y1 both left right
0 jack jack jack jack
1 jill jill jill jill
2 jesse jesse jesse jesse
3 frank frank frank frank
Convert Numeric to String
With the use of str( ) function, you can convert numeric value to string.
myvariable = 4 mystr = str(myvariable)
Concatenate or Join Strings
By simply using +, you can join two string values.
x = "Deepanshu" y ="Bhalla" x+y
DeepanshuBhalla
In case you want to add a space between two strings, you can use this - x+' '+y returns Deepanshu Bhalla
Suppose you have a list containing multiple string values and you want to combine them. You can use join( ) function.
string0 = ['Ram', 'Kumar', 'Singh'] ' '.join(string0)
'Ram Kumar Singh'
Suppose you want to combine or concatenate two columns of pandas dataframe.
mydf['fullname'] = mydf['fname'] + ' ' + mydf['lname']OR
mydf['fullname'] = mydf[['fname', 'lname']].apply(lambda x: ' '.join(x), axis=1)
custname fname lname fullname
0 Priya_Sehgal Priya Sehgal Priya Sehgal
1 David_Stevart David Stevart David Stevart
2 Kasia_Woja Kasia Woja Kasia Woja
3 Sandy_Dave Sandy Dave Sandy Dave
SQL IN Operator in Pandas
We can use isin(list) function to include multiple values in our filtering or subsetting criteria.
mydata = pd.DataFrame({'product': ['A', 'B', 'B', 'C','C','D','A']})
mydata[mydata['product'].isin(['A', 'B'])]
product 0 A 1 B 2 B 6 A
We can use sign ~ to tell python to negate the condition.
mydata[~mydata['product'].isin(['A', 'B'])]
Extract a particular pattern from string
str.extract(r'regex-pattern') is used for this task.
df2['var1'].str.extract(r'(^[A-Z]_)')
r'(^[A-Z]_)' means starts with A-Z and then followed by '_'
0 NaN 1 B_ 2 C_ 3 NaN
To remove missing values, we can use dropna( ) function.
df2['var1'].str.extract(r'(^[A-Z]_)').dropna()

Thanks to Listendata for an excellent article. I love Listendata.
ReplyDeleteNice article, it was very helpful to me. Thanks!
ReplyDelete