Python : 10 Ways to Filter Pandas DataFrame

In this article, we will cover various methods to filter pandas dataframe in Python.

Data filtering is a common way to select specific rows from a dataset based on some conditions. It is similar to the WHERE clause in SQL or the filter feature in Excel. Python has an efficient way to perform filtering using the pandas library which is built on top of the NumPy library.

Syntax : Filtering

import pandas as pd
filtered_df = dataframe.query('your_query')

dataframe: Replace this with the name of the pandas DataFrame you want to filter.

Examples

Basic Example

filtered_df = df.query('age > 25')

Multiple Conditions

filtered_df = df.query('age > 25 and gender == "Male"')

Using Variables

age_threshold = 25
gender_value = 'Male'
filtered_df = df.query('age > @age_threshold and gender == @gender_value')

Import Data

Make sure pandas package is already installed before submitting the following code. You can check it by running !pip show pandas statement in Ipython console. If it is not installed, you can install it by using the command !pip install pandas.

We are going to use dataset containing details of flights departing from NYC in 2013. This dataset has 336776 rows and 16 columns. See column names below. To import dataset, we are using read_csv( ) function from pandas package.

['year', 'month', 'day', 'dep_time', 'dep_delay', 'arr_time',
       'arr_delay', 'carrier', 'tailnum', 'flight', 'origin', 'dest',
       'air_time', 'distance', 'hour', 'minute']

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/JackyP/testing/master/datasets/nycflights.csv", usecols=range(1,17))

Filter pandas dataframe by column value

Condition : Select flights details of JetBlue Airways that has 2 letters carrier code B6 with origin from JFK airport

Method 1 : DataFrame Way

newdf = df[(df.origin == "JFK") & (df.carrier == "B6")]

newdf.head()

Output

Out[23]: 
    year  month  day  dep_time  ...  air_time  distance  hour minute
3   2013      1    1     544.0  ...     183.0      1576   5.0   44.0
8   2013      1    1     557.0  ...     140.0       944   5.0   57.0
10  2013      1    1     558.0  ...     149.0      1028   5.0   58.0
11  2013      1    1     558.0  ...     158.0      1005   5.0   58.0
15  2013      1    1     559.0  ...      44.0       187   5.0   59.0

[5 rows x 16 columns]

Filtered data (after subsetting) is stored on new dataframe called newdf.
Symbol & refers to AND condition which means meeting both the criteria.
This part of code (df.origin == "JFK") & (df.carrier == "B6") returns True / False. True where condition matches and False where the condition does not hold. Later it is passed within df and returns all the rows corresponding to True. It returns 4166 rows.

Method 2 : Query Function

In pandas package, there are multiple ways to perform filtering. The above code can also be written like the code shown below. This method is elegant and more readable and you don't need to mention dataframe name everytime when you specify columns (variables).

newdf = df.query('origin == "JFK" & carrier == "B6"')

How to pass variables in query function

Method 3 : loc function

loc is an abbreviation of location term. All these 3 methods return same output. It's just a different ways of doing filtering rows.

newdf = df.loc[(df.origin == "JFK") & (df.carrier == "B6")]

Filter Pandas Dataframe by Row and Column Position

Suppose you want to select specific rows by their position (let's say from second through fifth row). We can use df.iloc[ ] function for the same.

Indexing in python starts from zero. df.iloc[0:5,] refers to first to fifth row (excluding end point 6th row here). df.iloc[0:5,] is equivalent to df.iloc[:5,]

df.iloc[:5,] #First 5 rows
df.iloc[1:5,] #Second to Fifth row
df.iloc[5,0] #Sixth row and 1st column
df.iloc[1:5,0] #Second to Fifth row, first column
df.iloc[1:5,:5] #Second to Fifth row, first 5 columns
df.iloc[2:7,1:3] #Third to Seventh row, 2nd and 3rd column

Difference between loc and iloc function

loc considers rows based on index labels. Whereas iloc considers rows based on position in the index so it only takes integers.

Let's create a sample data for illustration

import numpy as np
x = pd.DataFrame({"col1" : np.arange(1,20,2)}, index=[9,8,7,6,0, 1, 2, 3, 4, 5])

Output

iloc - Index Position

x.iloc[0:5]

Output

Selecting rows based on index or row position

loc - Index Label

x.loc[0:5]

Output

Selecting rows based on labels of index

How x.loc[0:5] returns 6 rows (inclusive of 5 which is 6th element)?

It is because loc does not produce output based on index position. It considers labels of index only which can be alphabet as well and includes both starting and end point. Refer the example below.

x = pd.DataFrame({"col1" : range(1,5)}, index=['a','b','c','d'])
x.loc['a':'c'] # equivalent to x.iloc[0:3]

Output

Filter pandas dataframe by rows position and column names

Here we are selecting first five rows of two columns named origin and dest.

df.loc[df.index[0:5],["origin","dest"]]

df.index returns index labels. df.index[0:5] is required instead of 0:5 (without df.index) because index labels do not always in sequence and start from 0. It can start from any number or even can have alphabet letters. Refer the example where we showed comparison of iloc and loc.

Selecting multiple values of a column

Suppose you want to include all the flight details where origin is either JFK or LGA.

# Long Way
newdf = df.loc[(df.origin == "JFK") | (df.origin == "LGA")]

# Smart Way
newdf = df[df.origin.isin(["JFK", "LGA"])]

| implies OR condition which means any of the conditions holds True. isin( ) is similar to IN operator in SAS and R which can take many values and apply OR condition. Make sure you specify values in list [ ].

Select rows whose column value does not equal a specific value

In this example, we are deleting all the flight details where origin is from JFK. != implies NOT EQUAL TO.

newdf = df.loc[(df.origin != "JFK") & (df.carrier == "B6")]

Let's check whether the above line of code works fine or not by looking at unique values of column origin in newdf.

pd.unique(newdf.origin)

# ['LGA', 'EWR']

How to negate the whole condition

Tilde ~ is used to negate the condition. It is equivalent to NOT operator in SAS and R.

newdf = df[~((df.origin == "JFK") & (df.carrier == "B6"))]

Select Non-Missing Data in Pandas Dataframe

With the use of notnull() function, you can exclude or remove NA and NAN values. In the example below, we are removing missing values from origin column. Since this dataframe does not contain any blank values, you would find same number of rows in newdf.

newdf = df[df.origin.notnull()]

Filtering String in Pandas Dataframe

It is generally considered tricky to handle text data. But python makes it easier when it comes to dealing character or string columns. Let's prepare a fake data for example.

import pandas as pd 
df = pd.DataFrame({"var1": ["AA_2", "B_1", "C_2", "A_2"]})

Output

   var1
0  AA_2
1   B_1
2   C_2
3   A_2

Select rows having values starting from letter 'A'

By using .str, you can enable string functions and can apply on pandas dataframe. str[0] means first letter.

df[df['var1'].str[0] == 'A']

Filter rows having string length greater than 3

len( ) function calculates length of iterable.

df[df['var1'].str.len()>3]

Select string containing letters A or B

contains( ) function is similar to LIKE statement in SQL and SAS. You can subset data by mentioning pattern in contains( ) function.

df[df['var1'].str.contains('A|B')]

Output

   var1
0  AA_2
1   B_1
3   A_2

Handle space in column name while filtering

Let's rename a column var1 with a space in between var 1 We can rename it by using rename function.

df.rename(columns={'var1':'var 1'}, inplace = True)

By using backticks ` ` we can include the column having space. See the example code below.

 newdf = df.query("`var 1` == 'AA_2'")

Backticks are supported from version 0.25 of pandas package. Run this command in console to check pandas version !pip show pandas If you have version prior to the version 0.25 you can upgrade it by using this command !pip install --upgrade pandas --user

How to filter data without using pandas package

You can perform filtering using pure python methods without dependency on pandas package.

Warning : Methods shown below for filtering are not efficient ones. The main objective of showing the following methods is to show how to do subsetting without using pandas package. In your live project, you should use pandas' builtin functions (query( ), loc[ ], iloc[ ]) which are explained above.

We don't need to create a dataframe to store data. We can stock it in list data structure. lst_df contains flights data which were imported from CSV file.

import csv
import requests

response = requests.get('https://dyurovsky.github.io/psyc201/data/lab2/nycflights.csv').text
lines = response.splitlines()
d = csv.DictReader(lines)
lst_df = list(d)

Lambda Method for Filtering

Lambda is an alternative way of defining user defined function. With the use of lambda, you can define function in a single line of code. You can check out this link to learn more about it.

l1 = list(filter(lambda x: x["origin"] == 'JFK' and x["carrier"] == 'B6', lst_df))

If you are wondering how to use this lambda function on a dataframe, you can submit the code below.

newdf = df[df.apply(lambda x: x["origin"] == 'JFK' and x["carrier"] == 'B6', axis=1)]

List Comprehension Method for Filtering

List comprehension is an alternative to lambda function and makes code more readable. Detailed Tutorial : List Comprehension

l2 = list(x for x in lst_df if x["origin"] == 'JFK' and x["carrier"] == 'B6')

You can use list comprehension on dataframe like the way shown below.

newdf = df.iloc[[index for index,row in df.iterrows() if row['origin'] == 'JFK' and row['carrier'] == 'B6']]

Create Class for Filtering

Python is an object-oriented programming language in which code is implemented using class.

class filter:
  def __init__(self, l, query):
    self.output = []
    for data in l:
      if eval(query):
        self.output.append(data)
 
l3 = filter(lst_df, 'data["origin"] == "JFK" and data["carrier"] == "B6"').output

About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

While I love having friends who agree, I only learn from those who don't
Let's Get Connected Email LinkedIn

Post Comment 24 Responses to "Python : 10 Ways to Filter Pandas DataFrame"

Sauna JoyJuly 13, 2019 at 7:07 AM
This is actually pretty good. All types sumed up in one place. Kudos!
ReplyDelete
Replies
RDSJuly 13, 2019 at 9:58 AM
It's very gud.They have given a clean and clear cut clartiy on all the ways of filtering the dataframe with example.
ReplyDelete
Replies
MeenaJagannathJuly 13, 2019 at 8:33 PM
fantastic coverage.......keep it up!
ReplyDelete
Replies
PlabanJuly 14, 2019 at 3:02 AM
Very good .. Covers most of areas 👍
ReplyDelete
Replies
Aneesha RaoJuly 14, 2019 at 8:34 AM
Something to note how x.loc[0:5] is inclusive of 5 i.e. the sixth element.
Very well articulated. I loved reading this article.
ReplyDelete
Replies
Imagine WorldAugust 21, 2019 at 10:10 AM
Nice One
ReplyDelete
Replies
Shukla JiAugust 23, 2019 at 6:34 AM
In not operator case, you meant to say that deleting rows where origin is JFK, right?
ReplyDelete
Replies
UnknownNovember 11, 2019 at 10:22 PM
Thank you Deepanshu. your explanation is easy to follow. Kudos 1000x
ReplyDelete
Replies
ArthurJanuary 18, 2020 at 3:25 PM
Thank you. Very well explained iloc and loc difference.
ReplyDelete
Replies
AnonymousJune 9, 2020 at 4:36 AM
Thankyou soo much this helped me a lot.
ReplyDelete
Replies
AnonymousJune 17, 2020 at 3:48 AM
Very helpful. Thank u!
ReplyDelete
Replies
AkalyaSuryakumarAugust 17, 2020 at 9:30 AM
very helpful
ReplyDelete
Replies
UnknownNovember 1, 2020 at 7:07 AM
what about cases where you need to filter rows by two or more columns that exist in another df?
you can't use lists... you need that the pairs or triplets will match.
easy to do in a for loop but is there a way to implement in vectorization way not with join/merge?
ReplyDelete
Replies
UnknownDecember 15, 2020 at 11:38 AM
Thanks, i was struggling to add variables in the query. This can be done with @variable . Maybe you can add this info also.
ReplyDelete
Replies
UnknownJanuary 29, 2021 at 2:49 PM
This is just great! thank you for sharing
ReplyDelete
Replies
jsMarch 11, 2021 at 8:12 PM
Any way to make Method 1 print all properties instead of ... for the midrange properties?
ReplyDelete
Replies
BEMMarch 24, 2021 at 7:02 PM
So, one doesn't filter using DataFrame.filter() ?
ReplyDelete
Replies
MontyJanuary 30, 2023 at 2:42 PM
Thanks Deepanshu ... nice blog. You have put lot of related solutions in a single place. Great work.
ReplyDelete
Replies

Add comment