10 Ways to Filter DataFrame in R

This article will show you how to filter data in R using dplyr, data.table packages, and Base R. It includes examples to make it easy to understand. Filtering data is a basic data analysis task, allowing you to extract specific subsets of data that meet certain criteria. It's kind of like when you use the "WHERE" part in SQL or the "filter" function in MS Excel to choose certain rows based on certain rules.

Filter Data in R
Examples of Data Filtering

Here are a few real-world examples of data filtering.

  • Select customers who are currently active and opened their accounts after January 1st, 2023.
  • Get details of customers who made more than 6 transactions in the past six months.
  • Fetch information about employees who have spent over 3 years in the organization and received the highest rating in the last 2 years.
  • Analyze complaint data to identify customers who submitted more than 5 complaints in the last year.
  • Extract information about cities where the per capita income exceeds $50,000.
Import Data

We will use the dataset containing details of flights departing from NYC in 2013. This dataset has 336,776 rows and 17 columns. To import dataset, we are using the read.csv( ) function.

df <- read.csv("https://raw.githubusercontent.com/JackyP/testing/master/datasets/nycflights.csv", header = TRUE)
Table of Contents

Filter DataFrame

Select flights details for JetBlue Airways that has origin from JFK airport

dplyr

Make sure dplyr package is already installed before submitting the following code. If it is not installed, you can install it by using the command install.packages("dplyr").

library(dplyr)
newdf <- df %>%
  filter(origin == "JFK")

In dplyr, you use the %>% pipe operator to chain operations. The filter() function is used to filter rows based on the specified conditions.

data table

Make sure data.table package is already installed before submitting the following code. If it is not installed, you can install it by using the command install.packages("data.table").

library(data.table)
dt <- as.data.table(df)
newdf <- dt[origin == "JFK"]

In data.table, you first convert the data frame to a data.table using as.data.table(), and then you can use data.table's syntax for subsetting and filtering.

Base R

df$origin == "JFK" checks if the "origin" column of the data frame "df" is equal to "JFK".

newdf <- df[df$origin == "JFK", ]

Result: The filtered dataset named newdf has 111,279 rows.

Filter DataFrame Based on Multiple Conditions

Example 1: AND Condition

Select flights details for JetBlue Airways that has a carrier code of B6 with origin from JFK airport.

In R, the & symbol is used as a logical operator to perform "AND" conditions. It combines two conditions, and only the rows where both conditions are TRUE will be selected. This effectively filters the data frame "df" to only include rows where both the "origin" is "JFK" and the "carrier" is "B6".

dplyr

library(dplyr)
newdf <- df %>%
  filter(origin == "JFK" & carrier == "B6")

In dplyr, you use the %>% pipe operator to chain operations. The filter() function is used to filter rows based on the specified conditions.

data table

library(data.table)
dt <- as.data.table(df)
newdf <- dt[origin == "JFK" & carrier == "B6"]

In data.table, you first convert the data frame to a data.table using as.data.table(), and then you can use data.table's syntax for subsetting and filtering.

Base R

newdf <- df[df$origin == "JFK" & df$carrier == "B6", ]

Result: The filtered dataset named newdf has 42,076 rows.

Example 2: OR Condition

Select flights details for JetBlue Airways that has a carrier code of B6 OR origin from JFK airport

In R, the | symbol is used as a logical operator to perform "OR" conditions. It is used to combine two conditions, and the resulting condition is true if at least one of the conditions is true.

dplyr

library(dplyr)
newdf <- df %>%
  filter(origin == "JFK" | carrier == "B6")

data table

library(data.table)
dt <- as.data.table(df)
newdf <- dt[origin == "JFK" | carrier == "B6"]

Base R

newdf <- df[df$origin == "JFK" | df$carrier == "B6", ]

Result: The filtered dataset named newdf has 123,838 rows.

Example 3: Multiple OR Conditions

Select flights details for JetBlue Airways with a carrier code of either B6, US or AA.

In R, the %in% operator is used in place of writing multiple OR conditions. It checks whether elements belong to a specific vector. In this case, we are using it to filter rows based on whether the "carrier" column contains values that are in the vector ("B6", "US", "AA").

dplyr

library(dplyr)
newdf <- df %>%
  filter(carrier %in% c("B6", "US", "AA"))

data table

library(data.table)
dt <- as.data.table(df)
newdf <- dt[carrier %in% c("B6", "US", "AA")]

Base R

newdf <- df[df$carrier %in% c("B6", "US", "AA"), ]

Result: The filtered dataset named newdf has 107,900 rows.

Example 4: NOT EQUAL TO Condition

Suppose you want to remove all the flight details where origin is from JFK.

In R, the != implies NOT EQUAL TO condition.

dplyr

library(dplyr)
newdf <- df %>%
  filter(origin != "JFK")

data table

library(data.table)
dt <- as.data.table(df)
newdf <- dt[origin != "JFK"]

Base R

newdf <- df[df$origin != "JFK", ]

Result: The filtered dataset named newdf has 225,497 rows.

Example 5: NOT IN Condition

Suppose you want to remove all the flight details with a carrier code of either B6, US or AA.

In R, the ! operator is used to negate the logical condition, effectively selecting rows where the "carrier" column is not in the specified set ("B6", "US", "AA"). This achieves the "NOT IN" condition, excluding rows with the specified values.

dplyr

library(dplyr)
newdf <- df %>%
  filter(!(carrier %in% c("B6", "US", "AA")))

data table

library(data.table)
dt <- as.data.table(df)
newdf <- dt[!(carrier %in% c("B6", "US", "AA"))]

Base R

newdf <- df[!(df$carrier %in% c("B6", "US", "AA")), ]

Result: The filtered dataset named newdf has 228,876 rows.

Filter Dataframe by Row and Column Position

Suppose you want to select specific rows and columns by their position. For example, you want to select rows from the second to the fifth, including the first two columns.

dplyr

The following code uses the slice() and select() functions which are part of the dplyr package in R. The slice() function is used to extract specific rows from a dataframe. The select() function is used to extract specific columns from a dataframe.

library(dplyr)

# Second to Fifth row
newdf1 <- df %>% slice(2:5)

# Fifth row and 1st column
newdf2 <- df %>% slice(5) %>% select(1)

# Third to Seventh row, 2nd and 3rd column
newdf3 <- df %>% slice(3:7) %>% select(2:3)

data table

library(data.table)
dt <- as.data.table(df)

# Second to Fifth row
newdf1 <- dt[2:5]

# Fifth row and 1st column
newdf2 <- dt[5, 1]

# Third to Seventh row, 2nd and 3rd column
newdf3 <- dt[3:7, 2:3]

Base R

newdf1 <- df[2:5,] #Second to Fifth row
newdf2 <- df[5,1]  #Fifth row and 1st column
newdf3 <- df[3:7,2:3] #Third to Seventh row, 2nd and 3rd column

Filter Dataframe by Rows Position and Column Names

Here we are selecting rows from the third to the seventh row, including "origin" and "dest" columns.

dplyr

library(dplyr)

# Third to Seventh row, including "origin" and "dest" columns
newdf <- df %>% slice(3:7) %>% select(c("origin","dest"))

data table

library(data.table)
dt <- as.data.table(df)

# Third to Seventh row, including "origin" and "dest" columns
newdf <- dt[3:7, c("origin","dest")]

Base R

# Third to Seventh row, including "origin" and "dest" columns
newdf <- df[3:7, c("origin","dest")]

Select Non-Missing Data in Dataframe

In R, we can use the is.na() function to identify missing (NA) values. In the example below, we are removing missing values from origin column. Since this dataframe does not contain any blank values, you would find the same number of rows in "newdf_non_missing" dataframe.

dplyr

library(dplyr)

# Select rows where origin is not missing
newdf_non_missing <- df %>%
  filter(!is.na(origin))

data table

library(data.table)
dt <- as.data.table(df)

# Select rows where origin is not missing
newdf_non_missing <- dt[!is.na(origin)]

Base R

# Select rows where origin is not missing
newdf_non_missing <- df[!is.na(df$origin),]

Choosing the Right Approach for Filtering

  • Use dplyr: When you need a clean, intuitive syntax and a wide range of data manipulation functions. It's great for interactive analysis and code readability.
  • Use data.table: When working with large datasets and aiming to optimize for speed and memory usage. It's especially useful for performance-critical tasks.
  • Use Base R: When dealing with simpler filtering tasks or aiming to avoid adding external package dependencies.
Conclusion:

Filtering dataframes is a crucial step in data analysis, and R offers multiple methods to achieve this. Your choice between dplyr, data.table, or Base R depends on your specific needs, data complexity, and performance requirements. By mastering these filtering techniques, you'll be well-equipped to efficiently extract meaningful insights from your data.

Related Posts
Spread the Word!
Share
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

0 Response to "10 Ways to Filter DataFrame in R"

Post a Comment

Next → ← Prev