This article will show you how to filter data in R using dplyr, data.table packages and base R, along with examples.
Filtering data is a basic data analysis task which allows you to extract a portion of data that meets some criteria. It's similar to the "WHERE" part in SQL or the "filter" function in MS Excel to choose certain rows based on certain rules.
We will use the dataset containing details of flights departing from NYC in 2013. This dataset has 336,776 rows and 17 columns. To import dataset, we can use the read.csv( )
function.
df <- read.csv("https://raw.githubusercontent.com/JackyP/testing/master/datasets/nycflights.csv", header = TRUE)
Filter DataFrame
Select flights details for JetBlue Airways that has origin from JFK
airport
dplyr
Make sure dplyr package is already installed before submitting the following code. If it is not installed, you can install it by using the command install.packages("dplyr")
.
library(dplyr) newdf <- df %>% filter(origin == "JFK")
In dplyr, you use the %>%
pipe operator to chain operations. The filter()
function is used to filter rows based on the specified conditions.
data table
Make sure data.table package is already installed before submitting the following code. If it is not installed, you can install it by using the command install.packages("data.table")
.
library(data.table) dt <- as.data.table(df) newdf <- dt[origin == "JFK"]
In data.table, you first convert the data frame to a data.table using as.data.table()
, and then you can use data.table's syntax for subsetting and filtering.
Base R
df$origin == "JFK" checks if the "origin" column of the data frame "df" is equal to "JFK".
newdf <- df[df$origin == "JFK", ]
Result: The filtered dataset named newdf
has 111,279 rows.
Filter DataFrame Based on Multiple Conditions
Example 1: AND Condition
Select flights details for JetBlue Airways that has a carrier code ofB6
with origin fromJFK
airport.
In R, the &
symbol is used as a logical operator to perform "AND" conditions. It combines two conditions, and only the rows where both conditions are TRUE will be selected. This effectively filters the data frame "df" to only include rows where both the "origin" is "JFK" and the "carrier" is "B6".
dplyr
library(dplyr) newdf <- df %>% filter(origin == "JFK" & carrier == "B6")
In dplyr, you use the %>%
pipe operator to chain operations. The filter()
function is used to filter rows based on the specified conditions.
data table
library(data.table) dt <- as.data.table(df) newdf <- dt[origin == "JFK" & carrier == "B6"]
In data.table, you first convert the data frame to a data.table using as.data.table()
, and then you can use data.table's syntax for subsetting and filtering.
Base R
newdf <- df[df$origin == "JFK" & df$carrier == "B6", ]
Result: The filtered dataset named newdf
has 42,076 rows.
Example 2: OR Condition
Select flights details for JetBlue Airways that has a carrier code ofB6
OR origin fromJFK
airport
In R, the |
symbol is used as a logical operator to perform "OR" conditions. It is used to combine two conditions, and the resulting condition is true if at least one of the conditions is true.
dplyr
library(dplyr) newdf <- df %>% filter(origin == "JFK" | carrier == "B6")
data table
library(data.table) dt <- as.data.table(df) newdf <- dt[origin == "JFK" | carrier == "B6"]
Base R
newdf <- df[df$origin == "JFK" | df$carrier == "B6", ]
Result: The filtered dataset named newdf
has 123,838 rows.
Example 3: Multiple OR Conditions
Select flights details for JetBlue Airways with a carrier code of eitherB6
,US
orAA
.
In R, the %in%
operator is used in place of writing multiple OR conditions. It checks whether elements belong to a specific vector. In this case, we are using it to filter rows based on whether the "carrier" column contains values that are in the vector ("B6", "US", "AA").
dplyr
library(dplyr) newdf <- df %>% filter(carrier %in% c("B6", "US", "AA"))
data table
library(data.table) dt <- as.data.table(df) newdf <- dt[carrier %in% c("B6", "US", "AA")]
Base R
newdf <- df[df$carrier %in% c("B6", "US", "AA"), ]
Result: The filtered dataset named newdf
has 107,900 rows.
Example 4: NOT EQUAL TO Condition
Suppose you want to remove all the flight details where origin is from JFK.
In R, the !=
implies NOT EQUAL TO condition.
dplyr
library(dplyr) newdf <- df %>% filter(origin != "JFK")
data table
library(data.table) dt <- as.data.table(df) newdf <- dt[origin != "JFK"]
Base R
newdf <- df[df$origin != "JFK", ]
Result: The filtered dataset named newdf
has 225,497 rows.
Example 5: NOT IN Condition
Suppose you want to remove all the flight details with a carrier code of eitherB6
,US
orAA
.
In R, the !
operator is used to negate the logical condition, effectively selecting rows where the "carrier" column is not in the specified set ("B6", "US", "AA"). This achieves the "NOT IN" condition, excluding rows with the specified values.
dplyr
library(dplyr) newdf <- df %>% filter(!(carrier %in% c("B6", "US", "AA")))
data table
library(data.table) dt <- as.data.table(df) newdf <- dt[!(carrier %in% c("B6", "US", "AA"))]
Base R
newdf <- df[!(df$carrier %in% c("B6", "US", "AA")), ]
Result: The filtered dataset named newdf
has 228,876 rows.
Filter Dataframe by Row and Column Position
Suppose you want to select specific rows and columns by their position. For example, you want to select rows from the second to the fifth, including the first two columns.
dplyr
The following code uses the slice() and select() functions which are part of the dplyr package in R. The slice() function is used to extract specific rows from a dataframe. The select() function is used to extract specific columns from a dataframe.
library(dplyr) # Second to Fifth row newdf1 <- df %>% slice(2:5) # Fifth row and 1st column newdf2 <- df %>% slice(5) %>% select(1) # Third to Seventh row, 2nd and 3rd column newdf3 <- df %>% slice(3:7) %>% select(2:3)
data table
library(data.table) dt <- as.data.table(df) # Second to Fifth row newdf1 <- dt[2:5] # Fifth row and 1st column newdf2 <- dt[5, 1] # Third to Seventh row, 2nd and 3rd column newdf3 <- dt[3:7, 2:3]
Base R
newdf1 <- df[2:5,] #Second to Fifth row newdf2 <- df[5,1] #Fifth row and 1st column newdf3 <- df[3:7,2:3] #Third to Seventh row, 2nd and 3rd column
Filter Dataframe by Rows Position and Column Names
Here we are selecting rows from the third to the seventh row, including "origin" and "dest" columns.
dplyr
library(dplyr) # Third to Seventh row, including "origin" and "dest" columns newdf <- df %>% slice(3:7) %>% select(c("origin","dest"))
data table
library(data.table) dt <- as.data.table(df) # Third to Seventh row, including "origin" and "dest" columns newdf <- dt[3:7, c("origin","dest")]
Base R
# Third to Seventh row, including "origin" and "dest" columns newdf <- df[3:7, c("origin","dest")]
Select Non-Missing Data in Dataframe
In R, we can use the is.na()
function to identify missing (NA) values. In the example below, we are removing missing values from origin column. Since this dataframe does not contain any blank values, you would find the same number of rows in "newdf_non_missing" dataframe.
dplyr
library(dplyr) # Select rows where origin is not missing newdf_non_missing <- df %>% filter(!is.na(origin))
data table
library(data.table) dt <- as.data.table(df) # Select rows where origin is not missing newdf_non_missing <- dt[!is.na(origin)]
Base R
# Select rows where origin is not missing newdf_non_missing <- df[!is.na(df$origin),]
Choosing the Right R Package for Filtering
- Use dplyr: When you need a clean, intuitive syntax and a wide range of data manipulation functions. It's great for interactive analysis and code readability.
- Use data.table: When working with large datasets and aiming to optimize for speed and memory usage. It's especially useful for performance-critical tasks.
- Use Base R: When dealing with simpler filtering tasks or aiming to avoid adding external package dependencies.
Share Share Tweet