How to Delete Columns with Missing Values in R

This tutorial explains multiple methods for deleting columns with missing values in R.

Let's create a sample dataframe.

# Create a sample dataframe
data <- data.frame(
  A = c(1, 2, NA, 4),
  B = c(NA, 2, 3, 4),
  C = c(1, NA, 3, NA),
  D = c(1, 2, 3, 4)
)

As the table below shows, "D" is the only column with no missing values.

	A	B	C	D
1	1	NA	1	1
2	2	2	NA	2
3	NA	3	3	3
4	4	4	NA	4

Method 1: Remove columns with missing values using base R

In this example, colSums(is.na(data)) counts the missing values in each column, and colSums(is.na(data)) == 0 produces a logical vector that is TRUE for every column with no missing values. Indexing the dataframe with this vector keeps only those columns, which drops every column that contains at least one missing value.

data_clean <- data[colSums(is.na(data)) == 0]
print(data_clean)
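
To make the intermediate steps visible, here is a minimal sketch that prints each piece separately. The variable names na_counts and keep are introduced here only for illustration; they are not part of the original code.

```r
# Same sample data as above
data <- data.frame(
  A = c(1, 2, NA, 4),
  B = c(NA, 2, 3, 4),
  C = c(1, NA, 3, NA),
  D = c(1, 2, 3, 4)
)

# Number of NA values in each column: A = 1, B = 1, C = 2, D = 0
na_counts <- colSums(is.na(data))
print(na_counts)

# TRUE only for columns without any NA (here: D)
keep <- na_counts == 0
print(keep)

# With matrix-style indexing, drop = FALSE keeps the result a dataframe
# even when a single column survives
data_clean <- data[, keep, drop = FALSE]
print(data_clean)
```

The list-style form data[keep] used above always returns a dataframe. The matrix-style form data[, keep] would collapse a single remaining column into a plain vector unless drop = FALSE is supplied.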

Method 2: Remove columns with missing values using the dplyr package

The following code uses the select_if() function from the dplyr package, which selects columns that satisfy a condition. The condition ~ !any(is.na(.)) evaluates to TRUE for a column with no missing values, so only those columns are kept. Note that select_if() has been superseded since dplyr 1.0.0 in favor of select(where(...)); it still works, but new code may prefer the newer form.

library(dplyr)

# Remove columns with missing values using dplyr
data_clean <- data %>%
  select_if(~ !any(is.na(.)))

print(data_clean)
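
For readers on a recent dplyr, the same selection can be sketched with select() and where(). This assumes dplyr 1.0.0 or later is installed; anyNA(x) is a base R shortcut for any(is.na(x)).

```r
library(dplyr)

# Same sample data as above
data <- data.frame(
  A = c(1, 2, NA, 4),
  B = c(NA, 2, 3, 4),
  C = c(1, NA, 3, NA),
  D = c(1, 2, 3, 4)
)

# where() applies the predicate to each column and keeps those returning TRUE
data_clean <- data %>%
  select(where(~ !anyNA(.x)))

print(data_clean)
```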

Remove Columns with More Than X% Missing Values in R

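The following code keeps only the columns of the "data" dataframe in which less than 40% of the values are missing (NA), so any column with 40% or more missing values is removed. For the sample data it returns columns A, B and D, because column C is 50% missing.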

final <- data[colMeans(is.na(data)) < .4]
print(final)

is.na(data) returns a logical matrix with the same dimensions as data, containing TRUE where a value is missing (NA) and FALSE where it is not.
colMeans(is.na(data)) calculates the proportion of missing values in each column. TRUE counts as 1 and FALSE as 0, so the column mean is the share of missing values.
colMeans(is.na(data)) < .4 generates a logical vector indicating, for each column, whether the proportion of missing values is less than 0.4 (40%).
data[colMeans(is.na(data)) < .4] selects only those columns where the condition is TRUE, i.e., where less than 40% of the values are missing.
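
When the threshold needs to change, the steps above can be wrapped in a small function. This is only a sketch: drop_sparse_columns and its max_na argument are hypothetical names, not functions from base R or any package.

```r
# Hypothetical helper: keep columns whose share of NA is below max_na
drop_sparse_columns <- function(df, max_na = 0.4) {
  stopifnot(is.data.frame(df), max_na >= 0, max_na <= 1)
  na_share <- colMeans(is.na(df))  # proportion of NA per column
  df[na_share < max_na]            # strict "<", as in the code above
}

# Same sample data as above
data <- data.frame(
  A = c(1, 2, NA, 4),
  B = c(NA, 2, 3, 4),
  C = c(1, NA, 3, NA),
  D = c(1, 2, 3, 4)
)

# Proportion of NA per column: A = 0.25, B = 0.25, C = 0.50, D = 0.00
print(colMeans(is.na(data)))

final <- drop_sparse_columns(data, max_na = 0.4)
print(names(final))  # "A" "B" "D"
```

Because the comparison is strict, a column sitting exactly at the threshold is dropped: with max_na = 0.25 only column D would remain.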

Remove Columns with More Than X% Missing Values in R using the dplyr package

The following code uses the dplyr package to select columns from the data where the proportion of missing values (NA) is less than 0.4.

library(dplyr)
final <- data %>%
  select_if(~ mean(is.na(.)) < 0.4)
print(final)

~ mean(is.na(.)) < 0.4 is a formula-style lambda function that dplyr applies to each column, with the dot standing for the column. It calculates the proportion of missing values with mean(is.na(.)) and returns TRUE when that proportion is less than 0.4, so the column is kept.
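
To see what the lambda returns for each column, it can be written as an ordinary function and applied with base R's sapply(). This is an illustrative sketch; the name below_threshold is made up for this example.

```r
# Same sample data as above
data <- data.frame(
  A = c(1, 2, NA, 4),
  B = c(NA, 2, 3, 4),
  C = c(1, NA, 3, NA),
  D = c(1, 2, 3, 4)
)

# Ordinary-function equivalent of the formula ~ mean(is.na(.)) < 0.4
below_threshold <- function(x) mean(is.na(x)) < 0.4

# One TRUE/FALSE per column: A TRUE, B TRUE, C FALSE, D TRUE
keep <- sapply(data, below_threshold)
print(keep)

# Indexing with the logical vector gives the same columns as the dplyr code
final <- data[keep]
print(final)
```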

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
