How to Extract Factor Variables from DataFrame in R

In R, you can extract factor variables from a data frame in several ways. This tutorial covers the common approaches in base R and dplyr.

Let's create a sample data frame called mydata with 4 variables (ID, Gender, Region, Grade).

# Create a sample data frame
mydata <- data.frame(
  ID = 1:5,
  Gender = c("Male", "Female", "Male", "Male", "Female"),
  Region = c("North", "South", "East", "West", "North"),
  Grade = factor(c("A", "B", "A", "C", "B"))
)

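You can use the str() function to see the structure of a data frame and the data type of each column. As shown in the result below, "Grade" is the only factor variable. "Gender" and "Region" are stored as character (chr) because data.frame() does not convert strings to factors by default in R 4.0.0 and later.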

str(mydata)
'data.frame':	5 obs. of  4 variables:
 $ ID    : int  1 2 3 4 5
 $ Gender: chr  "Male" "Female" "Male" "Male" ...
 $ Region: chr  "North" "South" "East" "West" ...
 $ Grade : Factor w/ 3 levels "A","B","C": 1 2 1 3 2
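
str() prints a summary meant for reading. To get the same information in a form you can use in code, one option is to test each column with is.factor(). The sketch below is a minimal illustration; the object names is_fct and factor_names are chosen here for the example.

```r
mydata <- data.frame(
  ID = 1:5,
  Gender = c("Male", "Female", "Male", "Male", "Female"),
  Region = c("North", "South", "East", "West", "North"),
  Grade = factor(c("A", "B", "A", "C", "B"))
)

# Named logical vector: TRUE where the column is a factor
is_fct <- vapply(mydata, is.factor, logical(1))
print(is_fct)

# Names of the factor columns only
factor_names <- names(mydata)[is_fct]
print(factor_names)
```

vapply() behaves like sapply() here but also checks that every result is a single logical value.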

How to Extract all Factor Variables in R

In the data frame "mydata", we already know that "Grade" is a factor variable. In a data frame with many variables, however, we usually don't know the names of the factor variables in advance, so we need a way to detect them.

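In base R, you can extract all factor columns (variables) using the sapply function. sapply belongs to the apply family of functions, which apply a function to each element of an object. Here it applies is.factor to every column and returns a logical vector that is used to subset the data frame.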

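In the dplyr package, the select_if function selects columns based on a condition. In this case, is.factor selects only the factor columns. Note that select_if is superseded in dplyr 1.0.0 and later; it still works, but select(where(is.factor)) is the recommended form.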

Base R

factor_columns <- mydata[sapply(mydata, is.factor)]
print(factor_columns)

dplyr

library(dplyr)

# Select factor columns using select_if()
factor_columns <- mydata %>% select_if(is.factor)
print(factor_columns)
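
As a minimal sketch of two alternatives, the same selection can be written with select(where()) in dplyr, or with Filter() in base R. This assumes dplyr 1.0.0 or later is installed; the object names factor_columns_where and factor_columns_filter are chosen here for illustration.

```r
library(dplyr)

mydata <- data.frame(
  ID = 1:5,
  Gender = c("Male", "Female", "Male", "Male", "Female"),
  Region = c("North", "South", "East", "West", "North"),
  Grade = factor(c("A", "B", "A", "C", "B"))
)

# dplyr: where() applies the predicate to each column
factor_columns_where <- mydata %>% select(where(is.factor))
print(factor_columns_where)

# Base R: Filter() keeps the columns for which is.factor() is TRUE
factor_columns_filter <- Filter(is.factor, mydata)
print(factor_columns_filter)
```

Both calls return a data frame containing only the "Grade" column.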

Extract Factor Variables with more than 2 Unique Levels in R

Let's modify the "mydata" data frame for demonstration purposes: "Region" is now a factor with 4 levels, and "Grade" is a factor with only 2 levels (A and B).

mydata <- data.frame(
  ID = 1:5,
  Gender = c("Male", "Female", "Male", "Male", "Female"),
  Region = factor(c("North", "South", "East", "West", "North")),
  Grade = factor(c("A", "B", "A", "B", "B"))
)

Base R

In this code, we're using the sapply function to iterate through each column of the "mydata" data frame. For each column, we check if it's of factor data type (is.factor(col)) and if it has more than 2 unique levels (nlevels(col) > 2).

# Extract factor columns with more than 2 unique categories
factor_cols0 <- sapply(mydata, function(col) is.factor(col) && nlevels(col) > 2)

# Select columns based on the extracted factor column indicators
factor_cols <- mydata[factor_cols0]
print(factor_cols)
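
To see why a column passes or fails the nlevels(col) > 2 condition, it can help to print the level count of each factor column first. This is a minimal sketch; the name level_counts is chosen here for illustration.

```r
mydata <- data.frame(
  ID = 1:5,
  Gender = c("Male", "Female", "Male", "Male", "Female"),
  Region = factor(c("North", "South", "East", "West", "North")),
  Grade = factor(c("A", "B", "A", "B", "B"))
)

# Keep only the factor columns, then count the levels of each one
level_counts <- vapply(Filter(is.factor, mydata), nlevels, integer(1))
print(level_counts)
```

"Region" has 4 levels and "Grade" has 2, so only "Region" satisfies the condition.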

dplyr

In this code, we're using the dplyr package to work with data frames. The select_if function is used to select columns based on a condition. In this case, we're selecting columns that are of factor data type (is.factor(col)) and have more than 2 unique categories (nlevels(col) > 2).

library(dplyr)
factor_cols <- mydata %>%
  select_if(function(col) is.factor(col) && nlevels(col) > 2)

print(factor_cols)
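
One caveat: nlevels() counts the levels declared on the factor, not the values that actually occur in the data. The sketch below uses a hypothetical data frame df in which "Grade" declares a level "C" that never appears, and compares the result with and without droplevels(). The names df, by_declared and by_observed are assumptions made for this example.

```r
# Hypothetical data: level "C" is declared for Grade but never observed
df <- data.frame(
  Region = factor(c("North", "South", "East", "West", "North")),
  Grade = factor(c("A", "B", "A", "B", "B"), levels = c("A", "B", "C"))
)

# Condition based on declared levels
declared <- vapply(df, function(col) is.factor(col) && nlevels(col) > 2, logical(1))
by_declared <- names(df)[declared]

# Condition based on levels that actually occur in the data
observed <- vapply(df, function(col) is.factor(col) && nlevels(droplevels(col)) > 2, logical(1))
by_observed <- names(df)[observed]

print(by_declared)
print(by_observed)
```

If unused levels are possible in your data, wrapping the column in droplevels() before counting may be the safer choice.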

Extracting Factor Variables with No Missing Values in R

Let's say you want to keep factor variables that have no missing values in R.

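# Create a sample data frame
# name and city are wrapped in factor() because data.frame() keeps
# strings as character in R 4.0.0 and later
mydata <- data.frame(
  name = factor(c("Alice", "Bob", "Charlie", "Jon")),
  city = factor(c("Los Angeles", "New York", "Dallas", NA)),
  height = c(165.5, 180.0, 172.3, 181)
)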

Base R

factor_cols <- sapply(mydata, is.factor)
factor_no_missing <- colSums(is.na(mydata[factor_cols])) == 0
factor_no_missing_cols <- mydata[factor_cols][factor_no_missing]

Here's a step-by-step breakdown of the code:

  1. factor_cols <- sapply(mydata, is.factor):
    • This line creates a logical vector factor_cols where each element corresponds to a column in the dataframe mydata.
    • It checks whether each column is a factor using the is.factor() function.
  2. factor_no_missing <- colSums(is.na(mydata[factor_cols])) == 0:
    • This line calculates a logical vector factor_no_missing which indicates for each factor column whether it has no missing values (NA).
    • mydata[factor_cols] subsets the original dataframe to include only the factor columns.
    • is.na(mydata[factor_cols]) creates a logical matrix with TRUE where there are missing values and FALSE otherwise.
    • colSums(is.na(mydata[factor_cols])) calculates the count of missing values in each factor column.
    • colSums(is.na(mydata[factor_cols])) == 0 checks whether the count of missing values in each column is equal to zero.
  3. factor_no_missing_cols <- mydata[factor_cols][factor_no_missing]:
    • This line creates a new dataframe factor_no_missing_cols.
    • mydata[factor_cols] subsets the original dataframe to include only the factor columns.
    • [factor_no_missing] then further subsets these factor columns using the factor_no_missing logical vector.
    • This subset operation effectively keeps only the columns that are both factor and have no missing values.
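
The steps above can be sketched as a small reusable function. keep_complete_factors is a hypothetical helper name, not part of base R, and the example assumes that name and city are stored as factors.

```r
# Hypothetical helper: keep factor columns that contain no NA values
keep_complete_factors <- function(df) {
  is_fct <- vapply(df, is.factor, logical(1))                 # step 1
  no_na <- vapply(df, function(col) !anyNA(col), logical(1))  # step 2
  df[is_fct & no_na]                                          # step 3
}

mydata <- data.frame(
  name = factor(c("Alice", "Bob", "Charlie", "Jon")),
  city = factor(c("Los Angeles", "New York", "Dallas", NA)),
  height = c(165.5, 180.0, 172.3, 181)
)

result <- keep_complete_factors(mydata)
print(names(result))
```

Only "name" is returned: "city" is a factor but contains an NA, and "height" is numeric.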

dplyr

If you want to keep columns that have no missing values, you can use the select() function with where() in dplyr. select(where(is.factor)) selects only the factor columns. select(where(~ all(!is.na(.)))) selects columns where all values are not missing (NA).

library(dplyr)

factor_no_missing_cols <- mydata %>%
  select(where(is.factor)) %>%
  select(where(~ all(!is.na(.))))
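
The two select() calls can also be combined into one predicate. This is a minimal sketch assuming dplyr 1.0.0 or later and assuming that name and city are stored as factors; anyNA(.x) is used as a shorter equivalent of any(is.na(.x)).

```r
library(dplyr)

mydata <- data.frame(
  name = factor(c("Alice", "Bob", "Charlie", "Jon")),
  city = factor(c("Los Angeles", "New York", "Dallas", NA)),
  height = c(165.5, 180.0, 172.3, 181)
)

# Keep columns that are factors AND contain no missing values
factor_no_missing_cols <- mydata %>%
  select(where(~ is.factor(.x) && !anyNA(.x)))

print(names(factor_no_missing_cols))
```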
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
