In this tutorial, we will explain multiple ways to create dummy variables from a categorical variable in R.
What is Dummy Coding?
When you have a categorical variable with k levels, you can create k-1 dummy variables to represent it in a regression model. This is called dummy coding.
Example - Let's say you have a categorical variable called "continents" which has 4 levels: "Americas", "Europe", "Asia" and "Africa". In this case, you can represent it with 3 dummy variables. A common approach is to choose one level as the reference level (usually the first or the last) and create dummy variables for the remaining levels. You can select any level of the categorical variable as the reference. In this example, we selected "Africa" as the reference level, which is represented by 0s in all three dummy variables, as shown in the table below and the short sketch that follows it.
Continents | Continent_Americas | Continent_Europe | Continent_Asia
---|---|---|---
Americas | 1 | 0 | 0
Europe | 0 | 1 | 0
Asia | 0 | 0 | 1
Africa | 0 | 0 | 0
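As a quick sketch of how this looks in R (the four-value "continents" vector below is made up for illustration), relevel() sets "Africa" as the reference level and model.matrix() expands the factor into the dummy columns shown above:

continents <- factor(c("Americas", "Europe", "Asia", "Africa"))
# Make "Africa" the reference level so it is coded as all 0s
continents <- relevel(continents, ref = "Africa")
# The intercept column absorbs the reference level; the remaining
# columns are the k-1 dummies (ordered alphabetically by level)
model.matrix(~ continents)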
The following code creates a sample dataframe for demonstration purposes. It has a column called "status" which contains 5 unique values: "Very High", "High", "Medium", "Low" and "Very Low".
df <- data.frame(status = c("Very High", "Very High", "High", "High", "Medium", "Medium", "Low", "Low", "Very Low", "Very Low", "Very Low"))
Method 1: Create Dummy Variables using model.matrix() Function
The model.matrix() function in R builds a design matrix from a formula, expanding factors into dummy variables based on their levels. In the following code, we generate a matrix of dummy variables for the "status" column, one per unique level (the -1 in the formula removes the intercept term from the output). We then remove the last column to get k-1 dummy variables; since factor levels are sorted alphabetically, the dropped column is "Very Low", which becomes the reference level. The cbind() function then combines the original dataframe df with the newly created dummy columns.
# Convert 'status' to a factor to ensure proper encoding
df$status <- factor(df$status)

# Create dummy variables using model.matrix()
dummy_cols <- model.matrix(~ status - 1, data = df)

# Remove the last column of dummy_cols to get k-1 dummy variables
dummy_cols <- dummy_cols[, -ncol(dummy_cols)]

# Append the dummy columns to the original dataframe
df2 <- cbind(df, dummy_cols)
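As an aside (a base R sketch, not part of the original method, with dummy_cols_alt and df2_alt as illustrative names): if the first factor level is an acceptable reference, you can keep the intercept in the formula and drop the intercept column instead. Default treatment contrasts then give you the k-1 dummies directly, with "High" (the alphabetically first level) as the reference:

# Keep the intercept in the formula; default treatment contrasts
# already encode k-1 dummies, so dropping the intercept column
# (column 1) leaves "High" as the implicit reference level
dummy_cols_alt <- model.matrix(~ status, data = df)[, -1]
df2_alt <- cbind(df, dummy_cols_alt)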
Method 2: Create Dummy Variables using For Loop
The following code uses a for loop to create dummy variables in a dataframe. It iterates through each unique value of the "status" column except the first one (which therefore serves as the reference level) and creates a new column for each level with a binary indicator (1 or 0) showing whether the original "status" column has that particular level.
for (level in unique(df$status)[-1]) {
  df[paste("status", level, sep = "_")] <- ifelse(df$status == level, 1, 0)
}
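Note that unique() returns values in order of appearance, so the reference level above depends on how the rows happen to be sorted. A variant (a sketch, not part of the original tutorial) that ties the reference to the first factor level instead:

# Loop over factor levels; the first level (alphabetically "High")
# is skipped and therefore acts as the reference level
df$status <- factor(df$status)
for (level in levels(df$status)[-1]) {
  df[paste("status", level, sep = "_")] <- as.integer(df$status == level)
}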
Method 3: Create Dummy Variables using ifelse Function
We can also write multiple ifelse() calls to create dummy variables manually. This method is not practical when the categorical variable has many unique categories (say, "state" or "city"), since every level requires its own line of code.
library(dplyr)
df2 <- df %>%
  mutate(High = ifelse(status == "High", 1, 0),
         Medium = ifelse(status == "Medium", 1, 0),
         Low = ifelse(status == "Low", 1, 0),
         VeryLow = ifelse(status == "Very Low", 1, 0))
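When a variable has many levels, a helper package saves typing. This is a sketch assuming the fastDummies package is installed (it is not used elsewhere in this tutorial):

library(fastDummies)
# One dummy column per level of "status"; remove_first_dummy = TRUE
# drops the first level so it serves as the reference
df2 <- dummy_cols(df, select_columns = "status", remove_first_dummy = TRUE)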
How to Use Dummy Variables in a Regression Model
In this section, we will show you how to use dummy variables in a regression model. We have a categorical variable called Education_Level which has 3 categories: "High School", "College" and "Postgraduate". We will create 2 dummy variables, setting "Postgraduate" as the reference level. This means the level "Postgraduate" is coded as 0 in both dummy variables.
The following code fits a linear regression model (lm) to predict Income based on Age and Education level using the my_data dataset.
# Load the required library
library(dplyr)

# Create a sample dataset with 12 rows
my_data <- data.frame(
  Age = c(25, 30, 35, 40, 45, 50, 27, 32, 37, 42, 47, 52),
  Education_Level = rep(c("High School", "College", "Postgraduate"), each = 4),
  Income = c(40000, 50000, 55000, 60000, 65000, 70000,
             45000, 52000, 60000, 68000, 72000, 80000)
)

# Use the "ifelse" function to create the dummy variables
my_data <- my_data %>%
  mutate(High_School = ifelse(Education_Level == "High School", 1, 0),
         College = ifelse(Education_Level == "College", 1, 0))

# Fit the linear regression model
model <- lm(Income ~ Age + High_School + College, data = my_data)

# View the summary of the model
summary(model)

Output
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18405.26    2945.90   6.248 0.000246 ***
Age           1159.43      63.83  18.164 8.67e-08 ***
High_School  -4836.81    1344.23  -3.598 0.007001 ** 
College      -5043.41    1169.16  -4.314 0.002568 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1562 on 8 degrees of freedom
Multiple R-squared:  0.9874, Adjusted R-squared:  0.9826 
F-statistic: 208.5 on 3 and 8 DF,  p-value: 6.224e-08

Interpretation of the coefficients:
- The intercept (constant term) is 18405.26.
- The "Age" variable has a positive coefficient of 1159.43, suggesting that for each one year increase in "Age", "Income" is expected to increase by $1,159.43, all other variables being constant. Since the p-value is less than 0.05, this variable is a statistically significant predictor of income.
- The "High_School" variable has a negative coefficient of -4836.81, implying that person having a high school education is expected to earn less than the person having a post graduation degree by $4,836.81, all other variables being constant. Since the p-value is less than 0.05, this variable is a statistically significant predictor of income.
- The "College" variable also has a negative coefficient of -5043.41, indicating that person having a college education is expected to earn less than the person having a post graduation degree by $5,043.41. Since the p-value is less than 0.05, this variable is a statistically significant predictor of income.
- Residual Standard Error: This represents the estimated standard deviation of the residuals (the differences between the observed values and the predicted values). It gives an idea of the model's accuracy in predicting the response variable.
- Multiple R-squared: This is a measure of how well the regression model fits the data. It indicates the proportion of the variance in the response variable that is explained by the predictor variables. In this case, 98.74% of the variability in the response can be explained by the model.
- Adjusted R-squared: This is a modified version of the R-squared that takes into account the number of predictor variables and the sample size. It is useful when comparing models with different numbers of predictors.
- F-statistic: This is a test statistic that assesses the overall significance of the model. It compares the variability explained by the model to the variability not explained. Higher F-statistic values and lower p-values suggest that the model is significant and fits the data well.
- p-value: This is the probability of obtaining an F-statistic as extreme as the one calculated, assuming the null hypothesis (all coefficients are zero) is true. In this case, the very small p-value (6.224e-08) indicates that the overall model is highly significant.
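As a closing note (a sketch, not part of the walkthrough above): you do not have to build the dummies by hand at all, because lm() dummy-codes factor predictors automatically; relevel() simply controls which level acts as the reference.

# Set "Postgraduate" as the reference level, then let lm()
# create the High School and College dummies internally
my_data$Education_Level <- relevel(factor(my_data$Education_Level),
                                   ref = "Postgraduate")
model2 <- lm(Income ~ Age + Education_Level, data = my_data)
# Coefficients correspond to the manual dummies fitted above
summary(model2)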