This post explains how to identify and treat outliers in R.
What are Outliers?
Outliers are data points that deviate significantly from the general pattern of a dataset. They may result from data entry errors, measurement errors, or rare events. Outliers can distort the results of statistical analyses, so it is important to identify and treat them.
How to Identify Outliers?
We can identify outliers using the box plot method. Here we calculate lower and upper bounds from the first quartile (Q1) and the third quartile (Q3); their difference, Q3 - Q1, is the interquartile range (IQR).
Lower Bound = Q1 - 1.5 * (Q3 - Q1)
Upper Bound = Q3 + 1.5 * (Q3 - Q1)
Any value below the Lower Bound is considered an outlier. Any value above the Upper Bound is considered an outlier.
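To make the rule concrete, here is a minimal sketch that applies the two bounds to a small toy vector. The vector x and its values are made up for illustration, and quantile() is used with its default settings.

```r
# Toy vector (hypothetical values chosen for illustration)
x <- c(10, 12, 13, 15, 16, 18, 19, 21, 95)

# First and third quartiles (default quantile type)
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Values outside the bounds are flagged as outliers
flagged <- x[x < lower_bound | x > upper_bound]
flagged
```

Here the quartiles are 13 and 19, so the bounds are 4 and 28, and only the value 95 is flagged.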
How to Treat Outliers?
To correct the outlier problem, we can winsorize extreme values. In simple words, this means percentile-based data capping. Winsorizing at the 1st and 99th percentiles means that values below the 1st percentile are replaced by the value at the 1st percentile, and values above the 99th percentile are replaced by the value at the 99th percentile.
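As a quick illustration of the idea on a single vector, the sketch below caps a made-up vector at its 1st and 99th percentiles using pmax() and pmin(). The vector x is hypothetical and only serves to show the mechanics.

```r
# Hypothetical vector: the integers 1 to 100 plus two extreme values
x <- c(-500, 1:100, 900)

# Capping limits at the 1st and 99th percentiles
limits <- quantile(x, c(0.01, 0.99), names = FALSE)

# pmax() raises values below the lower limit;
# pmin() lowers values above the upper limit
x_capped <- pmin(pmax(x, limits[1]), limits[2])

range(x)         # range before capping
range(x_capped)  # range after capping equals the two limits
```

Note that capping acts on every value beyond the limits, not only on the two obvious extremes, so a few ordinary values at the edges are pulled in slightly as well.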
Let's create a sample dataframe for demonstration purposes. The following code creates a dataframe named "mydata" with 100 rows and 10 columns, then introduces one outlier in the first column and one in the third column so that the effect is easy to see.
# Create a sample dataframe
set.seed(123) # for reproducibility
mydata <- data.frame(
  matrix(runif(100 * 10, 0, 100), ncol = 10)
)

# Introduce outliers for demonstration purposes
mydata[25, 1] <- 200   # Introduce an outlier in the 1st column
mydata[50, 3] <- -100  # Introduce an outlier in the 3rd column
R Function to Identify Outliers
The following function identifies outliers in each numeric column of the input dataframe and returns them as a named list, keeping only the columns that contain at least one outlier.
identify_outliers <- function(data) {
  outliers <- list()
  for (i in which(sapply(data, is.numeric))) {
    col_name <- names(data)[i]
    x <- data[[i]]  # [[ ]] returns a plain vector, also for tibbles
    perc <- quantile(x, c(.25, .75), na.rm = TRUE)
    iqr <- IQR(x, na.rm = TRUE)  # na.rm avoids an error when the column has NAs
    lower_fence <- perc[1] - 1.5 * iqr
    upper_fence <- perc[2] + 1.5 * iqr
    outlier_indices <- which(x < lower_fence | x > upper_fence)
    outliers[[col_name]] <- x[outlier_indices]
  }
  # Keep only the columns that contain at least one outlier
  outliers[sapply(outliers, length) > 0]
}

outliers_vars <- identify_outliers(mydata)
outliers_vars
Output
$X1
[1] 200

$X3
[1] -100
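As a cross-check, base R's boxplot.stats() applies the same 1.5 * IQR rule. It computes the quartiles as Tukey's hinges rather than with quantile(), so its fences can differ slightly from the ones used above. The sketch below runs it on a made-up vector; the values are illustrative only.

```r
# Hypothetical vector: 0, 5, ..., 100 plus one extreme value
x <- c(seq(0, 100, by = 5), 200)

# $out holds the values lying beyond the whiskers (1.5 * IQR by default)
out <- boxplot.stats(x)$out
out

# Hinges used by boxplot.stats() versus quartiles from quantile()
fivenum(x)[c(2, 4)]
quantile(x, c(.25, .75), names = FALSE)
```

For this vector both approaches flag only the value 200, even though the hinges and the quantile()-based quartiles are not identical.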
R Function to Treat Outliers
The following function pcap, which stands for "Percentile Capping", limits extreme values in the numeric columns of the dataframe based on percentiles. It takes a dataframe (df) as input, along with the optional arguments vars and percentiles.
pcap <- function(df, vars = NULL, percentiles = c(.01, .99)) {
  # Default to all numeric columns when no variable names are supplied
  if (is.null(vars)) {
    vars_index <- which(sapply(df, is.numeric))
  } else {
    vars_index <- which(names(df) %in% vars)
  }
  for (i in vars_index) {
    x <- df[[i]]  # [[ ]] returns a plain vector, also for tibbles
    quantiles <- quantile(x, percentiles, na.rm = TRUE)
    x <- ifelse(x < quantiles[1], quantiles[1], x)  # cap low values
    x <- ifelse(x > quantiles[2], quantiles[2], x)  # cap high values
    df[[i]] <- x
  }
  return(df)
}

# Replacing extreme values with percentiles
myvars = names(outliers_vars) # column names where outliers exist
mydata2 = pcap(mydata, vars = myvars)
By default, it caps extreme values at the 1st percentile as the lower limit and the 99th percentile as the upper limit. You can set different percentile limits in the percentiles argument of the function.
You can compare the percentile values before and after percentile capping to confirm the treatment of outliers in the dataframe.
# Checking percentile values of the 1st variable
quantile(mydata[, 1], c(0.99, 1), na.rm = TRUE)
quantile(mydata2[, 1], c(0.99, 1), na.rm = TRUE)

# Checking percentile values of the 3rd variable
quantile(mydata[, 3], c(0, 0.01), na.rm = TRUE)
quantile(mydata2[, 3], c(0, 0.01), na.rm = TRUE)
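Another quick check is to count how many values the capping step changed in each column. The sketch below uses two small made-up dataframes, before and after, as stand-ins for mydata and mydata2; their names and values are hypothetical.

```r
# Hypothetical before/after dataframes standing in for mydata and mydata2
before <- data.frame(a = c(1, 2, 3, 400), b = c(-90, 5, 6, 7))
after  <- data.frame(a = c(1, 2, 3, 50),  b = c(0, 5, 6, 7))

# Count, per column, how many values differ after capping
changed <- colSums(before != after, na.rm = TRUE)
changed
```

With the real data, the same comparison would be colSums(mydata != mydata2), assuming both dataframes have the same columns in the same order.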
Suppose you want to set 5th and 95th percentiles as capping values.
pcap(mydata, vars = names(outliers_vars), percentiles = c(.05, .95))
To perform percentile capping on all the numeric columns of a dataframe, omit the vars argument.
mydata2 = pcap(mydata)
Hey thanks for your post. I tried your code but it gave an error. I am trying to pass a data frame as an argument and winsorise each column. I copied your code and the following error was displayed:
Error in check_names_df(j, x) : object 'i' not found
Any help would be appreciated. Thanks
Thanks for the code!!.... it worked well and helped me solving a great mess.