How to Identify and Treat Outliers in R

This post explains how to identify and treat outliers in R.

What are Outliers?

Outliers are data points that deviate significantly from the general pattern in a dataset. They may arise from data entry errors, measurement errors, or rare events. Outliers can distort the results of statistical analyses, so it is important to identify and treat them.

How to Identify Outliers?

We can identify outliers using the box plot method, which calculates lower and upper bounds from the first quartile (Q1) and the third quartile (Q3).

```
Lower Bound = Q1 - 1.5 * (Q3 - Q1)
Upper Bound = Q3 + 1.5 * (Q3 - Q1)
```
Any value below the Lower Bound or above the Upper Bound is considered an outlier.
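
As a rough illustration of the two formulas, here is a minimal sketch on a small made-up vector. The vector `x` and the variable names are assumptions for demonstration only, and the quartiles come from R's default `quantile()` method.

```r
# Hypothetical toy vector: nine typical values plus one extreme value
x <- c(12, 15, 14, 10, 13, 16, 11, 14, 15, 60)

q <- quantile(x, c(0.25, 0.75))  # Q1 and Q3 (R's default quantile method)
iqr <- q[[2]] - q[[1]]           # interquartile range, Q3 - Q1

lower_bound <- q[[1]] - 1.5 * iqr
upper_bound <- q[[2]] + 1.5 * iqr

# Values falling outside the two bounds
flagged <- x[x < lower_bound | x > upper_bound]
flagged
```

In this toy vector only the value 60 lies outside the bounds, so it is the only value flagged.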

How to Treat Outliers?

To deal with outliers, we can winsorize extreme values. In simple terms, this means percentile-based capping. Winsorizing at the 1st and 99th percentiles means that values below the 1st percentile are replaced by the value at the 1st percentile, and values above the 99th percentile are replaced by the value at the 99th percentile.
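
To make the idea concrete, the sketch below caps a single made-up vector at its 1st and 99th percentiles. The vector `x` and the names `caps` and `x_capped` are illustrative assumptions, and `pmin()`/`pmax()` is just one possible way to do the capping.

```r
# Hypothetical toy vector: 98 ordinary values with one extreme value at each end
x <- c(-500, 2:99, 1000)

# Values at the 1st and 99th percentiles
caps <- quantile(x, c(0.01, 0.99))

# Raise anything below the lower cap, then lower anything above the upper cap
x_capped <- pmin(pmax(x, caps[[1]]), caps[[2]])

range(x)         # extremes before capping
range(x_capped)  # after capping, the extremes equal the two percentile values
```

Only the two extreme values change; every value already between the two percentiles is left untouched.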

Create Sample DataFrame

Let's create a sample dataframe for demonstration purposes. The following code creates a dataframe named "mydata" which contains 100 rows and 10 columns. We then introduce one outlier in the first column and one in the third column so that the effect is easy to see.

```
# Create a sample dataframe
set.seed(123)  # for reproducibility
mydata <- data.frame(
  matrix(runif(100 * 10, 0, 100), ncol = 10)
)

# Introduce outliers for demonstration purpose
mydata[25, 1] <- 200  # Introduce an outlier in the 1st column
mydata[50, 3] <- -100  # Introduce an outlier in the 3rd column
```

R Function to Identify Outliers

The purpose of the function is to identify and collect outliers from the numeric columns of the input data frame.

```
identify_outliers <- function(data) {
  outliers <- list()

  # Loop over the numeric columns only
  for (i in which(sapply(data, is.numeric))) {
    col_name <- names(data)[i]
    values <- data[[i]]

    # Q1, Q3 and the IQR, ignoring missing values throughout
    perc <- quantile(values, c(.25, .75), na.rm = TRUE)
    iqr <- perc[[2]] - perc[[1]]
    lower_fence <- perc[[1]] - 1.5 * iqr
    upper_fence <- perc[[2]] + 1.5 * iqr

    outlier_indices <- which(values < lower_fence | values > upper_fence)
    outliers[[col_name]] <- values[outlier_indices]
  }

  # Keep only the columns that contain at least one outlier
  outliers2 <- outliers[lengths(outliers) > 0]

  return(outliers2)
}

outliers_vars <- identify_outliers(mydata)
outliers_vars  # print the outliers found in each column
```
Output
```
$X1
[1] 200

$X3
[1] -100
```
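
As an optional cross-check, base R's `boxplot.stats()` applies the same 1.5 × IQR rule and returns the flagged values in its `out` element. The sketch below uses a made-up vector `x`, not the post's data. Note that `boxplot.stats()` works from Tukey's hinges rather than `quantile()`, so borderline values may occasionally be classified differently from the function above.

```r
# Hypothetical vector: 99 evenly spaced values in [0, 100] plus one extreme value
x <- c(seq(0, 100, length.out = 99), 200)

# Values that boxplot.stats() places beyond the whiskers (default coef = 1.5)
flagged <- boxplot.stats(x)$out
flagged
```

For this vector the only flagged value is 200.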

R Function to Treat Outliers

The following function, `pcap` (short for "Percentile Capping"), limits extreme values in the numeric columns of a dataframe based on percentiles. It takes a dataframe (`df`) as input, along with the optional arguments `vars` and `percentiles`.

```
pcap <- function(df, vars = NULL, percentiles = c(.01, .99)) {

  # Default to all numeric columns when no variables are named
  if (is.null(vars)) {
    vars_index <- which(sapply(df, is.numeric))
  } else {
    vars_index <- which(names(df) %in% vars)
  }

  for (i in vars_index) {
    quantiles <- quantile(df[[i]], percentiles, na.rm = TRUE)
    # Raise values below the lower cap, then lower values above the upper cap
    df[[i]] <- pmin(pmax(df[[i]], quantiles[[1]]), quantiles[[2]])
  }

  return(df)
}

# Replacing extreme values with percentiles
myvars <- names(outliers_vars)  # column names where outliers exist
mydata2 <- pcap(mydata, vars = myvars)
```

By default, it caps extreme values at the 1st percentile as the lower limit and the 99th percentile as the upper limit. You can set different percentile limits in the percentiles argument of the function.

Validate Percentile Capping

You can compare the percentile values before and after percentile capping to confirm the treatment of outliers in the dataframe.

```
# Checking percentile values of 1st variable
quantile(mydata[, 1], c(0.99, 1), na.rm = TRUE)
quantile(mydata2[, 1], c(0.99, 1), na.rm = TRUE)
```
```
# Checking percentile values of 3rd variable
quantile(mydata[, 3], c(0, 0.01), na.rm = TRUE)
quantile(mydata2[, 3], c(0, 0.01), na.rm = TRUE)
```
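
Another way to validate the treatment is to count how many values were changed in each column. The helper below, `count_changed`, is a hypothetical name introduced for illustration. It assumes both dataframes have the same shape and column names, and it is demonstrated on a tiny stand-in dataset rather than on the data above.

```r
# Hypothetical helper: number of cells per column that differ between two
# dataframes with identical shape and column names
count_changed <- function(before, after) {
  vapply(names(before), function(nm) {
    sum(before[[nm]] != after[[nm]], na.rm = TRUE)
  }, integer(1))
}

# Small stand-in data: 'after' is 'before' with column a capped at 10
before <- data.frame(a = c(1, 2, 3, 50), b = c(4, 5, 6, 7))
after  <- transform(before, a = pmin(a, 10))

count_changed(before, after)
```

With the objects from this post, the equivalent call would be `count_changed(mydata, mydata2)`. Columns that were not passed to `pcap` should show zero changes.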
How to Customize the Function

Suppose you want to use the 5th and 95th percentiles as capping values.

`pcap(mydata, vars = names(outliers_vars), percentiles = c(.05, .95))`

To perform percentile capping on all the numeric columns of a dataframe, omit the `vars` argument.

`mydata2 = pcap(mydata)`