This article talks about how we can correct multicollinearity problem with correlation matrix.
In caret package, there is a function called findCorrelation that helps to identify correlated variables.
How it works -
The absolute values of pair-wise correlations are considered. If some variables have a high correlation, the function looks at the mean absolute correlation of each variable and keeps only the variable with the smallest mean absolute correlation and remove the larger absolute correlation.Example - Correlation Matrix
X1 | X2 | X3 | X4 | X5 | |
---|---|---|---|---|---|
X1 | 1.00 | 0.95 | 0.89 | 0.85 | 0.10 |
X2 | 0.95 | 1.00 | 0.85 | 0.81 | 0.09 |
X3 | 0.89 | 0.85 | 1.00 | 0.78 | 0.10 |
X4 | 0.85 | 0.81 | 0.78 | 1.00 | 0.09 |
X5 | 0.10 | 0.09 | 0.10 | 0.09 | 1.00 |
Variables to remove from X1 to X4 cluster - "X1" "X2" "X3" as they have larger mean absolute correlation than X4.
R Code -
# Identifying numeric variables
numericData <- dat2[sapply(dat2, is.numeric)]
# Calculate correlation matrix
descrCor <- cor(numericData)
# find attributes that are highly corrected
highlyCorrelated <- findCorrelation(descrCor, cutoff=0.7)
HI Deepanshu, I was trying to understand your explanation and I wanted to ask you what do you mean with mean absolute correlation?
ReplyDelete