This tutorial explains when, why and how to standardize a variable in statistical modeling. Variable Standardization is one of the most important concept of predictive modeling. It is a preprocessing step in building a predictive model. Standardization is also called Normalization and Scaling.

The concept of standardization comes into picture when continuous independent variables are measured at different scales. It means these variables do not give equal contribution to the analysis. For example, we are performing customer segmentation analysis in which we are trying to group customers based on their homogenous (similar) attributes. A variable called 'transaction amount' that ranges between $100 and $10000 carries more weightage as compared to a variable i.e. number of transactions that in general ranges between 0 and 30. Hence, it is required to transform the data to comparable scales.

There are main four methods of standardization. They are as follows -

Z score standardization is one of the most popular method to normalize data. In this case, we rescale an original variable to have a

Mathematically, scaled variable would be calculated by subtracting mean of the original variable from raw vale and then divide it by standard deviation of the original variable.

0 0 1 1

In this method, we divide each value by the standard deviation. The idea is to have

In this method, we dividing each value by the range.

var_x2 = 0.08833861

Centering means subtracting a constant value from every value of a variable. The constant value can be average, min or max. Most of the times we use average value to subtract it from every value.

1. It is important to standardize variables before running

2. Prior to

3. It is required to standardize variable before using

4. All SVM kernel methods are based on distance so it is required to scale variables prior to running final

5. It is necessary to standardize variables before using

6. In regression analysis, we can calculate

7. In regression analysis, when an interaction is created from two variables that are not centered on 0, some amount of collinearity will be induced.

8. In regression analysis, it is also helpful to standardize a variable when you include power terms X². Standardization removes collinearity.

1. If you think model performance of linear regression model would improve if you standardize variables,

In the program below, we are first preparing a sample test dataset which is used later for prediction.

In the R script below, we are first storing mean and standard deviation of variables of training dataset in two separate numeric vectors. Later these vectors are used to standardize training dataset.

To standardize validation and test dataset, we can use mean and standard deviation of independent variables from training data. Later we apply them to test dataset using Z-score formula. See the formula below -

In the following code, we are using mean and standard deviation of training data which is used to calculate Z score on test data.

**Standardization / Scaling**The concept of standardization comes into picture when continuous independent variables are measured at different scales. It means these variables do not give equal contribution to the analysis. For example, we are performing customer segmentation analysis in which we are trying to group customers based on their homogenous (similar) attributes. A variable called 'transaction amount' that ranges between $100 and $10000 carries more weightage as compared to a variable i.e. number of transactions that in general ranges between 0 and 30. Hence, it is required to transform the data to comparable scales.

**The idea is to rescale an original variable to have equal range and/or variance.**Standardization / Scaling |

**Methods of Standardization / Normalization**

There are main four methods of standardization. They are as follows -

**1. Z score**

Z score standardization is one of the most popular method to normalize data. In this case, we rescale an original variable to have a

**mean of zero**and

**standard deviation of one**.

Z score |

**R Code : Standardize a variable using Z-score**

# Creating a sample data

set.seed(123)

X =data.frame(k1 = sample(100:1000,1000, replace=TRUE),

k2 = sample(10:100,1000, replace=TRUE))

X.scaled = scale(X, center= TRUE, scale=TRUE)

In scale() function,

It is also called

The formula is shown below -

**center= TRUE**implies subtracting the mean from its original variable. The**scale = TRUE**implies dividing the centered column by its standard deviations.**Check Mean and Variance of Standardized Variable**

colMeans(X.scaled)

**Result :**0 for both k1 and k2

var(X.scaled)

**Result :**1 for both k1 and k2

**Interpretation**

A value of 1 implies that the value for that case is one standard deviation above the mean, while a value of -1 indicates that a case has a value one standard deviations lower than the mean.

**Important Point**

The standardized values do not lie in a particular interval. It can be any real number.

**2. Min-Max Scaling**

It is also called

**0-1 scaling**because the standardized value using this method lies between 0 and 1.The formula is shown below -

x-min(x)/(max(x)-min(x))This method is used to make

**equal ranges**but different means and standard deviations.library(dplyr)

mins= as.integer(summarise_all(X, min))

rng = as.integer(summarise_all(X, function(x) diff(range(x))))

X.scaled = data.frame(scale(X, center= mins, scale=rng))

**Check Min and Max of standardized variables**

summarise_all(X.scaled, funs(min, max))k1_min k2_min k1_max k2_max

0 0 1 1

**3. Standard Deviation Method**

In this method, we divide each value by the standard deviation. The idea is to have

**equal variance**, but different means and ranges.

**Formula :**x/stdev(x)

X.scaled = data.frame(scale(X, center= FALSE , scale=apply(X, 2, sd, na.rm = TRUE)))

**Check Equal Variance**

summarise_all(X.scaled, var)

**Result :**1 for both the variables

**4. Range Method**

In this method, we dividing each value by the range.

**Formula :**x /(max(x) - min(x)). In this case, the means, variances, and ranges of the variables are still different, but at least the ranges are likely to be more similar.

library(dplyr)var_x1 = 0.08614377

rng = as.integer(summarise_all(X, function(x) diff(range(x))))

X.scaled = data.frame(scale(X, center= FALSE, scale=rng))

summarise_all(X.scaled, var)

var_x2 = 0.08833861

**What is Centering?**

Centering means subtracting a constant value from every value of a variable. The constant value can be average, min or max. Most of the times we use average value to subtract it from every value.

X=sample(1:100,1000, replace=TRUE)

scale(X,center = TRUE, scale=FALSE)

By default, scale() function with center=TRUE subtract mean value from values of a variable.

**When it is important to standardize variables?**

1. It is important to standardize variables before running

**Cluster Analysis**. It is because cluster analysis techniques depend on the concept of measuring the distance between the different observations we're trying to cluster. If a variable is measured at a higher scale than the other variables, then whatever measure we use will be overly influenced by that variable.

2. Prior to

**Principal Component Analysis**, it is critical to standardize variables. It is because PCA gives more weightage to those variables that have higher variances than to those variables that have very low variances. In effect the results of the analysis will depend on what units of measurement are used to measure each variable. Standardizing raw values makes equal variance so high weight is not assigned to variables having higher variances.

3. It is required to standardize variable before using

**k-nearest neighbors**with an Euclidean distance measure. Standardization makes all variables to contribute equally.

4. All SVM kernel methods are based on distance so it is required to scale variables prior to running final

**Support Vector Machine**(

**SVM)**model.

5. It is necessary to standardize variables before using

**Lasso and Ridge Regression.**Lasso regression puts constraints on the size of the coefficients associated to each variable. However, this value will depend on the magnitude of each variable. The result of centering the variables means that there is no longer an intercept. This applies equally to ridge regression.

6. In regression analysis, we can calculate

**importance of variables**by ranking independent variables based on the descending order of absolute value of standardized coefficient.

7. In regression analysis, when an interaction is created from two variables that are not centered on 0, some amount of collinearity will be induced.

**Centering first addresses this potential problem**. In simple terms, having non-standardized variables interact simply means that when X1 is big, then X1X2 is also going to be bigger on an absolute scale irrespective of X2, and so X1 and X1X2 will end up correlated.

8. In regression analysis, it is also helpful to standardize a variable when you include power terms X². Standardization removes collinearity.

**When it is not required to standardize variables**

1. If you think model performance of linear regression model would improve if you standardize variables,

**it is absolutely incorrect!**It does not change RMSE, R-squared value, Adjusted R-squared value, p-value of coefficients. See the detailed R script below. It shows standardization does not affect model performance at all.

**Without Standardization -**

# Create Sample Data

set.seed(123)

train <- data.frame(X1=sample(1:100,1000, replace=TRUE),

X2=1e2*sample(1:500,1000, replace=TRUE),

X3=1e-2*sample(1:100,1000, replace=TRUE))

train$y <- with(train,2*X1 + 3*1e-2*X2 - 5*1e2*X3 + 1 + rnorm(1000,sd=10))

#Fit linear regression model

fit <- lm(y~X1+X2+X3,train)

summary(fit)

Call: lm(formula = y ~ X1 + X2 + X3, data = train) Residuals: Min 1Q Median 3Q Max -30.558 -6.456 -0.118 6.654 32.519 Coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 1.216e+00 9.732e-01 1.25 0.212 X1 1.984e+00 1.089e-02 182.19 2e-16 *** X2 3.000e-02 2.188e-05 1371.21 2e-16 *** X3 -4.990e+02 1.070e+00 -466.21 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 9.849 on 996 degrees of freedom Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995 F-statistic: 6.799e+05 on 3 and 996 DF, p-value: < 2.2e-16

**Predict on Test Dataset**

In the program below, we are first preparing a sample test dataset which is used later for prediction.

# create test dataset

set.seed(456)

test <- data.frame(X1=sample(-5:5,100,replace=TRUE),

X2=1e2*sample(-5:5,100, replace=TRUE),

X3=1e-2*sample(-5:5,100, replace=TRUE))

# predict y based on test data without standardization

pred <- predict(fit,newdata=test)

head(cbind(test, pred))

X1 X2 X3 pred 1 -5 -300 0.01 -22.69496 2 -3 -400 0.02 -26.71734 3 3 -100 0.03 -10.80241 4 4 -200 -0.05 28.10335 5 3 300 0.00 16.16938 6 -2 300 -0.04 26.21004

**With Standardization**

In the R script below, we are first storing mean and standard deviation of variables of training dataset in two separate numeric vectors. Later these vectors are used to standardize training dataset.

# Standardize predictors

means <- sapply(train[,1:3],mean)

stdev <- sapply(train[,1:3],sd)

train.scaled <- as.data.frame(scale(train[,1:3],center=means,scale=stdev))

head(train.scaled)

train.scaled$y <- train$y

# Check mean and Variance of Standardized Variables

library(dplyr)

summarise_at(train.scaled, vars(X1,X2,X3), funs(round(mean(.),4)))

summarise_at(train.scaled, vars(X1,X2,X3), var)

**Result :**Mean is 0 and Variance is 1 for all the standardized variables. See the output below -

> summarise_at(train.scaled, vars(X1,X2,X3), funs(round(mean(.),4))) X1 X2 X3 1 0 0 0 > summarise_at(train.scaled, vars(X1,X2,X3), var) X1 X2 X3 1 1 1 1

#Fit Scaled Data

fit.scaled <- lm(y ~ X1 + X2 + X3, train.scaled)

summary(fit.scaled)

Call: lm(formula = y ~ X1 + X2 + X3, data = train.scaled) Residuals: Min 1Q Median 3Q Max -30.558 -6.456 -0.118 6.654 32.519 Coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 598.4244 0.3114 1921.4 2e-16 *** X1 57.0331 0.3130 182.2 2e-16 *** X2 428.6441 0.3126 1371.2 2e-16 *** X3 -145.8587 0.3129 -466.2 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 9.849 on 996 degrees of freedom Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995 F-statistic: 6.799e+05 on 3 and 996 DF, p-value: < 2.2e-16

**Compare Coefficients, R-Squared and Adjusted R-Squared**

The value of coefficients are not same when we run regression analysis with and without standardizing independent variables.It does not mean they are affected by scaling / standardization.The values are different because of these are the slopes - how much the target variable changes if you change independent variable by 1 unit. In other words, standardization can be interpreted as scaling the corresponding slopes.The adjusted r-squared and multiple r-squared value is exactly same.

**How to standardize validation / test dataset**

To standardize validation and test dataset, we can use mean and standard deviation of independent variables from training data. Later we apply them to test dataset using Z-score formula. See the formula below -

Z = (X_test - Xbar_training) / Stdev_training

**R Script - Standardize Test Data**

In the following code, we are using mean and standard deviation of training data which is used to calculate Z score on test data.

test.scaled <- as.data.frame(scale(test,center=means,scale=stdev))

head(test.scaled)

X1 X2 X3 1 -1.921060 -1.768154 -1.688987 2 -1.851484 -1.775153 -1.654774 3 -1.642756 -1.754155 -1.620561 4 -1.607968 -1.761155 -1.894264 5 -1.642756 -1.726157 -1.723200 6 -1.816696 -1.726157 -1.860051

**Compare Prediction - Scaled vs Unscaled**

# predict y based on new data scaled, with fit from scaled dataset

pred.scaled <- predict(fit.scaled,newdata=test.scaled)

# Compare Prediction - unscaled vs. scaled fit

all.equal(pred,pred.scaled)

> head(cbind(pred,pred.scaled),n=10) pred pred.scaled 1 -22.6949619 -22.6949619 2 -26.7173411 -26.7173411 3 -10.8024050 -10.8024050 4 28.1033470 28.1033470 5 16.1693841 16.1693841 6 26.2100374 26.2100374 7 0.2968679 0.2968679 8 -1.7414468 -1.7414468 9 29.2162169 29.2162169 10 8.2025365 8.2025365

**As you can see above both prediction values are exact same.**

**Compare RMSE Score**

# RMSE on train data with un-scaled fitRMSE is same in both the cases

pred_train <- predict(fit,newdata=train)

rmse <- sqrt(mean((train$y - pred_train)^2))

# RMSE on train data with scaled fit

pred_train.scaled <- predict(fit.scaled,newdata=train.scaled)

rmse.scaled <- sqrt(mean((train$y - pred_train.scaled)^2))

# Compare RMSE

all.equal(rmse,rmse.scaled)

**9.829196.**It is because RMSE is associated with scale of Y (target variable). Prediction is also unchanged.

**Interpretation of Standardized Regression Coefficient**

Most of modern statistical softwares automatically produces standardized regression coefficient. It is important metrics to rank predictors. Its interpretation is slightly different from unstandardized estimates. Standardized coefficients are interpreted as the number of standard deviation units Y changes with an increase in one standard deviation in X.

**Correlation with or without Centering / Standardization**The correlation score does not change if you perform correlation analysis on centered and uncentered data.

X=sample(1:100,1000, replace=TRUE)

Y=1e2*sample(1:500,1000, replace=TRUE)

cor(X,Y)

cor(X-mean(X),Y-mean(X))

**Standardization after missing imputation and outlier treatment**

Centering and Scaling data should be done after imputing missing values. It is because the imputation could influence correct center and scale to use. Similarly, outlier treatment should be done prior to standardization.

**Standardize Binary (Dummy) Variables**

- Standardizing binary variables makes interpretation of binary variables vague as it cannot be increased by a standard deviation.
**The simplest solution is :**not to standardize binary variables but code them as 0/1, and then standardize all other continuous variables by dividing by two standard deviation. It would make them approximately equal scale. The standard deviation of both the variables would be approx. 0.5 - Some researchers are in favor of standardizing binary variables as it would make all predictors on same scale. It is a standard practice in penalized regression (lasso). In this case, researchers ignore the interpretation of variables.

**Standardization and Tree Algorithms and Logistic Regression**

Standardization does not affect logistic regression, decision tree and other ensemble techniques such as random forest and gradient boosting.

Useful article. Although you did a good job explaining why and when you might want to standardize a variable, you don't mention what criteria to use for actually selecting a standardizing method. For example, under what circumstances would it be better to use Z score vs. Min/Max?

ReplyDeleteThere is no thumb rule regarding the standardization method. It depends on your dataset. You need to apply methods and see which method works for you. Thanks!

DeleteThis is one of the best tutorial. Thanks for the great effort.

ReplyDeletehow do one standardize variables when the feature variables have different data types, can we go with one method for each feature and still try out different methods on different features, is that a correct option or

ReplyDeletea) use only one method of standardization in a case where different data types are available as part of standardization- say centring-- by subtracting the means - but what if the feature is categorical- can we subtract mode instead or should we follow a common procedure

I am novice in Data Science, Could you please also mention packages which needs to import for doing these calculations.

ReplyDeleteWhat is the need of performing a standardization.If is not a implementing our model performance.

ReplyDeleteCongratulations on your comprehensive and easy-readable post on something so important for any modern data engineer. I would like to add from personal experience that certain monotonic functions(e.g. cubic root) can be used after subtracting each variable's mean over the used sample (always after error correction and imputation) in a linear regression in order to: a) efficiently scale well any outliers b) efficiently compare any measures of different scales

ReplyDeletec) linearly & possibly non-linearly detrend your variable (needed for stationarity assumptions in time series models)

Keep up the good work!

Congratulations on your comprehensive and easy-readable post on something so important for any modern data engineer. I would like to add from personal experience that certain monotonic functions(e.g. cubic root) can be used after subtracting each variable's mean over the used sample (always after error correction and imputation) in a linear regression in order to: a) efficiently scale well any outliers b) efficiently compare any measures of different scales

ReplyDeletec) linearly & possibly non-linearly detrend your variable (needed for stationarity assumptions in time series models)

Keep up the good work!

Normalization won't improve model performance, but it will affect the values of MAE, MSE and RMSE, but not MAPE. See below example of the metrics from the same model with the same observed and predicted values, but with results in dollars (left) and pesos (right). This is the scalability problem of the metrics.

ReplyDeleteMAE 0,383 47,048

MSE 0,247 3741,780

RMSE 0,497 61,170

MAPE 1,33 1,33

Does it matter if the variables that you are scaling are normally distributed or not?

ReplyDeleteWhat if the range is zero for some observations? In the case of the range method for example, the divisor would be zero for these observations. How can we accommodate for this?

ReplyDeleteRange cannot be zero. A variable with zero range is a constant.

ReplyDeleteThank you very much for your blog. Your blog is the most comprehensive and detailed explanation of why scaling is done among the resources I can find on the Internet.

ReplyDelete