# Linear Regression with SAS

Linear regression helps us understand the relationship between two or more variables. It is called "linear" because it assumes that the relationship between the variables is linear.

Linear regression explains how one variable (called the "dependent variable") changes in response to changes in another variable or set of variables (called the "independent variables").

## Simple Linear Regression

The simple linear regression model is a linear equation of the following form:

`y = a + bx`
• y: Dependent variable. It's the outcome we want to predict.
• a: Intercept. It is the value of y when x = 0.
• b: Slope or Coefficient. For each unit increase in x, y changes by the amount represented by the slope.
• x: Independent variable. It's the variable that explains the changes in the dependent variable y. We use it to make predictions or analyze the relationship with y.

### Example: Simple Linear Regression

Suppose you want to see the relationship between the amount of time a person spends studying and their exam scores. By using linear regression, you can understand how much impact study time has on exam scores.

`Exam Score = a + b(Study Hours)`

Here the slope (b) shows how much the exam score is expected to increase (or decrease) for every additional hour of study.

## Multiple Linear Regression

The multiple linear regression model is a linear equation of the following form:

`y = b0 + b1x1 + b2x2 + ... + bnxn`
• y: Dependent variable. It's the outcome we want to predict.
• b0: Intercept. It is the value of y when all independent variables are 0.
• b1, b2, ..., bn: Coefficients (Slopes). Each b represents the change in y for a one-unit increase in the corresponding independent variable x1, x2, ..., xn, holding the other independent variables constant.
• x1, x2, ..., xn: Independent variables. They are the variables that explain the changes in the dependent variable y. We use them to make predictions or analyze the relationship with y.

### Example: Multiple Linear Regression

Suppose you want to understand the factors that influence housing prices. You have collected data on various variables such as square footage, number of bedrooms, and distance from the city center. By using multiple linear regression, you can predict how these variables collectively impact the housing prices.

`Housing Price = b0 + b1(Square Footage) + b2(Number of Bedrooms) + b3(Distance from City Center)`

Here the coefficients b1, b2, and b3 show how much the housing price is expected to increase (or decrease) for a one-unit increase in square footage, number of bedrooms, and distance from the city center, respectively, with the other variables held constant.

## Sample SAS Dataset

Let's create a sample dataset for the examples in this article. The following code creates a dataset containing 3 variables and 30 observations. The variables SqFoot, Distance, and Price represent square footage, distance from the city center, and housing price, respectively.

```sas
/* Sample data: square footage, distance from city center, and housing price */
data HousingPrices;
input SqFoot Distance Price;
datalines;
1500 10 300000
2000 15 350000
1800 12 320000
2200 18 380000
2400 20 400000
1700 8 290000
1750 9 295000
2100 17 370000
1900 11 330000
2300 16 390000
1950 11 335000
2050 14 355000
2250 17 375000
1850 10 315000
2150 15 360000
1850 9 310000
1650 7 280000
1950 11 335000
2150 18 380000
2350 21 410000
2050 14 350000
1750 6 285000
2000 10 325000
2200 19 395000
1850 12 330000
2250 20 405000
1900 11 325000
2400 18 395000
1950 13 335000
1850 8 305000
;
run;
```
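
Before fitting a model, it can help to confirm that the data loaded as expected. The following is a minimal sketch, assuming the `HousingPrices` dataset above has been created in the current SAS session. It summarizes the three variables and plots price against square footage with a fitted line.

```sas
/* Summary statistics: N should be 30 for each variable */
proc means data=HousingPrices n min mean max;
  var SqFoot Distance Price;
run;

/* Scatter plot of Price versus SqFoot with a least-squares fit line */
proc sgplot data=HousingPrices;
  reg x=SqFoot y=Price;
run;
```

If the points fall roughly along the fitted line, a linear model is a reasonable starting point for this data.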

## How to Calculate Linear Regression in SAS

SAS offers several procedures for building a linear regression model:

1. PROC REG
2. PROC GLM
3. PROC GLMSELECT

### PROC REG : Linear Regression

The following code uses PROC REG to build a linear regression model. The variable "Price" is the dependent variable, and "SqFoot" and "Distance" are the independent (predictor) variables.

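```sas
/* Save the parameter estimates table to a dataset named "estimates" */
ods output ParameterEstimates = estimates;
proc reg data=HousingPrices;
model Price = SqFoot Distance;
run;
quit; /* PROC REG is interactive; QUIT ends the procedure */
```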

The ODS OUTPUT statement saves the ParameterEstimates table produced by the procedure as a SAS dataset. Here the dataset is called "estimates".
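
To inspect the saved estimates, the dataset can be printed. This is a small sketch that assumes the PROC REG step above has already run in the same session, so that `WORK.estimates` exists. The column names used here (`Variable`, `Estimate`, `StdErr`, `tValue`, `Probt`) are the ones PROC REG writes to its ParameterEstimates table.

```sas
/* Print the coefficient table saved by ODS OUTPUT */
proc print data=estimates noobs;
  var Variable Estimate StdErr tValue Probt;
run;
```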

#### Interpret SAS Output of Linear Regression Model

##### Analysis of Variance Table

The p-value of the overall F-test is <.0001. Because this p-value is less than 0.05, the regression model as a whole is statistically significant.

##### Model Fit Table

R-Square tells us the proportion of variation in housing prices that is explained by square footage and distance from the city center. Adjusted R-squared is usually a better basis for comparing models than R-squared because it penalizes each additional predictor, so it does not rise simply because more variables were added.

The higher the adjusted R-squared value, the better the independent variables explain the dependent variable. In this example the value is 0.9823, meaning about 98.23% of the variation in housing prices is explained by square footage and distance from the city center.
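
For reference, the standard relationship between the two statistics can be written as follows. The symbols are introduced here for illustration: n is the number of observations (30 in the sample dataset) and p is the number of predictors (2 in this model).

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

With n = 30 and p = 2 the factor (n - 1)/(n - p - 1) is 29/27, so adjusted R-squared is always slightly below R-squared unless the fit is perfect.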

##### Parameter Estimates Table

The regression equation can be formed from the parameter estimates:

`Housing Price = 139989 + 63.65716*(square footage) + 5769.28237*(distance from city center)`

The p-values for both independent variables are less than 0.05, which means both are statistically significant.
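
To see how the equation is used, the sketch below plugs one house into it by hand. The input values (2000 square feet, distance 15) are chosen for illustration and match one row of the sample data. The dataset name `predict_example` and variable `PredictedPrice` are hypothetical.

```sas
/* Apply the fitted equation to a single illustrative house */
data predict_example;
  SqFoot = 2000;
  Distance = 15;
  PredictedPrice = 139989 + 63.65716*SqFoot + 5769.28237*Distance;
  format PredictedPrice dollar12.;
run;

proc print data=predict_example noobs;
run;
```

The arithmetic gives 139989 + 127314.32 + 86539.24, or roughly 353,843, compared with an observed price of 350,000 for that row.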

### PROC GLM : Linear Regression

PROC GLM has many similarities with PROC REG when building a regression model. However, PROC REG is specialized for linear regression with one or more continuous independent variables, whereas PROC GLM can handle both categorical and continuous independent variables through its CLASS statement.

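```sas
/* Save the parameter estimates table to a dataset named "estimates" */
ods output ParameterEstimates = estimates;
proc glm data=HousingPrices;
model Price = SqFoot Distance;
run;
quit; /* PROC GLM is interactive; QUIT ends the procedure */
```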

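To understand the output of the linear regression model, refer to the interpretation of the PROC REG output above. The Analysis of Variance, fit statistics, and parameter estimates are read the same way.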
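
The sample data has no categorical variable, so the sketch below derives a hypothetical one to show how PROC GLM handles it. The variable `Zone`, the dataset `HousingZones`, and the cut-off of 12 are assumptions made only for this illustration.

```sas
/* Derive a hypothetical two-level categorical variable from Distance */
data HousingZones;
  set HousingPrices;
  length Zone $4;
  if Distance > 12 then Zone = 'Far';
  else Zone = 'Near';
run;

/* CLASS declares Zone as categorical; SOLUTION prints the coefficients */
proc glm data=HousingZones;
  class Zone;
  model Price = SqFoot Zone / solution;
run;
quit;
```

PROC GLM builds the indicator columns for `Zone` itself, which is the step that would have to be done by hand before using PROC REG.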

### PROC GLMSELECT : Linear Regression

The benefits of using PROC GLMSELECT over PROC REG and PROC GLM for building a linear regression model are as follows:

1. Handling categorical and continuous variables: PROC GLMSELECT supports categorical variables through the CLASS statement, whereas PROC REG does not have a CLASS statement.
2. Automated variable selection: PROC GLMSELECT supports the BACKWARD, FORWARD, and STEPWISE variable selection techniques, whereas PROC GLM does not support them.

```sas
ods output ParameterEstimates = estimates;
proc glmselect data=HousingPrices;
model Price = SqFoot Distance;
run;
```

#### How to Use Categorical Variables in Linear Regression in SAS

The CLASS statement is used to declare categorical independent variables.

```sas
PROC GLMSELECT data=dataset-name;
class categorical_variable / param=ref order=data;
/* SELECT= is a suboption of SELECTION= and must sit inside the parentheses */
model dependent_variable = variable1  categorical_variable
/ selection=stepwise(select=SL) showpvalues stats=all STB;
run;
```

• selection=stepwise: This indicates that the stepwise variable selection method will be used to determine the significant predictors in the model.
• select=SL: A suboption of selection= that makes the significance level (SL) the criterion for adding and removing variables during the stepwise selection process.
• showpvalues: This option displays the p-values for the selected variables in the output.
• stats=all: This displays all available model fit statistics (such as adjusted R-square, AIC, and SBC) in the fit and selection summaries.
• STB: This option requests the standardized regression coefficients to be displayed in the output.
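
The options above can be tried on the sample data. The sketch below assumes the `HousingPrices` dataset from earlier in this article. The thresholds of 0.05 passed to the `slentry=` and `slstay=` suboptions are illustrative choices, not recommendations.

```sas
/* Stepwise selection driven by significance levels,
   with explicit entry and stay thresholds */
proc glmselect data=HousingPrices;
  model Price = SqFoot Distance
        / selection=stepwise(select=sl slentry=0.05 slstay=0.05)
          showpvalues stats=all stb;
run;
```

With only two candidate predictors the selection summary is short, but it shows the step at which each variable enters the model.
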

#### How to Generate Predicted Values using Validation Dataset

The statement `score data=validation out=pred_val;` generates predicted values for the observations in the "validation" dataset and stores them in a dataset named "pred_val".

```sas
PROC GLMSELECT data=dataset-name;
class categorical_variable / param=ref order=data;
model dependent_variable = variable1  categorical_variable
/ selection=stepwise(select=SL) showpvalues stats=all STB;
/* Apply the selected model to the validation data */
score data = validation out= pred_val;
run;
```
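
A runnable version of this pattern on the sample data might look like the sketch below. The split rule (every fifth row goes to validation, giving 6 validation and 24 training rows) and the names `train`, `PredPrice`, and `ResidPrice` are assumptions made for illustration.

```sas
/* Hold out every fifth observation as a simple validation set */
data train validation;
  set HousingPrices;
  if mod(_n_, 5) = 0 then output validation;
  else output train;
run;

/* Fit on the training rows, then score the held-out rows */
proc glmselect data=train;
  model Price = SqFoot Distance / selection=stepwise(select=sl);
  score data=validation out=pred_val predicted=PredPrice residual=ResidPrice;
run;

proc print data=pred_val noobs;
run;
```

Because the validation rows contain the actual `Price`, the residual column shows how far each prediction is from the observed value.
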

#### How to Select the Best Linear Regression Model

The CHOOSE= option picks, from the models fitted at each step of the selection process, the one with the best value of the specified criterion (here, adjusted R-square).

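```sas
PROC GLMSELECT data=dataset-name;
class categorical_variable / param=ref order=data;
/* CHOOSE=ADJRSQ keeps the step with the highest adjusted R-square */
model dependent_variable = variable1  categorical_variable
/ selection=stepwise(select=SL choose=adjrsq) showpvalues stats=all STB;
run;
```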

## How to Check and Fix Assumptions of Linear Regression

The assumptions of linear regression are as follows:

### Errors (Residuals) should be normally distributed

The residuals should follow a normal distribution.

• Check: Create a histogram or Q-Q plot of the residuals. The histogram should look roughly bell-shaped, and the points on the Q-Q plot should fall close to a straight line.
• Fix: If the residuals are not normally distributed, consider transforming the dependent variable.

```sas
PROC GLMSELECT data=dataset-name;
class categorical_variable / param=ref order=data;
model dependent_variable = variable1  categorical_variable;
/* Save predicted values and residuals for the normality check */
output out=stdres p= predict r = resid;
run;

proc univariate data=stdres normal;
var resid;
run;
```

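The `NORMAL` option in PROC UNIVARIATE requests tests for normality (such as Shapiro-Wilk and Kolmogorov-Smirnov) on the residuals. It does not draw a plot by itself; a histogram or Q-Q plot requires adding a HISTOGRAM or QQPLOT statement to the procedure.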
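
The sketch below runs the same check on the sample data and adds the plots. It assumes the `HousingPrices` dataset from earlier and uses PROC REG to produce the residuals, which is a substitution for the GLMSELECT template above. The names `stdres`, `predict`, and `resid` follow that template.

```sas
/* Fit the model and save predicted values and residuals */
proc reg data=HousingPrices noprint;
  model Price = SqFoot Distance;
  output out=stdres p=predict r=resid;
run;
quit;

/* Normality tests plus a histogram and Q-Q plot of the residuals */
proc univariate data=stdres normal;
  var resid;
  histogram resid / normal;
  qqplot resid / normal(mu=est sigma=est);
run;
```

Large p-values in the "Tests for Normality" table, together with Q-Q points close to the reference line, are consistent with normally distributed residuals.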

### Homoscedasticity

Homoscedasticity means the variability of the residuals should be constant across all levels of the predictors.

• Check: Plot the residuals against the predicted values; residuals should be evenly spread around zero, with no cone-like shape.
• Fix: If heteroscedasticity is present, consider transforming the dependent variable or using weighted least squares.

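```sas
/* ARCHTEST is a MODEL option of PROC AUTOREG (SAS/ETS), not PROC GLMSELECT. */
/* The CLASS statement in PROC AUTOREG may require a recent SAS/ETS release. */
proc autoreg data=dataset-name;
class categorical_variable;
model dependent_variable = variable1  categorical_variable / archtest;
output out=r r=yresid;
run;
```

Note: Check the p-values of the Q statistics and LM tests. P-values greater than .05 mean the test finds no evidence of heteroscedasticity. This test looks for residual variance that changes with the order of the observations, so it is most informative for time-ordered data.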
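
For data that is not time-ordered, a residual plot together with White's test is a common alternative. The sketch below uses the `SPEC` option of PROC REG, which is a different test from ARCHTEST, and assumes the `HousingPrices` dataset from earlier in this article.

```sas
/* Residual-by-predicted plot plus White's specification test (SPEC) */
proc reg data=HousingPrices plots(only)=residualbypredicted;
  model Price = SqFoot Distance / spec;
run;
quit;
```

A p-value above .05 in the "Test of First and Second Moment Specification" table, along with a plot showing no funnel shape, is consistent with constant residual variance.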

### Linearity

The relationship between the dependent variable and each independent variable should be approximately linear.

• Check: Plot the residuals against each predictor variable; residuals should show a random scatter pattern around zero.
• Fix: Consider transforming variables or using polynomial terms to achieve linearity.
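
Both the check and one possible fix can be sketched on the sample data, assuming the `HousingPrices` dataset from earlier. The quadratic term for `SqFoot` is only an illustration of a polynomial term, not a claim that this data needs one.

```sas
/* Check: panels of residuals plotted against each predictor */
proc reg data=HousingPrices plots(only)=residuals;
  model Price = SqFoot Distance;
run;
quit;

/* Possible fix: add a squared term for a predictor that shows curvature */
proc glm data=HousingPrices;
  model Price = SqFoot SqFoot*SqFoot Distance / solution;
run;
quit;
```

A curved band in a residual panel suggests that the corresponding predictor needs a transformation or a higher-order term.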

### Multicollinearity

Predictor variables should not be highly correlated with each other.

• Check: Calculate the variance inflation factor (VIF) for each predictor. As a rule of thumb, VIF values above 5 (or, by a looser convention, above 10) indicate potential multicollinearity.
• Fix: If multicollinearity is detected, consider removing or combining correlated predictors.
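
In PROC REG the VIF is requested as a MODEL option. The sketch below assumes the `HousingPrices` dataset from earlier and also prints the correlation between the two predictors for context.

```sas
/* VIF and tolerance appear as extra columns in the parameter estimates table */
proc reg data=HousingPrices;
  model Price = SqFoot Distance / vif tol;
run;
quit;

/* Pairwise correlation between the predictors */
proc corr data=HousingPrices;
  var SqFoot Distance;
run;
```

Tolerance is the reciprocal of VIF, so a small tolerance and a large VIF point to the same problem.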

### Outliers

Identify extreme data points that may exert a significant influence on the model.

• Check: Plot the standardized residuals against the predicted values; look for points that deviate substantially from the others.
• Fix: If outliers are influential, consider excluding them or applying robust regression methods.
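
A minimal sketch of this check on the sample data follows, assuming the `HousingPrices` dataset from earlier. The cut-offs used to flag rows (an absolute studentized residual above 2, or Cook's D above 4/n with n = 30) are common rules of thumb rather than fixed standards, and the names `diag`, `flagged`, `rstud`, and `cooksd` are hypothetical.

```sas
/* Diagnostic plots, plus studentized residuals and Cook's D saved per row */
proc reg data=HousingPrices plots(only)=(rstudentbypredicted cooksd);
  model Price = SqFoot Distance;
  output out=diag p=predict rstudent=rstud cookd=cooksd;
run;
quit;

/* Keep only the rows that exceed either rule-of-thumb cut-off */
data flagged;
  set diag;
  if abs(rstud) > 2 or cooksd > 4/30;
run;

proc print data=flagged noobs;
  var SqFoot Distance Price predict rstud cooksd;
run;

/* One robust alternative: M-estimation with PROC ROBUSTREG */
proc robustreg data=HousingPrices method=m;
  model Price = SqFoot Distance;
run;
```

Comparing the PROC ROBUSTREG coefficients with the ordinary least squares estimates gives a quick sense of how much the flagged rows influence the fit.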