This tutorial explains how to explore data with **PROC UNIVARIATE**. It is one of the most powerful SAS procedures for running descriptive statistics and for checking important assumptions of statistical techniques, such as normality and the absence of outliers. Despite these features, PROC UNIVARIATE is less popular than **PROC MEANS**. Most SAS analysts are comfortable running PROC MEANS for summary statistics such as count, mean, median and number of missing values. In reality, PROC UNIVARIATE surpasses PROC MEANS in the options it supports. The main differences between the two procedures are listed below.

**PROC UNIVARIATE vs. PROC MEANS**

1. **PROC MEANS** can calculate a fixed set of percentile points such as the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 99th percentiles, but it cannot calculate arbitrary **custom percentiles** such as the 97.5th or 99.5th percentile, whereas **PROC UNIVARIATE** can.

2. **PROC UNIVARIATE** can calculate **extreme observations**, that is, the five lowest and five highest values, whereas **PROC MEANS** can only calculate the MIN and MAX values.

3. **PROC UNIVARIATE** supports **normality tests** to check for a normal distribution, whereas **PROC MEANS** does not support normality tests.

4. **PROC UNIVARIATE** generates multiple **plots** such as histograms, box plots and stem-and-leaf diagrams, whereas **PROC MEANS** does not support graphics.

**Tutorial :** **PROC MEANS with Examples**

**Basic PROC UNIVARIATE Code**

In the example below, we use the **sashelp.shoes** dataset. **SALES** is the numeric (or measured) variable.

proc univariate data = sashelp.shoes;

var sales;

run;

**Default Output of PROC UNIVARIATE**

**1. Moments :** Count, Mean, Standard Deviation, Sum etc.

**2. Basic Statistics :** Mean, Median, Mode etc.

Default Output : Part I

**3. Tests for Location :** one-sample t-test, Sign test, Signed Rank test.

**4. Percentiles (Quantiles)**

**5. Extreme Observations -** the five smallest and five largest values along with their row positions.

Default Output : Part II

**1. Analysis of Sales by Region**

Suppose you are asked to calculate basic statistics of sales by region. In this case, region is a grouping (or categorical) variable. The **CLASS statement** is used to define the categorical variable.

proc univariate data = sashelp.shoes;

var sales;

class region;

run;

*See the output shown below -*

PROC UNIVARIATE Class Statement

*Similar output is generated for the other regions - Asia, Canada, Eastern Europe, Middle East etc.*

**2. Generating only Percentiles in Output**

Suppose you want only percentiles to appear in the output window. By default, **PROC UNIVARIATE** creates five output tables : **Moments, BasicMeasures, TestsForLocation, Quantiles, and ExtremeObs**. The ODS SELECT statement can be used to display only the tables you name. **Quantiles** is the standard table name that **PROC UNIVARIATE** uses for percentiles, which is the table we want. ODS stands for Output Delivery System.

ods select Quantiles;

proc univariate data = sashelp.shoes;

var sales;

class region;

run;

**How to find the table names generated by a SAS procedure**

The **ODS TRACE ON** statement writes the name and label of each table that a SAS procedure generates to the log window.

ods trace on;

proc univariate data = sashelp.shoes;

var sales;

run;

ods trace off;

**How to write Percentile Information to a SAS Dataset**

The **ODS OUTPUT** statement is used to write output from the results window to a SAS dataset. In the code below, **temp** is the name of the dataset that stores all the percentile information.

ods output Quantiles = temp;

proc univariate data = sashelp.shoes;

var sales;

class region;

run;

ods output close;

**3. Calculating Extreme Values**

Just as we generated percentiles in the previous example, we can generate extreme values with the **ExtremeObs** table. The **ODS OUTPUT** statement tells SAS to write the extreme values information to a dataset named **outlier**. **ExtremeObs** is the standard table name that PROC UNIVARIATE uses for extreme values.

ods output extremeobs = outlier;

proc univariate data = sashelp.shoes;

var sales;

class region;

run;

ods output close;
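To make clear what the extreme observations table holds, here is a minimal sketch in Python (not SAS) that picks the five lowest and five highest values together with their row positions. The function name `extreme_observations` and the sales figures are hypothetical and are not taken from sashelp.shoes.

```python
def extreme_observations(values, n_extreme=5):
    """Return the n_extreme lowest and highest values with 1-based row positions."""
    # Pair each value with its 1-based row position (the observation number).
    indexed = [(value, position) for position, value in enumerate(values, start=1)]
    # Sort by value; ties are ordered by row position.
    ordered = sorted(indexed)
    lowest = ordered[:n_extreme]
    highest = ordered[-n_extreme:]
    return lowest, highest


# Hypothetical sales figures, used only for illustration.
sales = [120, 45, 980, 310, 15, 760, 88, 5400, 230, 67, 19, 1500]
lowest, highest = extreme_observations(sales)
print("Lowest :", lowest)
print("Highest:", highest)
```

Each pair is (value, row position), which mirrors the value and observation number columns that the extreme observations table reports for each tail.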

**4. Checking Normality**

Many statistical techniques assume that the data is normally distributed. It is important to check this assumption before running a model.

**There are multiple ways to check Normality :**

- Plot Histogram and see the distribution
- Calculate Skewness
- Normality Tests

**I. Plot Histogram**

A histogram shows visually whether data is normally distributed. It also helps to check whether there are any outliers.

proc univariate data=sashelp.shoes NOPRINT;

var sales;

HISTOGRAM / NORMAL(COLOR=RED);

run;

**II. Skewness**

Skewness is a measure of the degree of asymmetry of a distribution. If skewness is close to 0, the distribution is approximately symmetric, as a normal distribution is.

Skewness

Positively skewed data has a few extremely large values that pull the mean upward. It is also called right skewed.

Negatively skewed data has a few extremely small values that pull the mean downward. It is also called left skewed.

Positive Skewness : If skewness > 0, data is positively skewed. Another way to see positive skewness : the mean is typically greater than the median, and the median is greater than the mode.

Negative Skewness : If skewness < 0, data is negatively skewed. Another way to see negative skewness : the mean is typically less than the median, and the median is less than the mode.

**Rule :**

- If skewness < −1 or > +1, the distribution is highly skewed.
- If skewness is between −1 and −0.5 or between 0.5 and +1, the distribution is moderately skewed.
- If skewness > −0.5 and < 0.5, the distribution is **approximately symmetric**, which is consistent with a **normal** shape.
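The rule of thumb above can be sketched in a few lines of Python (not SAS). This is only an illustration: the function names are hypothetical, and the skewness formula is the usual bias-adjusted sample skewness, n/((n-1)(n-2)) times the sum of cubed standardized deviations, which is assumed here to correspond to the value shown in the Moments table.

```python
import math


def sample_skewness(values):
    """Bias-adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)**3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    mean = sum(values) / n
    # Sample standard deviation with the n - 1 divisor.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


def describe_skewness(skewness):
    """Apply the rule of thumb: beyond 1 is high, 0.5 to 1 is moderate."""
    if abs(skewness) > 1:
        return "highly skewed"
    if abs(skewness) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"


# Hypothetical data: one very large value creates a long right tail.
right_skewed = [1, 2, 3, 4, 100]
value = sample_skewness(right_skewed)
print(round(value, 3), describe_skewness(value))
```

A single large value is enough to push the skewness well above 1 in this small sample, which is why the histogram and the skewness figure are worth reading together.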

ods select Moments;

proc univariate data = sashelp.shoes;

var sales;

run;

Skewness and Normality

**Since skewness is greater than 1, the data is highly skewed and non-normal.**

**III. Normality Tests**

The **NORMAL** keyword tells SAS to generate normality tests.

ods select TestsforNormality;

proc univariate data = sashelp.shoes normal;

var sales;

run;

Tests for Normality

The two main tests for normality are as follows :

**1. Shapiro Wilk Test [Sample Size <= 2000]**

The null hypothesis states that the distribution is normal.

In the example above, the p-value is less than 0.05, so we reject the null hypothesis. It implies the distribution is not normal. If the p-value is greater than 0.05, we fail to reject the null hypothesis, which means the data is consistent with a normal distribution.

This test performs well for small samples, up to a sample size of 2000.

**2. Kolmogorov-Smirnov Test [Sample Size > 2000]**

In this test, the null hypothesis states that the data is normally distributed. If the p-value is greater than 0.05, we fail to reject normality. In the example above, the p-value is less than 0.05, which means the data is not normal. This test can handle sample sizes greater than 2000.

**5. Calculate Custom Percentiles**

With the **PCTLPTS=** option, we can calculate custom percentiles. Suppose you need to generate the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th and 100th percentiles.

proc univariate data = sashelp.shoes noprint;

var sales;

output out = temp

pctlpts = 10 to 100 by 10 pctlpre = p_;

run;

The **OUTPUT OUT=** statement tells SAS to save the percentile information in the **TEMP** dataset. The **PCTLPRE=** option adds a prefix to the names of the variables that contain the **PCTLPTS=** percentiles.

Suppose you want to calculate the 97.5th and 99.5th percentiles.

proc univariate data = sashelp.shoes noprint;

var sales;

output out = temp

pctlpts = 97.5,99.5 pctlpre = p_;

run;
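For readers who want to see how a custom percentile can be derived by hand, here is a minimal Python sketch (not SAS). It assumes the "empirical distribution function with averaging" rule, which is understood to be the default percentile definition in PROC UNIVARIATE; other definitions give slightly different answers. The function name `percentile` and the data are hypothetical.

```python
import math
from fractions import Fraction


def percentile(values, pct):
    """Percentile using the empirical distribution function with averaging."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no observations")
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    # Exact arithmetic avoids floating-point surprises for points like 97.5.
    position = Fraction(str(pct)) * n / 100
    j = math.floor(position)   # integer part of n * p
    g = position - j           # fractional part of n * p
    if g > 0:
        # Not a whole number: take the next ordered value, x(j+1) in 1-based terms.
        return ordered[j]
    # Whole number: average x(j) and x(j+1), clamped at the two ends.
    lower = ordered[max(j - 1, 0)]
    upper = ordered[min(j, n - 1)]
    return (lower + upper) / 2


# Hypothetical sales values, used only for illustration.
sales = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
for pct in (10, 50, 97.5, 100):
    print(pct, percentile(sales, pct))
```

With only ten values, the 97.5th percentile lands on the largest observation, which is a reminder that very high percentiles need a reasonably large sample to be informative.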

**6. Calculate Winsorized and Trimmed Means**

The Winsorized and Trimmed Means are much less sensitive to outliers than the ordinary mean. They should be reported rather than the mean when the data is highly skewed.

**Trimmed Mean :** Remove the extreme values and then calculate the mean of the remaining values. A 10% trimmed mean drops the values below the 10th percentile and above the 90th percentile before averaging.

**Winsorized Mean :** Cap the extreme values at the kth percentile level and then calculate the mean. It is the same as the trimmed mean except that, instead of removing the extreme values, we cap them at the kth percentile level.
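The difference between removing and capping is easier to see on a tiny example. The Python sketch below (not SAS) works with `k`, the number of values affected in each tail, so 2 out of 10 values corresponds to 20% per tail. The function names and the data are hypothetical.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    n = len(ordered)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n / 2")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, k):
    """Mean after capping the k smallest and k largest values."""
    ordered = sorted(values)
    n = len(ordered)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n / 2")
    # The k lowest values are raised to the (k+1)th smallest value,
    # and the k highest values are lowered to the (k+1)th largest value.
    low, high = ordered[k], ordered[n - k - 1]
    capped = [min(max(x, low), high) for x in ordered]
    return sum(capped) / n


# Hypothetical data with one large outlier.
sales = [1, 4, 5, 6, 7, 8, 9, 12, 11, 200]
print("Mean           :", sum(sales) / len(sales))
print("Trimmed mean   :", trimmed_mean(sales, 2))
print("Winsorized mean:", winsorized_mean(sales, 2))
```

The ordinary mean is dragged far above the bulk of the data by the single value of 200, while the trimmed and Winsorized means stay close to the middle of the distribution.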

**Winsorized Mean**

In the example below, we calculate the 20% Winsorized mean.

ods select winsorizedmeans;

ods output winsorizedmeans=means;

proc univariate winsorized = 0.2 data=sashelp.shoes;

var sales;

run;

Winsorized Means

**Percent Winsorized in Tail :** 20% of values winsorized from each tail (upper and lower side)

**Number Winsorized in Tail :** 79 values winsorized from each tail

**Trimmed Mean**

In the example below, we calculate the 20% trimmed mean.

ods select trimmedmeans;

ods output trimmedmeans=means;

proc univariate trimmed = 0.2 data=sashelp.shoes;

var sales;

run;

**7. Calculate One-Sample T-test**

It tests the null hypothesis that the mean of the variable is equal to 0. The alternative hypothesis is that the mean is not equal to 0. When you run PROC UNIVARIATE, it generates a one-sample t-test by default in the 'Tests for Location' section of the output.

ods select TestsForLocation;

proc univariate data=sashelp.shoes;

var sales;

run;

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the mean value of the variable is significantly different from zero.

Ttest with PROC Univariate
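The statistic behind this test is simple enough to compute by hand. The Python sketch below (not SAS) returns the t statistic and its degrees of freedom for the null hypothesis that the mean equals a given value. The function name `one_sample_t`, its `mu0` argument and the data are hypothetical; the p-value is left out because it needs the t distribution, which the Python standard library does not provide.

```python
import math


def one_sample_t(values, mu0=0.0):
    """Return (t statistic, degrees of freedom) for H0: mean = mu0."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Sample standard deviation with the n - 1 divisor.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("t statistic is undefined when all values are equal")
    standard_error = s / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1


# Hypothetical values, used only for illustration.
sample = [2, 4, 6, 8]
t_stat, df = one_sample_t(sample)
print(round(t_stat, 4), df)
```

A large absolute t value relative to the t distribution with n - 1 degrees of freedom corresponds to a small p-value, which is the figure to read in the 'Tests for Location' table.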

**8. Generate Plots**

PROC UNIVARIATE generates the following plots :

- Histogram
- Box Plot
- Normal Probability Plot

The **PLOT** keyword is used to generate plots.

proc univariate data=sashelp.shoes PLOT;

var sales;

run;
