**PROC MEANS** is one of the most commonly used SAS procedures for analyzing data. It is mainly used to calculate descriptive statistics such as mean, median, count and sum. It can also calculate several other metrics such as percentiles, quartiles, standard deviation and variance, and it can run a one-sample t-test.

**Uses of PROC MEANS**

- Analyze numeric or continuous variables
- Analyze numeric variables by group(s)
- Identify outliers or extreme values
- Test hypotheses with a one-sample t-test

**Dataset Description**

**Download dataset used in the examples**

The data includes seven variables and 499 observations. It comprises survey responses in variables Q1 through Q5 and two demographics, Age and BU (Business Unit). The survey responses range from 1 to 6. To use the dataset in SAS, you can use **PROC IMPORT** to read the data into SAS. See the code below.

proc import datafile='C:\Users\Deepanshu\Downloads\test.xls'

out=test

dbms = xls;

run;

**Simple Example :**

In the **DATA= option**, specify the dataset you want to use. In the **VAR statement**, list the numeric variables you want to analyze. You cannot list character variables in the VAR statement.

Proc Means Data = test;

Var q1 - q5;

Run;

Proc Means Output

**By default**, PROC MEANS generates N, Mean, Standard Deviation, Minimum and Maximum statistics.

**Common Statistical Options**

*The most frequent statistical options used in PROC MEANS are listed below against their description.*

Statistical Option | Description |
---|---|
N | Number of non-missing observations |
NMISS | Number of missing observations |
MEAN | Arithmetic average |
STD | Standard Deviation |
MIN | Minimum |
MAX | Maximum |
SUM | Sum of observations |
MEDIAN | 50th percentile |
P1 | 1st percentile |
P5 | 5th percentile |
P10 | 10th percentile |
P90 | 90th percentile |
P95 | 95th percentile |
P99 | 99th percentile |
Q1 | First Quartile |
Q3 | Third Quartile |

**Other Statistical Options**

Statistical Option | Description |
---|---|
VAR | Variance |
RANGE | Range |
USS | Uncorrected sum of squares |
CSS | Corrected sum of squares |
STDERR | Standard Error |
T | Student's t value for testing H0: population mean = 0 |
PRT | P-value associated with the t-test above |
SUMWGT | Sum of the WEIGHT variable values |
QRANGE | Interquartile range (Q3 - Q1) |

**Limit Descriptive Statistics**

Suppose you want to see only two statistics - number of non-missing values and number of missing values.

Proc Means Data = test N NMISS;

Var q1 - q5 ;

Run;

Displaying number of missing values

**N** refers to the number of non-missing values and **NMISS** to the number of missing values.

**Tip:** Add the **NOLABELS** option to remove the Label column from the PROC MEANS table.

Proc Means data = test N NMISS NOLABELS;

Var q1 - q5;

Run;

**Group the analysis**

Suppose you want to group or classify the analysis by Age. You can use the **CLASS** statement to accomplish this task. It is equivalent to **GROUP BY in SQL**.

Proc Means data = test N NMISS NOLABELS;

Class Age;

Var q1 - q5;

Run;

Option Statement

You can use the **NONOBS** option to remove the N Obs column from the PROC MEANS table.

Proc Means data = test N NMISS NOLABELS NONOBS;

Class Age;

Var q1 - q5;

Run;

**Use Format in Proc Means**

First, you need to create a user-defined format.

Proc Format;

Value Age

1 = 'Less than 25'

2 = '25-34'

3 = '35-43'

4 = '44-50'

5 = '51-59'

6 = '60 or more';

Run;

Then use the **FORMAT** statement to apply the user-defined format in **PROC MEANS**.

Proc Means data = test N MEAN;

Class Age;

Format Age Age.;

Var q1 - q5;

Run;

**Change Sorting Order**

The **DESCENDING** option to the right of the slash in the CLASS statement instructs PROC MEANS to report the groups in **descending** order of the values of Age.

Proc Means Data = test;

Class Age / descending;

Var q1 - q5 ;

Run;

To order the groups by descending frequency count, use the **ORDER=FREQ** option in the CLASS statement.

Proc Means Data = test N;

Class Age/ Order = FREQ;

Var q1 - q5 ;

Run;

ORDER = FREQ Option

You can order the results by the user-defined format of a variable specified in the CLASS statement using the **ORDER=FORMATTED** option in the CLASS statement.

Proc Means data = test N MEAN;

Class Age / Order = formatted;

Format Age Age.;

Var q1 - q5;

Run;

**Note:** If you specify a **CLASS** statement without a **VAR** statement, PROC MEANS analyzes all numeric variables in your data set that are not listed in other statements, grouped by the CLASS variable.

**Grouping and Output in Separate Tables**

Suppose you want to analyze variables Q1 - Q5 by variable AGE and want the output for each level of AGE in **separate tables**. You can use the **BY statement** to accomplish this task. **Make sure** you sort the data before using the BY statement. See the example below.

proc sort data= test;

by age;

run;

proc means data = test;

by age;

var q1 - q5 ;

run;

*Difference between CLASS and BY statement*

The CLASS statement returns the analysis for a grouping (classification) variable in a **single table**, whereas the BY statement returns the analysis for a grouping variable in **separate tables**. Another difference is that the CLASS statement does not require the data to be pre-sorted by the classification variable, whereas the BY statement requires sorting.

PROC MEANS Output

**Save output in a data set**

You can use the **NOPRINT** option to tell SAS not to print the results in the output window.

Proc Means data = test NOPRINT;

Class Age / Order = formatted;

Format Age Age.;

Var q1 - q5;

Output out = readin mean= median= / autoname;

Run;

In the above code, readin is the data set in which the output is stored. The **MEAN= MEDIAN=** options tell SAS to write the mean and median to the output data set. The **AUTONAME** option automatically assigns unique variable names in the output data set to the statistics requested in the **OUTPUT** statement.

You can use the **AUTOLABEL** option to automatically assign unique labels to the variables that hold the statistics requested in the **OUTPUT** statement.

Proc Means Data = test noprint;

Class Age ;

Var q1 q2;

Output out=F1 mean= / autoname autolabel;

Run;

You can specify the variables for which you want summary statistics to be saved in an output data set.

Proc Means Data = test noprint;

Class Age ;

Var q1 q2;

Output out=F1 mean(q1)= median(q2)= / autoname;

Run;

You can give custom names to the variables stored in an output data set.

Proc Means Data = test noprint;

Class Age;

Var q1 - q5 ;

Output out=F1 mean=_mean1-_mean5 median=_median1-_median5;

Run;

**DROP= and KEEP= options**

We can use the DROP= and KEEP= data set options to remove or keep specific variables in the output data set.

Proc Means Data = test noprint;

Class Age;

Var q1 - q5 ;

Output out=F1(drop = _type_ _freq_) mean=_mean1-_mean5 median=_median1-_median5;

Run;

**WHERE Statement**

The WHERE statement is used to filter or subset data. In the code below, we filter on variable Q1 and tell SAS to keep only those observations in which the value of Q1 is greater than 1.

Proc Means Data = test noprint;

Where Q1 > 1;

Class Age;

Var q1 - q5 ;

Output out=F1(drop= _FREQ_) mean= median= / autoname;

Run;

Like the WHERE statement, the **WHERE= data set option** can be used to filter data. See the following program.

Proc Means Data = test(Where=( Q1 > 1)) noprint;

Class Age;

Var q1 - q5 ;

Output out=F1(drop= _FREQ_) mean= median= / autoname;

Run;

**GROUPING on 2 (or more) Variables**

When two or more variables are included in the CLASS statement, PROC MEANS computes statistics for every combination of the classification variables and identifies each level in the **_TYPE_** variable of the output data set. Suppose we specify variables AGE and BU in the CLASS statement. **_TYPE_ = 0** holds the overall statistics across all observations. **_TYPE_ = 1** holds the mean and median of variables Q1-Q5 by BU, which can be selected by using **WHERE = ( _TYPE_ = 1)**. The same analysis by AGE is stored under **_TYPE_ = 2**. When **_TYPE_ = 3**, SAS returns the analysis by both AGE and BU.

Proc Means Data = test noprint;

Class Age BU;

Var q1 - q5 ;

Output out=F1 (where=(_type_=1) drop= AGE _FREQ_) mean= median= / autoname;

Output out=F2 (where=(_type_=2) drop= BU _FREQ_) mean= median= / autoname;

Output out=F3 (where=(_type_=3) drop= _FREQ_) mean= median= / autoname;

Run;
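As a quick check of how _TYPE_ is assigned, the sketch below writes every grouping level to a single data set and then tabulates _TYPE_. The data set name all_types and the variable name mean_q1 are illustrative choices, not part of the original example.

```sas
/* Sketch only: all_types and mean_q1 are hypothetical names. */
proc means data=test noprint;
  class Age BU;
  var q1;
  /* No WHERE= filter, so rows for every _TYPE_ value are kept */
  output out=all_types mean=mean_q1;
run;

/* Count the rows written for each _TYPE_ value:
   0 = overall, 1 = by BU, 2 = by Age, 3 = by Age and BU */
proc freq data=all_types;
  tables _type_;
run;
```

With two CLASS variables, the frequency table should list the values 0 through 3, one row count per grouping level.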

The **NWAY** option instructs PROC MEANS to output only observations with the **highest value of _TYPE_** to the new data set it is creating.

Proc Means Data = test nway noprint;

Class Age;

Var q1 - q5 ;

Output out=F1 mean=_mean1-_mean5 median=_median1-_median5;

Run;

By default, PROC MEANS analyzes the numeric analysis variables at all possible combinations of the values of the classification variables. With the **TYPES statement**, only the analyses specified in it are carried out by PROC MEANS.

Proc Means Data = test noprint;

Class Age BU Q1;

Types()

Age * BU

Age * BU * Q1;

Var q1 - q5;

Output out=F1 mean=_mean1-_mean5 median=_median1-_median5;

Run;

**DESCENDTYPES** option: Orders rows/observations in the output data set by descending value of _TYPE_.

Proc Means Data = test DESCENDTYPES noprint;

Class Age;

Var q1 - q5 ;

Output out=F1 mean=_mean1-_mean5 median=_median1-_median5;

Run;

**Multiple CLASS Statements**

Multiple CLASS statements give the user control over how the levels of each classification variable are displayed or written to new data sets created by PROC MEANS. For example, one classification variable can be displayed in descending order while the others keep the default order.

Proc Means Data = test noprint;

Class Age / descending;

Var q1 - q5 ;

Class BU;

Output out=F1 mean=_mean1-_mean5 median=_median1-_median5;

Run;

**Identifying Extreme Values of Analysis Variables using the IDGROUP Option**

proc means data=electric.electricity noprint nway;

class transformer;

var total_revenue ;

output out= F1

idgroup (max(total_revenue) out[2] (total_revenue)=maxrev)

idgroup (min(total_revenue) out[2] (total_revenue)=minrev)

sum= mean= /autoname;

run;
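The same pattern can be applied to the survey data used in the rest of this tutorial. In the sketch below, the first IDGROUP picks, within each Age group, the two observations with the largest values of Q1 and writes those values to the output data set; the second IDGROUP does the same for the two smallest values. The names extremes, max_q1 and min_q1 are illustrative assumptions.

```sas
/* Sketch only: extremes, max_q1 and min_q1 are hypothetical names. */
proc means data=test noprint nway;
  class Age;
  var q1;
  output out=extremes
    /* two largest Q1 values per Age group */
    idgroup (max(q1) out[2] (q1)=max_q1)
    /* two smallest Q1 values per Age group */
    idgroup (min(q1) out[2] (q1)=min_q1);
run;

proc print data=extremes;
run;
```

Because OUT[2] requests two observations per group, SAS appends a numeric suffix to each requested name, so the output should contain max_q1_1, max_q1_2, min_q1_1 and min_q1_2.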

**Sample T-Test**

With PROC MEANS, we can perform hypothesis testing using a one-sample t-test.

**Null Hypothesis -** Population mean of Q1 is equal to 0.

**Alternative Hypothesis -** Population mean of Q1 is not equal to 0.

proc means data = test t prt;

var Q1;

run;

The **PRT option** returns the p-value, which is the lowest level of significance at which we can reject the null hypothesis. Since the p-value is less than 0.05, we can reject the null hypothesis and conclude that the mean is significantly different from zero.

**Difference between PROC MEANS and PROC FREQ**

PROC MEANS is used to calculate summary statistics such as mean, count, etc. of numeric variables. It requires at least one numeric variable, whereas PROC FREQ has no such limitation. In other words, if you have only one character variable to analyze, PROC FREQ is the procedure to use.
