This tutorial explains how to identify first and last observations within a group. It is a common data cleaning challenge to remove duplicates or store unique values. In SQL, we use window functions such as rank over() to generate serial numbers among a group of rows. In SAS, we can create first. and last. variables to achieve this task.
First. and Last. Variables
FIRST.VARIABLE assigns the value of 1 for the first observation in a BY group and the value of 0 for all other observations in the BY group.
First. and Last. Variables
LAST.VARIABLE assigns the value of 1 for the last observation in a BY group and the value of 0 for all other observations in the BY group.
Note : Data set must be sorted BY group before applying FIRST. and LAST. Variables.
SAMPLE DATA SET
Suppose you have a dataset consisting 3 variables and 12 observations. The variables are ID, Name and Score. The variable ID is a grouping variable and it contains duplicates.
Create this data set in SAS
The program below creates the dataset in SAS. Copy the program below and paste it into SAS program editor and run/submit it.
Use PROC SORT to sort the data set by ID. It is required to sort the data before using first. and last. variables.
Note : FIRST./LAST. variables are temporary variables. That means they are not visible in the newly created data set. To make them visible, we need to create two new variables. In the program above, i have created First_ID and Last_ID variables.
|SAS : FIRST. vs. LAST. Variables|
FIRST.variable = 1 when an observation is the first observation in each group values of variable ID.
FIRST.variable = 0 when an observation is not the first observation in each group values of variable ID.
LAST.variable = 1 when an observation is the last observation in each group values of variable ID.
LAST.variable = 0 when an observation is not the last observation in each group values of variable ID.
When FIRST.variable = 1 and LAST.VARIABLE = 1, it means there is only a single value in the group. (See ID = 4 in the above data for reference)
Suppose you need to select only the first observation among a group of observations. It is very easy to do it with IF statement. The IF statement subsets data when IF is not used in conjunction with THEN or ELSE statements.
PROC SORT DATA = READIN;
|Output : First. Variable|
Note : It returns first observation among values of a group (total 7 observations).
Selecting Last Observation within a Group
Suppose you are asked to include only last observation from a group. Like the previous example, we can use last. variable to subset data.
PROC SORT DATA = READIN;
|SAS : Last. Variable|
Q. Can we use WHERE instead of IF with First. and Last. Variables?
No. WHERE statement cannot be used with First. and Last. Variables. It is because WHERE statement requires variables already be created in the dataset before processing.
Generate Serial Number in a Group
Suppose you need to create serial numbers among a group of observations. See the snapshot below -
|Generate Serial Number in a Group|
Data temp;In the above program, we are setting N=1 when it is the first value of a group i.e. ID. Otherwise adding 1 to N. The N+1 implies N = N + 1 in BY group processing. When there is a second observation in a group, N+1 adds 1 to N=1 so N becomes 2. It further increments by 1 when there is third observation in the group and so on.
if first.id then N = 1;
else N +1;
Calculate Cumulative Score by Group
Suppose you need to calculate running cumulative score by variable ID.
|Cumulative Score by Group|
Data temp;In the above program, we are setting Cumscore = Score when it is the first value of a group i.e. ID. Otherwise adding Score to Cumscore. The Cumscore+Score implies CumScore = CumScore + Score in BY group processing.
if first.id then CumScore = Score;
else CumScore + Score;
data unique duplicates;-The DATA statement creates two temporary SAS data sets: DUPLICATES AND UNIQUE.
if first.id = 1 and last.id = 1 then output unique;
else output duplicates;
-The SET statement reads observations from data set READIN
-The BY statement tells SAS to process observations by ID. Variables FIRST.ID and LAST.ID are created.
-If the first and last observation is the same value, it implies it is a unique value else the value is duplicate.
1. Identify and select only records having maximum Score among a group of observations of variable ID
2. Select unique observations plus second observation from duplicate observations of variable ID
First we need to build a logic how we can select records having max score within variable ID. We can do it via PROC SORT. In this case, we need to sort data by 2 variables - first sorting on variable ID and then next sorting on Score by descending order. The DESCENDING keyword is used in PROC SORT to arrange data from largest to smallest. Sorting on descending order is used to place the max value at first observation in each group of ID.
proc sort data= readin;After sorting, we retain records having maximum value by using FIRST. and IF statement. The IF FIRST.ID keeps the first record among a group of values of variable ID. The output is shown in the image below.
by ID descending score;