This tutorial explains how to identify first and last observations within a group. It is a common data cleaning challenge to remove duplicates or store unique values. In SQL, we use window functions such as rank over() to generate serial numbers among a group of rows. In SAS, we can create

**first. and last. variables**to achieve this task.**First. and Last. Variables****FIRST.VARIABLE**assigns the value of 1 for the**first**observation in a BY group and the value of 0 for all other observations in the BY group.**LAST.VARIABLE**assigns the value of 1 for the

**last**observation in a BY group and the value of 0 for all other observations in the BY group.

Note : Data set must be sorted BY group before applying FIRST. and LAST. Variables.

**SAMPLE DATA SET**

Suppose you have a dataset consisting 3 variables and 12 observations. The variables are ID, Name and Score. The variable ID is a grouping variable and it contains duplicates.

ID | Name | Score |
---|---|---|

1 | David | 45 |

1 | David | 74 |

2 | Sam | 45 |

2 | Ram | 54 |

3 | Bane | 87 |

3 | Mary | 92 |

3 | Bane | 87 |

4 | Dane | 23 |

5 | Jenny | 87 |

5 | Ken | 87 |

6 | Simran | 63 |

8 | Priya | 72 |

**Create this data set in SAS**

The program below creates the dataset in SAS. Copy the program below and paste it into SAS program editor and run/submit it.

Use

**PROC SORT**to sort the data set by ID. It is required to sort the data before using first. and last. variables.

**Note : FIRST./LAST.**variables are

**temporary variables**. That means they are not visible in the newly created data set. To make them visible, we need to create two new variables. In the program above, i have created First_ID and Last_ID variables.

SAS : FIRST. vs. LAST. Variables |

**How it works**

FIRST.variable = 1 when an observation is the first observation in each group values of variable ID.

FIRST.variable = 0 when an observation is not the first observation in each group values of variable ID.

LAST.variable = 1 when an observation is the last observation in each group values of variable ID.

LAST.variable = 0 when an observation is not the last observation in each group values of variable ID.

WhenFIRST.variable = 1andLAST.VARIABLE = 1, it means there is only a single value in the group. (See ID = 4 in the above data for reference)

**Selecting First Observation within a Group**

Suppose you need to select only the first observation among a group of observations. It is very easy to do it with IF statement. The IF statement subsets data when IF is not used in conjunction with THEN or ELSE statements.

PROC SORT DATA = READIN;

BY ID;

RUN;

DATA READIN1;

SET READIN;

BY ID;

IF FIRST.ID;

PROC PRINT;

RUN;

Output : First. Variable |

**Note :**It returns first observation among values of a group (total 7 observations).

**Selecting Last Observation within a Group**

Suppose you are asked to include only last observation from a group. Like the previous example, we can use last. variable to subset data.

PROC SORT DATA = READIN;

BY ID;

RUN;

DATA READIN1;

SET READIN;

BY ID;

IF LAST.ID;

PROC PRINT;

RUN;

SAS : Last. Variable |

**Q. Can we use WHERE instead of IF with First. and Last. Variables?**

No. WHERE statement cannot be used with First. and Last. Variables. It is because WHERE statement requires variables already be created in the dataset before processing.

**Generate Serial Number in a Group**

Suppose you need to create serial numbers among a group of observations. See the snapshot below -

Generate Serial Number in a Group |

Data temp;In the above program, we are setting N=1 when it is the first value of a group i.e.

set readin;

by ID;

if first.id then N = 1;

else N +1;

proc print;

run;

**ID**. Otherwise adding 1 to N.

**The N+1 implies N = N + 1**in BY group processing

**.**When there is a second observation in a group, N+1 adds 1 to N=1 so N becomes 2. It further increments by 1 when there is third observation in the group and so on.

**Calculate Cumulative Score by Group**

Suppose you need to calculate running cumulative score by variable ID.

Cumulative Score by Group |

Data temp;In the above program, we are setting

set readin;

by ID;

if first.id then CumScore = Score;

else CumScore + Score;

proc print;

run;

**Cumscore = Score**when it is the first value of a group i.e. ID. Otherwise adding Score to Cumscore. The Cumscore+Score implies

**CumScore = CumScore + Score**in BY group processing.

**STORING UNIQUE AND DUPLICATE VALUES**

data unique duplicates;-The DATA statement creates two temporary SAS data sets:

set readin;

by id;

if first.id = 1 and last.id = 1 then output unique;

else output duplicates;

run;

**DUPLICATES AND UNIQUE.**

-The SET statement reads observations from data set

**READIN**-The BY statement tells SAS to process observations by ID. Variables

**FIRST.ID**and

**LAST.ID**are created.

-If the first and last observation is the same value, it implies it is a

**unique value**else the value is duplicate.

**Case Studies**

1. Identify and select only records having maximum

**Score**among a group of observations of variable

**ID**

2. Select

**unique**observations plus

**second observation**from duplicate observations of variable

**ID**

**Solution 1**

First we need to build a logic how we can select records having max score within variable ID. We can do it via PROC SORT. In this case, we need to sort data by 2 variables - first sorting on variable ID and then next sorting on Score by descending order. The DESCENDING keyword is used in PROC SORT to arrange data from largest to smallest. Sorting on descending order is used to place the max value at first observation in each group of ID.

proc sort data= readin;After sorting, we retain records having maximum value by using FIRST. and IF statement. The IF FIRST.ID keeps the first record among a group of values of variable ID. The output is shown in the image below.

by ID descending score;

run;

data readin1;

set readin;

by ID;

if first.id;

run;

Output |

*Solve second case study yourself and post your answer in the comment box below. Make sure it should be solved in one data step code. Hint : BOTH FIRST. and LAST. variables would be used.*
nice explaination

ReplyDeletegood explanation and good examples

ReplyDeletedata test8;

ReplyDeleteset readin;

by ID;

if first.ID =1 and last.ID = 1 then output test8;

else if mod(_n_,2) = 0 then output test8;

run;

proc print; run;

data uniq_or_Dup2;

Deleteset readlin;

if (first.Id=1 and last.id=1) or (lag1(first.id)=1 and first.id=0) then output uniq_or_Dup2;

proc print

run;

proc sort data=data;

ReplyDeleteby ID;

run;

data data1;

set data;

by ID;

if first.ID=1 then serial=1;

else serial+1;

run;

data data2;

set data1;

by id;

if first.ID=1 and last.ID=1 or serial=2;

run;

if we have four duplicates how can we get second one ?

ReplyDeletefirst ne 1 and last. ne 1;

Deletedata test8;

ReplyDeleteset readin;

by ID;

if first.ID =1 and last.ID = 1 then output test8;

if first.id and last.id=0 then N=1;

else N+1;

if N=2 then output test8;

drop N;

run;

data nk;

ReplyDeleteset readin;

by id;

if first.id then N=1;

else N+1;

if N=2 then output;

if first.id and last.id then output;

drop N;

proc print;

run;

data new2;

ReplyDeleteset readin1;

retain cnt;

if first_id=1 then cnt=1;

else cnt+1;

if cnt le 2;

run;

this is the best logic ,i guess. you forgot to add by statement. i am just starter and took time to findout that ...

Deleterajeshmanepalli987@gmail.com,if you are comfortable and ready to help, I have some doubts adn will drop a mail to you.If you are intersted ,no problem. thanks.

Deletedata ad;

ReplyDeleteset abcd_s;

by x;

if first.x and last.x then output;

if first.x then n=1;

else n+1;

if n=2 then output;

drop n;

run;

Data Readin2;

ReplyDeleteset readin;

by id;

if first.id then n =1;

Else n+1;

if first.id = 1 and last.id= 1 or n = 2;

drop n

Run;

Its very helpful, I have come across many questions in my interviews

ReplyDeletedata gaurav;

ReplyDeleteset readin;

by id;

if first.id then N=1;

else N+1;

if (first.id=1 and last.id=1) then output;/* unique observation*/

if N=2 then output;/* second observation from duplicate*/

run;

proc print;

run;

If we have 4 or more than 4 duplicate ids then what would be the logic to find out 2nd observation?

ReplyDeleteThanks!

proc sort data= readin;

ReplyDeleteby ID descending score;

run;

data readin1;

set readin;

by ID;

if last.id;

run;

if i want to calculate average per person then how to calculate??

ReplyDeleteproc sort data= readin;

Deleteby ID Score;

run;

data temp;

set readin;

by ID;

if first.ID then

do;

N=1;

cumScore = Score;

avgScore = Score;

end;

else

do;

N+1;

cumScore + Score;

avgScore = cumScore/N;

end;

if last.ID = 1;

run;

Doesn't generating serial numbers as well as cumulative rank require the newly created variables - N and Cum Score to be retained. Else, SAS will not be able to store their values jumping from one obs to the next ?

ReplyDeleteBelow code is working for case study 2. Let me know if anybody still facing any problem in solving or understanding this question.

ReplyDeleteproc sort data=READIN;

BY ID Descending score;

run;

DATA READIN2;

SET READIN;

BY ID ;

First_ID= First.ID;

Last_ID= Last.ID;

IF first.id then N=1;

ELSE N+1;

if (FIRST_ID=1 and LAST_ID=1) or N=2 then output READIN2 ;

drop First_ID Last_ID N;

proc print;

run;

Gd one.

DeleteHow can we use ELSE N+1 in data step with using retain? Data step works obs by obs din't N loses its previous value in every next iteration?

ReplyDeleteI know, I am missing something, but what?

Retain can be used in two ways:

Delete1. Implicitly : We can use retain implicitly by using +1 notation. If we use retain implicitly then we don't need to write retain statement. just like above code.

Another example :

data a;

set xyz;

by id;

if first.id then x=1;

else x+1 : /which means x=x+1 but we are using retain implicitly so no need to mention x= here;

run;

2. Explicitly

Using the same example where we are using retain explicitly so need to mention retain statement.

data a;

set xyz;

by id;

retain x;

if first.id then x=1;

else x=x+1 : /here we are using x= so need to mention retain statement;

run;

This comment has been removed by the author.

ReplyDelete

ReplyDeleteproc sort data=test;

by id score;

run;

data readin;

set test;

by id score;

first_var=first.id;

last_var=last.id;

if (first_var=1 and last_var=1) or last_var=1;

run;

proc sort data=test;

ReplyDeleteby id;

run;

data readin;

set test;

by id;

if (first.id=1 and last.id=1) then output unique;

if (last.id=1) then output second_dup;

run

proc print;

run;

"if (last.id=1) then output second_dup;" will give unique records as well.

Deletemoreover if we have more than 2 records for for any ID then it will give last record for that and not the second record.

So this will work only for those IDs which are having exactly 2 records only.

It will be better if you can assign serial number by grouping on ID basis and then select value for that serial no=2 for 2nd observation.

Please let me know if you are having any confusion at any point.

data readin1;

ReplyDeleteset readin;

by id;

if first.id=1 and last.id=1 then serial=1;

else serial+1 ;

run;

data readin2;

set readin1;

where serial=1 or serial=2;

run;

This comment has been removed by the author.

ReplyDeletedata aaa;

ReplyDeleteset readin;

by id;

if first.id then n=1;

else n+1;

if not (first.id and last.id) and n<=2;

run;

data unique (drop =N) second (drop =N);

ReplyDeleteset readin;

by id;

retain N;

if first.id =1 and last.id =1 then output unique ;

if first.id then N= 1;

else N= N+1;

if N =2 then output second;

run;

PROC SORT DATA=READIN;

ReplyDeleteBY ID;

RUN;

DATA TEMP;

SET READIN;

BY ID;

IF FIRST.ID=1 AND LAST.ID=1 THEN OUTPUT TEMP;

ELSE IF MOD(_N_,2)=0 THEN OUTPUT TEMP;

RUN;

PROC PRINT;

RUN;