This post demonstrates techniques to find unique and duplicate values in a SAS data set. It is one of the most common interview questions as it is commonly used in day-to-day data management activities. SAS has some easy inbuilt options to handle duplicate records.
Below is a sample data set that can be used for demonstration.SAMPLE DATA SET
ID | Name | Score |
---|---|---|
1 | David | 45 |
1 | David | 74 |
2 | Sam | 45 |
2 | Ram | 54 |
3 | Bane | 87 |
3 | Mary | 92 |
3 | Bane | 87 |
4 | Dane | 23 |
5 | Jenny | 87 |
5 | Ken | 87 |
6 | Simran | 63 |
8 | Priya | 72 |
Create this data set in SAS
data readin; input ID Name $ Score; cards; 1 David 45 1 David 74 2 Sam 45 2 Ram 54 3 Bane 87 3 Mary 92 3 Bane 87 4 Dane 23 5 Jenny 87 5 Ken 87 6 Simran 63 8 Priya 72 ; run;
There are several ways to identify unique and duplicate values:
1. PROC SORT
In PROC SORT, there are two options by which we can remove duplicates.
- NODUPKEY Option
- NODUP Option
The NODUPKEY option removes duplicate observations where value of a variable listed in BY statement is repeated while NODUP option removes duplicate observations where values in all the variables are repeated (identical observations).
The difference between these two options are explained in detail below with SAS codes-
PROC SORT DATA = readin NODUPKEY; BY ID; RUN; |
PROC SORT DATA = readin NODUP; BY ID; RUN; |
The output is shown below :
NODUPKEY | NODUP | |||||||
ID | Name | Score | ID | Name | Score | |||
1 | David | 45 | 1 | David | 45 | |||
2 | Sam | 45 | 1 | David | 74 | |||
3 | Bane | 87 | 2 | Sam | 45 | |||
4 | Dane | 23 | 2 | Ram | 54 | |||
5 | Jenny | 87 | 3 | Bane | 87 | |||
6 | Simran | 63 | 3 | Mary | 92 | |||
8 | Priya | 72 | 3 | Bane | 87 | |||
4 | Dane | 23 | ||||||
5 | Jenny | 87 | ||||||
5 | Ken | 87 | ||||||
6 | Simran | 63 | ||||||
8 | Priya | 72 |
The NODUPKEY has deleted 5 observations with duplicate values whereas NODUP has not deleted any observations.
Although ID 3 has two identical records (See observation 5 and 7), NODUP option has not removed them. It is because they are not next to one another in the dataset and SAS only looks at one record back.
To fix this issue, sort on all the variables in the dataset READIN.To sort by all the variables without having to list them all in the program, you can use the keyword ‘_ALL_’ in the BY statement (see below).
PROC SORT DATA = readin NODUP; BY _all_; RUN;
The output is shown below :
SAS NODUP Option |
STORING DUPLICATES
Use the DUPOUT= option with NODUPKEY (or NODUP) to output duplicates to the specified SAS data set:
PROC SORT DATA = readin NODUPKEY DUPOUT= readin1; BY ID; RUN;
The output is shown below :
Output Dataset |
2. FIRST. and LAST. VARIABLES
FIRST.VARIABLE assigns the value of 1 for the first observation in a BY group and the value of 0 for all other observations in the BY group.
LAST.VARIABLE assigns the value of 1 for the last observation in a BY group and the value of 0 for all other observations in the BY group.
Data set must be in sort order.Use PROC SORT to sort the data set by ID.
PROC SORT DATA = READIN; BY ID; RUN; DATA READIN1; SET READIN; BY ID; First_ID= First.ID; Last_ID= Last.ID; RUN;Note : FIRST./LAST. variables are temporary variables. That means they are not visible in the newly created data set. To make them visible, we need to create two new variables. In the program above, i have created First_ID and Last_ID variables.
Unique and Duplicate Flag |
DATA DUPLICATES UNIQUE; SET READIN; BY ID; First_ID= First.ID; Last_ID= Last.ID; IF NOT (First_ID = 1 and Last_ID = 1) THEN OUTPUT DUPLICATES; ELSE OUTPUT UNIQUE; RUN;
- The DATA statement creates two temporary SAS data sets: DUPLICATES AND UNIQUE.
- The SET statement reads observations from data set READIN
- The BY statement tells SAS to process observations by ID. Variables FIRST.ID and LAST.ID are created.
- The observations where both First_ID and Last_ID do not equal to 1 go to the newly created data set DUPLICATES.
- The ELSE statement outputs all other observations (i.e., where First_ID and Last_ID equal to 1) to data set UNIQUE.
All the questions and Answers are really informative and useful. I heart fully appreciate your effort.Please keep on post latest questions and Answers. They help us a lot...
ReplyDeleteMore stuff on SAS Please!!! Big fan of you :D
ReplyDeleteIf i want to remove reverse duplicates how should I go about it? I have mentioned the data below..For the below data: lets say we have one row which shows Bangalore to Pune and the other record shows Pune to Bangalore. I just want to keep any one of those values..
ReplyDeleteData Distance;
Length Source Destination $10.;
Input Source $ Destination $ Distance;
Cards;
Bangalore Pune 1500
BBSR Bangalore 1300
Mumbai Bangalore 2000
Pune BBSR 1800
Bangalore BBSR 1300
Pune Bangalore 1500
Chennai BBSR 1800
Hyderabad Pune 1000
Hyderabad Mumbai 1600
Delhi Bangalore 2200
Pune Hyderabad 1000
;
Run;
Please suggest!
Thanks,
Hi Unmesh,
DeleteYou can use below code for this...
proc sort data = distance;
by distance;
run;
data newdistance;
set distance;
if source eq lag(destination) then x=1;
else x=2;
if x=2 then output;
drop x;
run;
Please let me know if you have any question...
Regards
Nikhil Jain
Hi Nikhil..
DeleteDoes this solution work
If distance is same from same source or destination to any other source or destinations..
Sample data:
Bangalore pune 1500
Pune bangalore 1500
Shajahanpur pune 1500
Please check this..
DeleteData Distance;
Length Source Destination $10.;
Input Source $ Destination $ Distance;
Cards;
Bangalore Pune 1500
BBSR Bangalore 1300
Mumbai Bangalore 2000
Pune BBSR 1800
Bangalore BBSR 1300
Pune Bangalore 1500
Chennai BBSR 1800
Hyderabad Pune 1000
Hyderabad Mumbai 1600
Delhi Bangalore 2200
Pune Hyderabad 1000
;
Run;
proc sort data=distance;
by distance;
run;
data readin1;
set distance;
by Distance;
First_Distance=First.Distance;
IF (Distance=1500 and First_Distance=1) then delete;
run;
This is more simpler way of doing. Try this
Deleteproc sort data=a2;
by distance;
run;
data a3;
set a2;
by distance;
if first.distance then output;
run;
proc sort data=Distance nodupkey dupout=Sorted;
Deleteby distance;
run;
It's really very helpful...
ReplyDeletevery Helpful..
ReplyDeleteCould someone please help me in this?:How should I do if I want to remove reverse duplicates for the below data: let´s say we have one row which shows interaction between rs1, rs2,through mode1 and mode2 and the other record shows the same interaction but changing the order of the SNPs and consequently of the mode of inheritance?
ReplyDeleters11122 rs33344 rec dom
rs33344 rs11122 dom rec
I just want to keep one of those rows.
Any help would be really appreciated
Data Distance2;
DeleteLength rs1 rs2 $10.;
Input rs1 $ rs2 $ mode1 $ mode2$;
Cards;
rs11122 rs33344 rec dom
rs33344 rs11122 dom rec
;
Run;
proc sort data = Distance2;
by mode1 mode2;
run;
data test;
set distance2;
if rs1 EQ Lag(rs2) then x = 1; else x = 2;
if mode1 EQ Lag(mode2) then y = 1; else y = 2;
if x= 2 and y = 2 then output;
run;
why David is there after _all_? same id same name
ReplyDeleteNODUP option removes duplicate observations where values in all the variables are repeated (identical observations).
DeleteNODUP check entire row, for both David score is different.
proc sort data=Distance nodupkey dupout=Sorted;
ReplyDeleteby distance;
run;