SAS : Identifying and storing unique and duplicate values

This post demonstrates techniques to find unique and duplicate values in a data set.

SAMPLE DATA SET

ID Name Score
1 David 45
1 David 74
2 Sam 45
2 Ram 54
3 Bane 87
3 Mary 92
3 Bane 87
4 Dane 23
5 Jenny 87
5 Ken 87
6 Simran 63
8 Priya 72

Create this data set in SAS



There are several ways to identify unique and duplicate values:

1. PROC SORT 

In PROC SORT, there are two options by which we can remove duplicates.

1. NODUPKEY Option               2. NODUP Option

The NODUPKEY option removes duplicate observations where value of a variable listed in BY statement is repeated while NODUP option removes duplicate observations where values in all the variables are repeated (identical observations).



NODUPKEY

PROC SORT DATA = readin NODUPKEY;
BY ID;
RUN;
NODUP

 PROC SORT DATA = readin NODUP;
 BY ID;
 RUN;

The output is shown below :

NODUPKEY
NODUP
ID
Name
Score
ID
Name
Score
1
David
45
1
David
45
2
Sam
45
1
David
74
3
Bane
87
2
Sam
45
4
Dane
23
2
Ram
54
5
Jenny
87
3
Bane
87
6
Simran
63
3
Mary
92
8
Priya
72
3
Bane
87
4
Dane
23
5
Jenny
87
5
Ken
87
6
Simran
63
8
Priya
72

The NODUPKEY has deleted 5 observations with duplicate values whereas NODUP has not deleted any observations.

Why no value has been deleted when NODUP option is used?

Although ID 3 has two identical records (See observation 5 and 7), NODUP option has not removed them. It is because they are not next to one another in the dataset and SAS only looks at one record back.

To fix this issue, sort on all the variables in the dataset READIN
To sort by all the variables without having to list them all in the program, you can use the keywork ‘_ALL_’ in the BY statement (see below).


The output is shown below :



STORING DUPLICATES

Use the DUPOUT= option with NODUPKEY (or NODUP) to output duplicates to the specified SAS data set: 


The output is shown below :


2. FIRST./LAST. VARIABLES 

FIRST.VARIABLE assigns the value of 1 for the first observation in a BY group and the value of 0 for all other observations in the BY group.

LAST.VARIABLE assigns the value of 1 for the last observation in a BY group and the value of 0 for all other observations in the BY group.

Data set must be in sort order.

Use PROC SORT to sort the data set by ID.



Note : FIRST./LAST. variables are temporary variables. That means they are not visible in the newly created data set. To make them visible, we need to create two new variables. In the program above, i have created First_ID and Last_ID variables.



STORING UNIQUE AND DUPLICATE VALUES




-The DATA statement creates two temporary SAS data sets: DUPLICATES AND UNIQUE.

-The SET statement reads observations from data set READIN 

-The BY statement tells SAS to process observations by ID. Variables FIRST.ID and LAST.ID are created. 

-The observations where both First_ID and Last_ID do not equal to 1 go to the newly created data set DUPLICATES.

-The ELSE statement outputs all other observations (i.e., where First_ID and Last_ID equal to 1) to data set UNIQUE.


SAS Tutorials : 100 Free SAS Tutorials

Get Free Email Updates :
*Please confirm your email address by clicking on the link sent to your Email*

Related Posts:

4 Responses to "SAS : Identifying and storing unique and duplicate values"

  1. All the questions and Answers are really informative and useful. I heart fully appreciate your effort.Please keep on post latest questions and Answers. They help us a lot...

    ReplyDelete
  2. More stuff on SAS Please!!! Big fan of you :D

    ReplyDelete
  3. If i want to remove reverse duplicates how should I go about it? I have mentioned the data below..For the below data: lets say we have one row which shows Bangalore to Pune and the other record shows Pune to Bangalore. I just want to keep any one of those values..

    Data Distance;
    Length Source Destination $10.;
    Input Source $ Destination $ Distance;
    Cards;
    Bangalore Pune 1500
    BBSR Bangalore 1300
    Mumbai Bangalore 2000
    Pune BBSR 1800
    Bangalore BBSR 1300
    Pune Bangalore 1500
    Chennai BBSR 1800
    Hyderabad Pune 1000
    Hyderabad Mumbai 1600
    Delhi Bangalore 2200
    Pune Hyderabad 1000
    ;
    Run;

    Please suggest!

    Thanks,

    ReplyDelete
  4. It's really very helpful...

    ReplyDelete

Next → ← Prev