SAS SQL : Find records only exist in one table but not other

It is one of the most common data manipulation problem to find records that exist only in table 1 but not in table 2. This post includes 3 methods with PROC SQL and 1 method with data step to solve it. This problem statement is also called 'If a and not b' in SAS. It means pull records that exist only in Table A but not in Table B (Exclude the common records of both the tables). See the Venn Diagram below -
If a and not b

The main thing to focus in Venn Diagram is the intersection area of table A and table B. It is NOT highlighted in red because we don't want the records which are common in both the tables.

Let's create a sample data

If you look at the tables below, we are looking to fetch all the records from table1 except 'Ram' and 'Priya' as these two names are in table2.
Input Data

Create two datasets in SAS

The following programs create two data sets in SAS which are used to demonstrate methods to solve this problem.
data dataset1;
input name $;
cards;
Dave
Ram
Sam
Matt
Priya
;
run;
data dataset2;
input name$;
cards;
Ram
Priya
;
run;
In SQL, there are multiple ways to solve this problem. The methods are listed below -

Method I - NOT IN Operator

The simplest method is to write a subquery and use NOT IN operator, It tells system not to include records from dataset 2.
proc sql;
select * from dataset1
where name not in (select name from dataset2);
quit;
The output is shown in the image below -
Output

Method II - LEFT JOIN with NULL Operator

In this method, we are performing left join and telling SAS to include only rows from table 1 that do not exist in table 2.
proc sql;
select a.name from dataset1 a
left join dataset2 b
on a.name = b.name
where b.name is null;
quit;
How it works -
In the first step, it reads common column from the both the tables - a.name and b.name. At the second step, these columns are matched and then the b.name row will be set NULL or MISSING if a name exists in table A but not in table B. At the next step, WHERE statement with 'b,name is null' tells SAS to keep only records from table A.

Method III -  Not Exists Correlated SubQuery

NOT EXISTS subquery writes the observation to the merged dataset only when there is no matching rows of a.name in dataset2. This process is repeated for each rows of variable name.
proc sql;
select a.name from
dataset1 a
where not exists (select name from dataset2 b
where a.name = b.name);
quit;
How it works -

Step 1 - At the background, it performs left join of the tables -
proc sql;
create table step1 as
select a.* from dataset1 a
left join dataset2 b
on a.name = b.name;
quit;
Step 2 - At the next step, it checks common records by applying INNER JOIN
proc sql;
create table step2 as
select a.name from dataset1 a, dataset2 b
where a.name = b.name;
quit;

Step 3 - At the last step, it excludes common records.
proc sql;
select * from step1
where name not in (select distinct name from step2) ;
quit;

Method IV : SAS Data Step MERGE Statement

In SAS Data Step, it is required to sort tables by the common variable before merging them. Sorting can be done with PROC SORT.
proc sort data = dataset1;
by name;
run;
proc sort data = dataset2;
by name;
run;
Data finaldata;
merge dataset1 (in=a) dataset2(in=b);
by name;
if a and not b;
run;
The MERGE Statement joins the datasets dataset1 and dataset2 by the variable name.

Q. Which is the most efficient method?

To answer this question, let's create two larger datasets (tables) and compare the 4 methods as explained above.

Table1 - Dataset Name : Temp, Observations - 1 Million, Number of Variables - 1

Table2 - Dataset Name : Temp2, Observations - 10K, Number of Variables - 1
data temp;
length x $15.;
do i = 1 to 1000000;
x = "AA"||strip(i);
output;
end;
drop i;
run;
data temp2;
length x $15.;
do i = 1 to 10000;
x = "AA"||strip(i);
output;
end;
drop i;
run;

Result
SAS Dataset MERGE (Including prior sorting) took least time (1.3 seconds) to complete this operation, followed by NOT IN operator in subquery which took 1.4 seconds and then followed by LEFT JOIN with WHERE NULL clause (1.9 seconds). The NOT EXISTS took maximum time.

Tip - In many popular forums, it is generally advised to use NOT EXISTS rather than NOT IN. This advise is generally taken out of context. Modern softwares use SQL optimizer to process any SQL query. Some softwares may consider both the queries as same in terms of execution so there would not be a noticeable difference in their CPU timings. Some may be in favor of NOT EXISTS. SAS seems to be in favor of NOT IN operator as it does not require tables to be merged.
Related Posts
Spread the Word!
Share
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

4 Responses to "SAS SQL : Find records only exist in one table but not other "

Next → ← Prev
Ad-Blocker Detected