This tutorial is designed for beginners who want to get started with PROC SQL Joins. It explains different types of joins and the equivalent data step merge code for these joins.
Lesson 1 : Proc SQL Fundamentals with 20 Examples
Advantages of PROC SQL Joins over Data Step Merging
1. Cross Join / Cartesian product
The Cartesian product returns a number of rows equal to the product of all rows (observations) in all the tables (data sets) being joined. For example, if the first table has 10 rows and the second table has 10 rows, there will be 100 rows (10 * 10) in the merged table (data set).
Create these two data sets into SAS
Explanation
Inner Join : Data Step Code
It returns all rows from the left table with the matching rows from the right table.
Explanation
Left Join : Data Step Code
It returns all rows from the right table that do not match any row with the left-hand table, and the matched rows from the left-hand table.
Explanation
Right Join : Data Step Code
Key takeaway : The FULL JOIN suffers the same difficulty as the RIGHT JOIN. Namely, the common variable values are lost from the right-hand data set. The COALESCE function can solve this difficulty.
Explanation
Full Join : Data Step Code
By default, MERGE statement performs full join so IN variables are not required.
Data Step MERGE statement will return the maximum number of missing values in the primary key in both the tables. In this case, it would return 3 missing values i.e. max(3,1).
Example 2 : One to Many Relationship
Lesson 1 : Proc SQL Fundamentals with 20 Examples
Advantages of PROC SQL Joins over Data Step Merging
- PROC SQL joins do not require sorted tables (data sets), while you need to have two data sets sorted when using MERGE statement
- PROC SQL joins do not require that common variable have the same name in the data sets you are joining, while you need to have common variable name listed in BY option when using MERGE statement.
- PROC SQL joins can use comparison operators other than the equal sign (=).
- PROC SQL can handle many to many relationship well whereas Data Step Merge do not.
1. Cross Join / Cartesian product
The Cartesian product returns a number of rows equal to the product of all rows (observations) in all the tables (data sets) being joined. For example, if the first table has 10 rows and the second table has 10 rows, there will be 100 rows (10 * 10) in the merged table (data set).
Create these two data sets into SAS
Data A;
Input ID Name$ Height;
cards;
1 A 1
3 B 2
5 C 2
7 D 2
9 E 2
;
run;
Data B;Cartesian Product : SQL Code
Input ID Name$ Weight;
cards;
2 A 2
4 B 3
5 C 4
7 D 5
;
run;
PROC SQL;
Create table dummy as
Select * from A as x cross join B as y;
Quit;
![]() |
Cartesian or Cross Product |
Key takeaways
1. Since the first data set has 5 rows and the second data set has 4 rows, there are 20 rows (5 * 4) in the merged data set.
2. The 'as' keyword (aka alias) is used to assign a table a temporary name.
3. Since the ID values of the first data set is different than the ID values of the second data set, the ID given in the joined data set is misleading.
2. Inner Join
It returns rows common to both tables (data sets). If we select * keyword in the query, the final merged file would have number of columns equal to (Common columns in both the data sets + uncommon columns from data set A + uncommon columns from data set B).![]() |
Venn Diagram : Inner Join |
PROC SQL;
Create table dummy as
Select * from A as x, B as y
where x.ID = y.ID;
Quit;
![]() |
Inner Join |
Since the above case is of type INNER JOIN, it returns values 5 and 7 from the variable ID in the combined table as these two values are common in both the datasets A and BAnother way to write the above code -
PROC SQL;Both the codes produce same result.
Create table dummy as
Select * from A as x inner join B as y
On x.ID = y.ID;
Quit;
Inner Join : Data Step Code
Data dummy;
Merge A (IN = X) B (IN=Y);
by ID;
If X and Y;
run;
3. Left Join
![]() |
Left Join Venn Diagram |
PROC SQL;
Create table dummy as
Select * from A as x left join B as y
On x.ID = y.ID;
Quit;
![]() |
Left Join |
Since the above case is of type LEFT JOIN, it returns all rows from the table (dataset) A with the matching rows from the dataset B.
Left Join : Data Step Code
Data dummy;
Merge A (IN = X) B (IN=Y);
by ID;
If X ;
run;
4. Right Join
![]() |
Right Join Venn Diagram |
PROC SQL;
Create table dummy as
Select * from A as x right join B as y
On x.ID = y.ID;
Quit;
![]() |
Right Join |
Note : The right-hand table ID values are missing in the merged table.
To add the missing right hand table ID values to a right join, you can use the SQL COALESCE function. The COALESCE function returns the first non-missing argument.
proc sql;
create table dummy as
select coalesce (x.ID,y.ID) as ID, coalesce (x.name,y.name) as name,height,weight
from a as x right join b as y
on x.id = y.id;
quit;
![]() |
Right Join with Coalesce |
Since the above case is of type RIGHT JOIN, it returns all rows from the table (dataset) B with the matching rows from the dataset A.
Right Join : Data Step Code
Data dummy;
Merge A (IN = X) B (IN=Y);
by ID;
If Y ;
run;
5. Full Join
It returns all rows from the left table and from the right table.
It returns all rows from the left table and from the right table.
![]() |
Full Join |
proc sql;
create table dummy as
select coalesce (x.ID,y.ID) as ID, coalesce (x.name,y.name) as name,height,weight
from a as x full join b as y
on x.id = y.id;
quit;
![]() |
Full Join with Coalesce |
Explanation
Since the above case is of type FULL JOIN, it returns all rows from the table (dataset) A and B.
Data dummy;
Merge A B;
by ID;
run;
By default, MERGE statement performs full join so IN variables are not required.
One to Many Relationship : Duplicate Values in Primary Key
![]() |
Join : One to Many Relationship |
SQL Join will return Cartesian Product if duplicate values are found in primary key (common column). In this example, it returns cartesian product of missing values in the "ID" column. Since dataset A has 3 missing values and dataset B has 1 missing value, there are 3 (3*1) missing values in the merged dataset.
Data Step MERGE statement will return the maximum number of missing values in the primary key in both the tables. In this case, it would return 3 missing values i.e. max(3,1).
Example 2 : One to Many Relationship
![]() |
PROC SQL : One to Many Relationship |
How about six rows of value 5 in the combined table?
When duplicates, PROC SQL returns cartesian product i.e. product of both the tables. In dataset A, we have 2 5s and 3 5s in dataset B. So, it returns (2x3 = 6) 5s in the combined table.
How to refer to permanent library in SQL Joins
PROC SQL;Related Article - Data Step Merge
Create table dummy as
Select * from readin.A as x left join readin.B as y
On x.ID = y.ID;
Quit;
Its Very Useful... Thanks a lot sharing..!
ReplyDeleteThank you for stopping by. Glad you found the article useful :-)
DeleteWell explained! Keep up the good work :-)
ReplyDeleteThank you for your appreciation :-)
DeleteReally useful :-)
ReplyDeleteThanks Deepanshu for sharing this valuable knowledge. Its really a treasure for beginners like me.
ReplyDeleteWell explained..Its really useful especially for beginners. Thank u for sharing.
ReplyDeletereally very helpful ! i have one doubt explain the ten procedures which were used in SQL?
ReplyDeleteVery Helpful and easy to understand
ReplyDeleteThanks!
Extremely helpful, thank you very much !
ReplyDeleteReally very helpful but I found a small typo. In the full join SQL code wont it be 'full join' instead of 'right join'?
ReplyDeleteThanks for pointing it out, typo now corrected!
DeleteLove your site...I'm glad that I found this site...
ReplyDeleteGlad you found it useful. Cheers!
DeleteExcellent! good explanation with codes . . . very useful!
ReplyDeletethank you so much!!
ReplyDeleteLoved your explanation..Very useful for SAS beginners..Keep going
ReplyDeletevery usefull information..
ReplyDeleteHi Deepanshu,
ReplyDeletein all the joins it is mentioned
(Common columns in both the data sets + uncommon columns from data set A + uncommon columns from data set B).
is this true????
If you include "*" keyword, it would select variables from both the tables.
Deletefull join code should be like this
ReplyDeletePROC SQL;
Create table dummy as
Select coalesce (x.ID,y.ID) as ID,
coalesce (x.name,y.name) as name,height,weight
from A as x full join B as y
on x.ID = y.ID;
Quit;
Thanks for pointing it out. I have corrected the code. Cheers!
Deletefull join Data Step code should be
ReplyDeleteData dummy;
Merge A (IN = X) B (IN=Y);
by ID;
if x=1 or y=1;
run;
Please let me know if I am wrong
DeleteBy default, MERGE Statement performs FULL JOIN so "if x=1 or y=1;" is not required. Please let me know incase any query(s) Thanks!
DeleteThe output record will be prepared either from the record of A or B, hence the statement "if x=1 or y=1;" will not be required to mention and it will be by default.
DeleteWell explained about joins. Very use full for the beginners.
ReplyDeleteDid you read the first paragraph 'Advantages of PROC SQL Joins over Data Step Merging'? That's answer of your first question. Second question is also discussed in the article. If you have duplicates in any of the tables, you would get cartesian product of the duplicate records.
ReplyDeleteWas very useful.!! Well explained.
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteWhat about Outer joins? How it will be prepared using PROC SQL?
ReplyDeleteThis is really great website. I'm glad I found it.
ReplyDeleteIt helpful for my understanding, Thank you
ReplyDeleteHow many datasets can be merged in proc sql (sas 9.4)?
ReplyDeleteI love your site to refer any concept in SAS or SQL
ReplyDeleteSuperb description about SAS
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteGreat, thanks for posting this.
ReplyDeleteexcellent explanation
ReplyDeleteHi Deepanshu,
ReplyDeleteI appreciate your work and thanks for creating a decent tool. If you'll mention the rules that would be great. Also add interesting scenarios bit logical concepts.
and Syntax of every function statement.
Thanks
I dont think ur explaination about inner join is accurate, ud better compare that with Proc SQL outter join.
ReplyDeleteWhy missing values are coming in right join but not in left join?
ReplyDeleteGreat Job!
ReplyDelete