This tutorial explains various methods to combine or append datasets in SAS, along with examples.
In SAS, datasets can be combined using PROC APPEND, SET statement and PROC SQL UNION methods. For example, you have multiple human records files from various departments of your company and you are asked to combine them so that there would be a single file containing information of all the departments.
Let's create two sample datasets named 'Dataset1' and 'Dataset2' to explain examples in this tutorial.
Data Dataset1; Input Name $ Score; cards; David 30 Ram 25 Sam 74 Sandy 36 ; run; Data Dataset2; Input Name $ Score; cards; Ken 36 Obama 74 Raj 30 Shyam 25 ; run;
1. Concatenate Datasets
In this method, we specify the datasets in the SET statement that need to be combined.
Data Stack; Set Dataset1 Dataset2; Run;
Note : The stacked data set is not sorted because we have not used BY statement.
2. Sorted Stacked Dataset
Here we are concatenating sorted datasets into one sorted dataset. You do this by using a SET statement and a BY statement in a DATA step.
Note : Make sure data sets are sorted before using BY statement. Datasets can be sorted with PROC SORT.
proc sort data = dataset1; by name; run; proc sort data = dataset2; by name; run;
Data Stack1;
Set Dataset1 Dataset2;
By Name;
Run;
Output
3. PROC SQL for Appending Datasets
OUTER UNION CORR keyword is used in PROC SQL to concatenate two data sets. The CORR tells SAS to append data sets by name (not by column position).
PROC SQL;
CREATE TABLE stackk AS
SELECT *
FROM Dataset1
OUTER UNION CORR
SELECT *
FROM Dataset2
ORDER BY Name;
QUIT;
The output of PROC SQL is same as the output of previous example.
4. PROC APPEND
In PROC APPEND, the data set specified in BASE= option refers to a data set in which other data set would be added or appended. In log, it writes 'Appending Dataset2 to Dataset1' for our example. After running this code, the dataset1 contains 8 records ( 4 from the original 'dataset1' file and 4 from the dataset2)
proc append base=dataset1 data=dataset2; run;
If you want to append data and store it to another dataset, you can run PROC APPEND twice to do it. In the first PROC APPEND, it would create a base table ALLDATA (as specfied in the code below). If the dataset ALLDATA does not already exist, it would be automatically created by SAS.
proc append base=alldata data=dataset1; run; proc append base=alldata data=dataset2; run;
PROC APPEND is faster than using the SET statement or PROC SQL UNION because it only reads the dataset being added (specified by the 'DATA=' option).
PROC APPEND is most useful when you use it in a macro. For example, you create a macro in which there is a loop for calculation of some metrics of multiple variables. In each iteration, it returns a data set which needs to be appended with the output of subsequent iterations so that once loop completes all the iterations, we would have complete data set.
Suppose you have two datasets having same variable names but the length of the common variable is different, It would throw a warning and it would not append datasets. To workaround this issue, we can use FORCE option to append the data sets.
proc append base=dataset1 data=dataset2 force; run;
5. Uses of Multiple SET Statements
Multiple SET statements combine datasets by adding column from the other dataset. Unlike INNER/LEFT Joins, it does not require any primary or unique key to join two datasets.
Data Dataset3; Input Section $; cards; A B C D ; run;
Data Stack3; Set Dataset1; Set Dataset3; Run;
For example, if we use multiple SET statements instead of specifying multiple datasets in a single SET statement, the result would be completely different. See the output shown below:
Data Stack3; Set Dataset1; Set Dataset2; Run;
Warning : It overwrites data.
Multiple SET statement can also produce cartesian product of two tables. It can be done with POINT= option.
Data Stack2; set Dataset1; do i=1 to num; set Dataset3 nobs=num point=i; if i=i then output; end; run;
Hi,
ReplyDeletewhat is the effect if we don't sort the data sets before appending?
If you are using BY statement in merging but dont sort the data then the code will error out.
DeleteWhat does i=i do in the last code? Won't it always result in TRUE?
ReplyDelete