How Data Step and PROC SQL Works

This tutorial explains the steps to process data in SAS.

How Data Step Works

The processing of data is in 2 steps :

1. Compilation Phase

Step I : Syntax Checking

SAS scans each statement in the DATA step and check syntax errors, such as missing semicolons and invalid statements.

Step II : Creating Input Buffer

If you read in a raw data set such as txt or csv file, the input buffer is created. The input buffer is used to hold raw data. If you read in a SAS data set, the input buffer will not be created.

Step III : Creating Program Data Vector (PDV)

1. SAS creates a program data vector (memory on your system) containing the automatic variables _N_ and _ERROR_.

How PDV works

_N_ = 1 indicates the first observation is being processed, _N_ = 2 indicates the second observation is being processed, and so on.

The automatic variable _ERROR_ with values of 1 or 0, if it is equal to 1 signals the data error of the currently-processed observation, such as reading the data with an incorrect data type.

2. In addition to the two automatic variables, there is one space allocated for each of the variables in the input statement (reading data).

3. SAS also adds a position to the program data vector for any variables that are created in the DATA step. See the program below. The newly created FinalScore variable is derived from Score.

Data temp2;
set temp;
FinalScore = Score + 25; Run;

Newly Created Variable

4. If the variables specified in the DROP statement, it will never be written to the output data set.

5. At the end of the compilation phase, SAS makes the descriptor portion of the SAS data set which includes the data set name, the number of observations, and the number, names, and attributes of variables.

2. Execution Phase

Sequential Processing (Iterative)

The DATA step executes once for each observation in the input data set. Suppose dataset consists of 500 records, SAS will execute 500 times.

1. At the beginning of the execution phase, SAS sets all of the data set variables in the program data vector to missing:

Execution Phase Step I

Variables that you specify in a RETAIN statement are not reset to missing values.

2. The SET statement reads the first observation from the input data set and writes the values to the program data vector.

Exceution Phase Step II

3. Compute the first value for the derived variable, FinalScore.

Execution Phase Step III

4. At the end of the first iteration of the DATA step, the values in the program data vector are written to the output data set temp2 as the first observation.

Loop until all of the observations are read

The value of the automatic variable _N_ is increased to 2. The values from the second observation are written to the program data vector. Repeat above Steps 2, 3 and 4. It continues until all of the observations are read.

How PROC SQL Works

PROC SQL is a simultaneous process for all the observations.

Step I : Syntax Checking

SAS scans each statement in the SQL procedure and check syntax errors, such as missing semicolons and invalid statements.

Step II : SQL Optimizer

SQL optimizer scans the query inside the statement. The SQL Optimizer decides how the SQL query should be executed in order to minimize run time. The Optimizer examines submitted SQL code and characteristics of the SAS system and then creates efficient executable statements for the submitted query. The created code can be quite complicated and often involves the creating, sorting and merging of many temporary files as well as the trimming of variables and observation at times that will minimize run time.

Step III : Load into Data Engine

Any tables in the FROM statement are loaded into the data engine where they can then be accessed in memory.

Step IV : Code and Calculations are executed

Step V : Final Table is created in memory

Step VI : Final Table sent to the output table described in the CREATE TABLE statement.

Main Distinction : Data Step vs. Proc SQL

DATA step is a sequential process (one at a time for each observation), whereas the SQL procedure is a simultaneous process for all the observations.

Efficiency : DATA STEP vs. PROC SQL

The SQL procedure performed better with the smaller datasets (less than approx. 100 MB) whereas the data step excelled with the larger ones (more than approx. 100 MB).

It is because the DATA step handles each record sequentially so it never uses a lot of memory, however, it takes time to process one at a time. So with a smaller dataset, the DATA step is going to take more time sending each record through.

With the SQL procedure, everything is loaded up into memory at once. By doing this, the SQL procedure can process small datasets rather quickly since everything is available in memory. Conversely, when you move to larger datasets, your memory can get bogged down which then leads to the SQL procedure being a little bit slower compared to the DATA step which will never take up too much memory space.

If you need to connect directly to a database and pull tables from there, then use PROC SQL.

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

While I love having friends who agree, I only learn from those who don't
Let's Get Connected Email LinkedIn

Post Comment 18 Responses to "How Data Step and PROC SQL Works"

UnknownMay 30, 2016 at 9:22 AM
Excellent article.... really helpful. Explained a really complex process in such an easy manner... great work.. awesome site 👍
UnknownJuly 2, 2016 at 9:46 PM
Explained in simple n effective manner
UnknownSeptember 4, 2016 at 8:05 AM
can you explain proc datasets procedure with detailed?
AnonymousJanuary 13, 2017 at 11:07 AM
loved it.can learn a lot from you.suggest good books,articles,blogs,websites like this to learn more and more.thank you so much for writing sas tutorials.
AnonymousJanuary 15, 2017 at 7:29 AM
Excellent. Thanks a ton....Keep up good work.
UnknownJanuary 21, 2017 at 8:42 AM
Nice article.. Thanks
AnonymousJanuary 25, 2017 at 7:10 AM
Loved it
AryaMarch 4, 2017 at 9:01 AM
Excellent, Really learn a lot. Thanks for sharing the knowledge. Good work.
UnknownMay 24, 2017 at 11:09 AM
vry nicely explained, thanks deepanshu..!!!
AnonymousJuly 11, 2017 at 1:41 PM
I have read many articles to explain how data step and proc sql works. This is the easiest ways to understand a really complex processes. Excellent work,very resourceful site ��
AnonymousOctober 9, 2018 at 10:16 PM
It's my fav site... Thanks a million for ur great work... go ahead... Regards
MithunJanuary 28, 2019 at 9:03 PM
Hi. I need to pull Global Address List from outlook using SAS code. Please help me in this.
rajeshJanuary 7, 2020 at 4:41 PM
At the end of the compilation phase, SAS makes the descriptor portion of the SAS data set which includes the number of observations and etc.
how it knows number of observations in compilation phase??
UnknownDecember 25, 2020 at 2:16 AM
These notes helped me a lot in learning SAS.
UnknownJanuary 7, 2021 at 12:28 AM
THANK YOU SOOOO MUCH THIS HELPING ME SO MUCH ,KEEPING IMPROVING UR SITE
AnonymousFebruary 1, 2021 at 6:20 AM
Awesome job mate!