Lesson 1 : PROC SQL Tutorial for Beginners (20 Examples)

Live Online Training : SAS Programming with 50+ Case Studies

- Explain Programming Concepts in Simple English
- Live Projects
- Scenario Based Questions
- Job Placement Assistance
- Get 10% off till Sept 25, 2017
- Batch starts from October 8, 2017

This tutorial is designed for beginners who want to get started with PROC SQL. Also, it will attempt to compare the techniques of DATA Step and PROC SQL.

TERMINOLOGY
The difference between SAS and SQL terminology is shown in the table below.
SAS vs. SQL
SYNTAX
PROC SQL;
 SELECT column(s)
 FROM table(s) | view(s)
 WHERE expression
 GROUP BY column(s)
 HAVING expression
 ORDER BY column(s);
QUIT;
The SELECT statement must be specified in the following order: 
1.SELECT 
2.FROM 
3.WHERE 
4.GROUP BY 
5.HAVING 
6.ORDER BY 

Note:  Only the SELECT and FROM clauses are required. All the other clauses are optional.

Tip to Memorize SQL Query Order

EXPLANATION
PROC SQL: calls the SQL procedure
SELECT: specifies the column(s) (variables) to be selected
FROM: specifies the table(s) (data sets) to be queried
WHERE: subsets the data based on a condition
GROUP BY: classifies the data into groups based on the specified column(s)
ORDER BY: sorts the resulting rows (observations) by the specified column(s)
QUIT: ends the PROC SQL procedure

We are going to look at the difference between Non-SQL Base SAS and PROC SQL.

SAS Functions vs. SQL Functions
Important Terminology
You would hear the word 'schema' from SQL programmers.  A permanent library in SAS is the same as a 'schema' in Oracle and 'database' in Teradata.
PROC SQL STATEMENTS
1. Selecting all variables from the data set
Asterisk (*) is used to select all columns (variables) in the order in which they are stored in the table.
Outdata is the table (data set) from which we need to select the columns (variables) . It is stored in SASUSER library.
To display the list of columns to the SAS log, use FEEDBACK option in the PROC SQL statement

The SAS log is shown below:


2. Selecting specific variables from the data set
In the SELECT clause, multiple columns are separated by commas.
In the SELECT clause, Weight and Married columns (variables) are specified so that we can select them from OUTDATA table (data set).


3. Limiting the number of rows
Suppose you want to limit the number of rows (observations) that PROC SQL displays, use the OUTOBS= option in the PROC SQL statement.

4. Renaming a variable in output 
Suppose you want to rename a variable, use the column alias AS option in the PROC SQL statement.
The variable name has been renamed from married to marriage.


5. Creating a new variable
Suppose you want to create a new variable that contains calculation.

A new variable has been created and named newweight which is calculated on the basis of the existing variable weight.


6. Referring to a previously calculated variable
The keyword CALCULATED is used to refer a previously calculated variable.

7. Removing duplicate rows
The keyword DISTINCT is used to eliminate duplicate rows (observations) from your query results.

In the following program, we are asking SAS to remove all those cases where in duplicates exist on combination of both the variables - weight and married.
proc sql;
select DISTINCT weight, married 
from sasuser.outdata;
quit; 
The DISTINCT * implies cases having same values in all the variables as a whole would be removed.
proc sql;
select DISTINCT *
from sasuser.outdata;
quit; 

8. Labeling and formatting variables
SAS-defined formats can be used to improve the appearance of the body of a report. You can also label the variables using LABEL keyword.

9. Sorting data
The ORDER BY clause returns the data in sorted order.

ASC option is used to sort the data in ascending order. It is the default option.
DESC option is used to sort the data in descending order


10. Subsetting data with the WHERE clause
Use the WHERE clause with any valid SAS expression to subset data.

List of conditional operators :
1. BETWEEN-AND
The BETWEEN-AND operator selects within an inclusive range of  values.
Example : where salary between 4500 and 6000;

2. CONTAINS or ?
The CONTAINS or ? operator  selects observations by searching for a specified set of characters within the values of a character variable
Example : where firstname contains ’DE’;
OR   where firstname ? ’DE’;

3. IN
The IN operator selects from a list of fixed values.
Example : where state = 'NC' or state = 'TX';
The easier way to write the above statement would be to use the IN operator
where state IN ('NC','TX');
4. IS MISSING or IS NULL
The IS MISSING or IS NULL operator selects missing values.
Example : where dateofbirth is missing
OR  where dateofbirth is null

5. LIKE

The LIKE Operator is used to select a pattern.





Important Point :
The WHERE clause can reference a previously calculated variable in two ways-
1. Use CALCULATED keyword.
2. Repeat the calculation in the WHERE clause.

Method I :
PROC SQL;
 SELECT NAME, (WEIGHT * .01) AS NEWWEIGHT
 FROM HEALTH
 WHERE CALCULATED NEWWEIGHT > 5;
QUIT;
Method II :
PROC SQL;
 SELECT NAME, (WEIGHT * .01) AS NEWWEIGHT
 FROM HEALTH
 WHERE (WEIGHT * .01) > 5;
QUIT;

11. Multiple Conditions / Criteria

The END is required when using the CASE.

The following operators can be used in CASE expression:
• All operators that IF uses (= , <, >, NOT, NE, AND, OR, IN, etc)
• BETWEEN AND
• CONTAINS or ‘?’
• IS NULL or IS MISSING
• = *
• LIKE

12. Aggregating or summarizing data

Use GROUP BY clause to summarize data. Summary functions are used on the SELECT statement to produce summary for each of the analysis variables.


The summary functions available are listed below:
  1. AVG/MEAN
  2. COUNT/FREQ/N 
  3. SUM
  4. MAX 
  5. MIN 
  6. NMISS 
  7. STD 
  8. VAR
  9. T (t value)
  10. USS (Uncorrelated Sum of Square)
  11. CSS (Correlated Sum of Square)
  12. RANGE
13. Subsetting data in the groups
In order to subset data when grouping is in effect, the HAVING clause must be used. The variable specified in having clause must contain summary statistics.

Important Point -
The WHERE clause cannot be used with the GROUP BY clause. To subset data with the GROUP BY clause you must use HAVING clause.
14. Creating a new data set as output

The CREATE TABLE statement can be used to create a new data set as output instead of a report produced in output window.

SYNTAX
PROC SQL;
 CREATE TABLE table-name AS
 SELECT column(s)
 FROM table(s) | view(s)
 WHERE expression
 GROUP BY column(s)
 ORDER BY column(s);
 QUIT; 

15. Limiting the number of rows in the new created data set
Suppose you want to limit the number of rows (observations) that PROC SQL produces in the data set, use the INOBS= option in the PROC SQL statement.


16. Counting unique values by a grouping variable
Suppose you want to calculate number of students who have got same scores by their college names.

You can use PROC SQL with COUNT(DISTINCT variable_name) to determine the number of unique values for a column.
PROC SQL;
CREATE TABLE TEST1 as
SELECT college_names,count(distinct score) AS unique_count
FROM test
GROUP BY college_names;
QUIT;

17. Count the number of missing values

You can use NMISS() function to compute the number of missing values in a variable. The COUNT() function returns the number of non-missing values in a variable.

data temp;
input id;
cards;
1
2
.
4
5
.
;
run;
proc sql;
select nmiss(id) as N_missings,
count(id) as N,
calculated N_missings + calculated N as total
from temp;
quit;
Proc SQL : Number of Missing Values

How to refer to a calculated variable

The keyword CALCULATED is used to refer to a newly created variable for further calculation. In this case, we have used CALCULATED to sum 'N_MISSINGS' and 'N' variables.


18. KEEP and DROP some variables

Suppose you need to keep all the variables from SASHELP.CARS except variables 'MODEL' and 'MAKE'. The DROP= option is used to drop these two variables. Similarly, we can use KEEP= option to keep specific variables. These DROP= and KEEP= Options are not native SQL language. It only works in SAS.
proc sql;
create table saslearning (drop= make model) as
select * from sashelp.cars;
quit;

19. Delete Rows from a Table

You can use DELETE FROM  statement to remove records (rows) from a dataset.
proc sql;
delete from mydata
where var1 > 20;
quit;
In this case, we are deleting all records having value greater than 20 in var1 from mydata dataset.

20. Sub Query

Find employee IDs who have records in table file1 but not in table file2.
Proc SQL;
Select ID from file1
Where ID not in (select ID from file2);
Quit;
21. Sub Query - Part II
Find employee IDs whose age is in the average age +/- 10 years.
Proc SQL;
Select id from file1
where age between (select Avg(age) from file1) - 10 and
(select avg(age) from file1)+10;
Quit;

22. Sub Query - Part III
proc sql;
select Name, Grade, Teacher,
Case
When Student_ID in
(select Student_ID from Tests where Score lt 70) then 'Failed one or more tests'
else 'Passed all tests'
end as Test_Results
from Students;
quit;

Next Lesson : PROC SQL Joins

Proc SQL Tutorials : 15 Proc SQL Tutorials


SAS Tutorials : 100 Free SAS Tutorials

About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has close to 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like retail and commercial banking, Telecom, HR and Automotive.


While I love having friends who agree, I only learn from those who don't.

Let's Get Connected: Email | LinkedIn

Get Free Email Updates :
*Please confirm your email address by clicking on the link sent to your Email*

Related Posts:

33 Responses to "Lesson 1 : PROC SQL Tutorial for Beginners (20 Examples)"

  1. Well Done Deepanshu, I was almost new to SQL.
    But Now I can say that atleast I have learnt basic steps and statements for PROC SQL.
    Could you be so kind to share codes for joints as well.

    ReplyDelete
    Replies
    1. Check out this article : http://www.listendata.com/2014/06/proc-sql-merging.html

      Delete
  2. Excellent information! Thanks.

    ReplyDelete
  3. This is really wonderful.
    It is better, let the students try then provide the answers.

    ReplyDelete
  4. VERY NICE AND HELPFUL MATERIAL, HOWEVER I SEE ONE CORRECTION, WHERE CLUASE CAN REFER A COMPUTED VARIABLE USING THE CALCULATED KEYWORD.

    PLEASE CORRECT ME IF MY UNDERSTANDING IS WRONG.

    ReplyDelete
    Replies
    1. Yes, your understanding is correct. We can use CALCULATED keyword in WHERE clause. Please make a note that the CALCULATED keyword in WHERE clause is not a standard SQL. It’s SAS SQL, which is unique to SAS. Hence i don't recommend to people who are novice to SQL programming.

      Delete
  5. nice to understand...its very use full for freshers...

    ReplyDelete
  6. very informative post.Easy to understand for learning SQL.

    ReplyDelete
  7. very nice presentation on sql

    ReplyDelete
  8. Replies
    1. Check out this article - http://www.listendata.com/2014/08/download-and-install-free-version-sas.html

      Delete
  9. Sincerely admiring ur effort to share knowledge to us. Keep it up sir ji...

    ReplyDelete
    Replies
    1. Thank you for your appreciation. Cheers!

      Delete
  10. so useful information to have an idea about sql. it also helpful for quick reference toooo.

    ReplyDelete
  11. I am learning SAS through this. Excellent info for beginners

    ReplyDelete
  12. Thank you so much for this exposition and explanation. Well done!!!

    ReplyDelete
  13. Hi Deepanshu could u tell me from where I can get sasuser.outdata dataset
    outdata (dataset I mean)

    ReplyDelete
  14. I think you should add screenshots of program and their output

    ReplyDelete
  15. I think you should add screenshots of program and their output

    ReplyDelete
  16. Screenshot would have been further ease the understanding

    ReplyDelete
  17. Great lesson! Thank you so much!

    ReplyDelete
  18. proc sql;
    select distinct*
    from sashelp.class;
    quit;

    in this code it is not removing duplicate values in the whole dataset.why?

    ReplyDelete
    Replies
    1. It is because no duplicate record is found in this dataset. If you intend to remove duplicates based on a variable, you can use the code below -
      proc sql;
      select distinct age
      from sashelp.class;
      quit;

      Delete
  19. Thank you for your post! Very useful and helpful information for a beginner like me.

    ReplyDelete
  20. hi... i am not getting difference between inobs and outobs even when i am trying both the output is same...plz help...

    ReplyDelete
  21. This comment has been removed by the author.

    ReplyDelete
  22. Thank you for this great tutorial. Can I practice these Proc SQL statements/code with a regular SAS dataset or do I need to use something else? Thanks!

    ReplyDelete
  23. How i am unable to get SAS data set used in this exercise.. Could you please help me..?

    ReplyDelete
  24. Difference between inobs and outobs

    ReplyDelete
  25. Its very helpful. Thank you Admin.

    ReplyDelete

Next → ← Prev