Speed up SAS Code with Index

Deepanshu Bhalla 3 Comments
This tutorial demonstrates how to speed up SAS code with Indexes.
SAS Index

What is Index?

Indexes are not something technical or related to SAS programming. It is something we use everyday to make our life easy. For example, every employee in an organization has an employee ID which is unique. It's easy for HR /Admin team to find information about a particular employee. First Name or Last Name of employees are not unique so it's better to record information by unique ID. So, employee information is indexed by employee ID. Let's take few more examples - dictionary is alphabetically sorted. So alphabets are index in this case. There can be multiple indexes to find information. For example, books in a library are sorted by topics (Science) and sub-topics (Physics /Chemistry / Bio / Statistics).

In SAS, Index is used to store observations in an ascending order and later access them quickly from a variable. In simple words, it minimizes some steps of searching a particular value by telling SAS the nearest /exact location of the value you are searching for. Confused? Read the next section.

How Index works?

When Index is used, SAS runs a binary search algorithm on the data set.

Binary search is a performance improvement algorithm for searching a particular observation from a sorted variable. It works by continuously dividing in half of the total number of observations and search the value in the half of the list, until you got the value that you were looking for.

Example


Let's assume you have a variable CustomerID :
15, 20, 3, 16, 9, 17, 13
You are finding the information of customer having CustomerID equals to 17. See the steps below to find the value -
  1. First, sort the CustomerID variable. We will have these values - {3,9,13,15,16,17,20}
  2. Calculate the median (middle value) of the list i.e. 15.
  3. We would check whether search_value = median? Is 17 =15? No. 17 > 15.
  4. Ignore all the values that are less than or equal to 15 as 17 is a higher number than 15.
  5. Now, we need to search in the remaining list i.e. 16, 17, 20. The middle value is 17.
  6. Is 17 = 17? Yes. the value found.
If you do not create an index SAS would search 17 sequentially in the whole list. The above is a simple example of a few values. If you have millions of observations, it would take a hell lot of time to search a particular value sequentially in the variable.

When to Use Index ?

1. Size of Subset Records

You should only use Index if you need to pull a small subset from a large SAS data set. See the definition of 'small' and 'large' in the table below -
Subset Size Will Indexing Improve?
1% - 15% Definitely
16% - 20% Probably
21% - 33% Might Improve or Worse
34% - 100% Not Improve

2. Variable Consideration

It is recommended to index only those variables that have a high number of distinct values. For example, Customer ID as it is unique at customer level so it would have a high number of unique records.  But the variable 'sex' would have only two distinct values so it would not be a good choice to index.


3. Usage Level

If you use a variable frequently that is indexed, it makes sense as it improves the performance in terms of CPU time. But if you are creating an index for just a single usage, it's not a sensible idea to do it as  it takes resources (CPU time, I/0 etc) to initially create an index. Hence, it's required to anticipate the usage of  dataset with Index before creating it.


How to create Index with SAS

In SAS, there are several ways to create an index. It can be implemented with either of the following three options -

  1. PROC DATASETS
  2. INDEX = Data Step Option
  3. PROC SQL

1. PROC DATASETS : Index

A simple Index can be created like below -
proc datasets library=work nolist;
modify mydata;
index create custid;
quit;
Explanation :
  1. LIBRARY=WORK refers to the SAS temporary library that contains SAS data set 'mydata'
  2. NOLIST option hides the printing of the directory of SAS files in the SAS log and output window.
  3. MODIFY tells SAS we are creating an index in data set 'mydata'
  4. CUSTID in the 'Index Create' statement is the name of the variable for which we want to create an Index.

Index of two or more variables

It's called composite index when we create an index for two or more variables. In composite index, values of multiple variables are concatenated and form a single value which would be used for search specific values. The variables can be character or numeric or mixed (one character and the other one is numeric).
proc datasets library=mylibrary;
modify customermart;
index create names = (first last);
run;
In this case, 'names' is an index-name and it's created for two variables - first and last.

Only Unique Values

If you want to set condition that values for a variable must be unique, you can use UNIQUE option. For example, you know the customer ID would always be unique. When you use UNIQUE option, SAS would make sure there would not be any duplicate in simple or composite index. In composite index, it would check the uniqueness in combination of multiple variables. If someone would try to update the file with duplicates, it would throw an error.
proc datasets library=work nolist;
modify mydata;
index create custid / unique nomiss;
quit;
NOMISS Option : It does not mean the missing values cannot be added to the data set. It implies the missing values cannot be added to the index.


2. INDEX = Data Set Option

We can create an index with dataset option. See the code below -
data mydata (index=(custid / unique));
set mydata;
run;
An composite index can be created like code below -
data mydata (index=(names=(first last)));
set mydata;
run;

3. PROC SQL : Index

The syntax for creating a simple index with PROC SQL is as follows -
proc sql;
create index custid
on mydata;
quit;

Performance Comparison

In this section of post, we are making comparison of filtering with and without Index to check the performance.

Let's create a sample for demonstration :

The following code would create three variables - k, custid, demog. The variable 'k' constitutes values ranging from 1 through 20 millions and the variable 'custid' would  have same values as 'k' (just added 1) and the variable 'demog' is a categorical variable having 4 levels.
data temp;
length demog $12.;
do k =1 to 20000000;
custid = k+1;
if mod(k,8)=0 then demog ='category i';
if mod(k,8)=1 then demog ='category ii';
if mod(k,8)=2 then demog ='category iii';
output;
end;
run;

Filtering without Index

We are simply subsetting rows by few values in the variable 'custid'.
data testing;
set temp;
where custid in (5467620,225,2671899, 18000000);
run;
NOTE: The data set WORK.TESTING has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 3.47 seconds
cpu time 3.43 seconds


Filtering with Index

First, we are creating index with PROC DATASETS and subsetting data in the second section of the code.
proc datasets library=work;
modify temp;
index create custid;
quit;

data testing2;
set temp;
where custid in (5467620,225,2671899, 18000000);
run;
NOTE: The data set WORK.TESTING2 has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.09 seconds
      cpu time            0.03 seconds


Result : 
Compare CPU time of both the codes and see Indexing resulted to more than 100 times faster code than without using Index.

Does Index always speed up SAS code?

Answer is NO. It may even slow down the code if we have a variable that has a very few distinct/unique values. For example, there is a variable called 'Age Group' that contains only 5 distinct values ranging from 1 to 5. 1 refers to the smallest age-group (<18 years old) and 5 refers to the highest age-group (>55 years old). Suppose you need to search 2 in the variable 'AgeGroup'. If we perform indexing on the variable, it would run binary search algorithm which would calculate the middle value and compare it with the searching value. It works iteratively (repetitively). It would take more time than that of sequentially searching '2' in the variable. See the live example below -

Example : We are extracting 'category i' from the variable 'demog'.

Without Index
data testing;
set temp;
where lowcase(demog) = 'category i';
run;

NOTE: The data set WORK.TESTING2 has 2500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           13.12 seconds
      cpu time            13.07 seconds


With Index
proc datasets library=work;
modify temp;
index delete custid;
index create demog;
quit;

data testing2;
set temp;
where lowcase(demog) = 'category i';
run;

NOTE: The data set WORK.TESTING has 2500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           19.99 seconds
      cpu time            13.71 seconds


In this case, creating an index on a variable took more time than without having index on the variable.

Uses of Indexing

1. Filtering rows with WHERE statement results in better performance of SAS Code when index is already created on the variable.

Important Point : SAS by default checks whether to use an index or not when you use WHERE statement.

2. Data need not to be prior sorted before running BY processing if index is turned on. For example,  two datasets need to be sorted before using MERGE statement. But if index is already created on the datasets, you need to sort the data sets before MERGING. See the example below -
data a_index (index=(id));
set a;
run;
data b_index (index=(id));
    set b;
 run;
data final;
merge a_index(in=a) b_index(in=b);
by id;
if a=b;
run;

CAUTION : If you are creating index just for merging, it's not a good idea as Index increase resources which means increasing CPU processing time.

Important Point : Merging and Indexing
If you are merging/joining a small table with the large indexed table, the indexing would result a good performance. By 'small' table, it means at most 15% of the large indexed table. If you are merging two large indexed table, it might reduce the performance as it takes resources to create an index.
3. The KEY= option in the SET statements allows you to perform efficient merging.

Key Points


1. IF statement does not use an index. Whereas WHERE statement makes use of it.

2. You can run PROC CONTENTS to see the names of the variables which are used for indexing.

3.  You can delete indexes by the following ways : 
  • PROC DATASETS : Use command 'index delete index-name;'
  • PROC SQL : Use command 'drop index index-name from dataset-name;'

Endnotes
SAS indexing capabilities can increase the performance of your SAS code which leads to a significant time saving. But we also need to consider the points listed above wherein it can reduce the performance or might not improve it. It's important to remember the point that indexing increases in the size of the data and it takes time to create an Index. Hence, we should create an index only if the usage of the key variable on dataset is very high.
Related Posts
Spread the Word!
Share
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

Post Comment 3 Responses to "Speed up SAS Code with Index"
  1. Just go to the end of this group of tutorials, and I have learnt so much!! Thank you so much for all of your hard work doing this :)

    ReplyDelete
  2. I have learnt so much knowledge reading listen data concepts.Thanks to listen data.

    ReplyDelete
  3. Very impressive. Information is nice & crisp.

    ReplyDelete
Next → ← Prev