This tutorial demonstrates how to speed up SAS code with Indexes.
Binary search is a performance improvement algorithm for searching a particular observation from a sorted variable. It works by continuously dividing in half of the total number of observations and search the value in the half of the list, until you got the value that you were looking for.
Example
3. Usage Level
If you use a variable frequently that is indexed, it makes sense as it improves the performance in terms of CPU time. But if you are creating an index for just a single usage, it's not a sensible idea to do it as it takes resources (CPU time, I/0 etc) to initially create an index. Hence, it's required to anticipate the usage of dataset with Index before creating it.
How to create Index with SAS
In SAS, there are several ways to create an index. It can be implemented with either of the following three options -
1. PROC DATASETS : Index
A simple Index can be created like below -
An composite index can be created like code below -
3. PROC SQL : Index
The syntax for creating a simple index with PROC SQL is as follows -
Performance Comparison
In this section of post, we are making comparison of filtering with and without Index to check the performance.
Let's create a sample for demonstration :
The following code would create three variables - k, custid, demog. The variable 'k' constitutes values ranging from 1 through 20 millions and the variable 'custid' would have same values as 'k' (just added 1) and the variable 'demog' is a categorical variable having 4 levels.
Filtering without Index
We are simply subsetting rows by few values in the variable 'custid'.
NOTE: DATA statement used (Total process time):
real time 3.47 seconds
cpu time 3.43 seconds
Key Points
SAS Index |
What is Index?
Indexes are not something technical or related to SAS programming. It is something we use everyday to make our life easy. For example, every employee in an organization has an employee ID which is unique. It's easy for HR /Admin team to find information about a particular employee. First Name or Last Name of employees are not unique so it's better to record information by unique ID. So, employee information is indexed by employee ID. Let's take few more examples - dictionary is alphabetically sorted. So alphabets are index in this case. There can be multiple indexes to find information. For example, books in a library are sorted by topics (Science) and sub-topics (Physics /Chemistry / Bio / Statistics).
In SAS, Index is used to store observations in an ascending order and later access them quickly from a variable. In simple words, it minimizes some steps of searching a particular value by telling SAS the nearest /exact location of the value you are searching for. Confused? Read the next section.
How Index works?
When Index is used, SAS runs a binary search algorithm on the data set.
Binary search is a performance improvement algorithm for searching a particular observation from a sorted variable. It works by continuously dividing in half of the total number of observations and search the value in the half of the list, until you got the value that you were looking for.
Example
Let's assume you have a variable CustomerID :
15, 20, 3, 16, 9, 17, 13
15, 20, 3, 16, 9, 17, 13
You are finding the information of customer having CustomerID equals to 17. See the steps below to find the value -
When to Use Index ?
1. Size of Subset Records
You should only use Index if you need to pull a small subset from a large SAS data set. See the definition of 'small' and 'large' in the table below -
2. Variable Consideration
It is recommended to index only those variables that have a high number of distinct values. For example, Customer ID as it is unique at customer level so it would have a high number of unique records. But the variable 'sex' would have only two distinct values so it would not be a good choice to index.
- First, sort the CustomerID variable. We will have these values - {3,9,13,15,16,17,20}
- Calculate the median (middle value) of the list i.e. 15.
- We would check whether search_value = median? Is 17 =15? No. 17 > 15.
- Ignore all the values that are less than or equal to 15 as 17 is a higher number than 15.
- Now, we need to search in the remaining list i.e. 16, 17, 20. The middle value is 17.
- Is 17 = 17? Yes. the value found.
When to Use Index ?
1. Size of Subset Records
You should only use Index if you need to pull a small subset from a large SAS data set. See the definition of 'small' and 'large' in the table below -
Subset Size | Will Indexing Improve? |
---|---|
1% - 15% | Definitely |
16% - 20% | Probably |
21% - 33% | Might Improve or Worse |
34% - 100% | Not Improve |
2. Variable Consideration
It is recommended to index only those variables that have a high number of distinct values. For example, Customer ID as it is unique at customer level so it would have a high number of unique records. But the variable 'sex' would have only two distinct values so it would not be a good choice to index.
3. Usage Level
If you use a variable frequently that is indexed, it makes sense as it improves the performance in terms of CPU time. But if you are creating an index for just a single usage, it's not a sensible idea to do it as it takes resources (CPU time, I/0 etc) to initially create an index. Hence, it's required to anticipate the usage of dataset with Index before creating it.
How to create Index with SAS
In SAS, there are several ways to create an index. It can be implemented with either of the following three options -
- PROC DATASETS
- INDEX = Data Step Option
- PROC SQL
1. PROC DATASETS : Index
A simple Index can be created like below -
proc datasets library=work nolist;
modify mydata;
index create custid;
quit;
Explanation :
- LIBRARY=WORK refers to the SAS temporary library that contains SAS data set 'mydata'
- NOLIST option hides the printing of the directory of SAS files in the SAS log and output window.
- MODIFY tells SAS we are creating an index in data set 'mydata'
- CUSTID in the 'Index Create' statement is the name of the variable for which we want to create an Index.
Index of two or more variables
It's called composite index when we create an index for two or more variables. In composite index, values of multiple variables are concatenated and form a single value which would be used for search specific values. The variables can be character or numeric or mixed (one character and the other one is numeric).
proc datasets library=mylibrary;
modify customermart;
index create names = (first last);
run;
In this case, 'names' is an index-name and it's created for two variables - first and last.
Only Unique Values
If you want to set condition that values for a variable must be unique, you can use UNIQUE option. For example, you know the customer ID would always be unique. When you use UNIQUE option, SAS would make sure there would not be any duplicate in simple or composite index. In composite index, it would check the uniqueness in combination of multiple variables. If someone would try to update the file with duplicates, it would throw an error.
2. INDEX = Data Set Option
We can create an index with dataset option. See the code below -
If you want to set condition that values for a variable must be unique, you can use UNIQUE option. For example, you know the customer ID would always be unique. When you use UNIQUE option, SAS would make sure there would not be any duplicate in simple or composite index. In composite index, it would check the uniqueness in combination of multiple variables. If someone would try to update the file with duplicates, it would throw an error.
proc datasets library=work nolist;
modify mydata;
index create custid / unique nomiss;
quit;
NOMISS Option : It does not mean the missing values cannot be added to the data set. It implies the missing values cannot be added to the index.
We can create an index with dataset option. See the code below -
data mydata (index=(custid / unique));
set mydata;
run;
data mydata (index=(names=(first last)));
set mydata;
run;
3. PROC SQL : Index
proc sql;
create index custid
on mydata;
quit;
Performance Comparison
In this section of post, we are making comparison of filtering with and without Index to check the performance.
Let's create a sample for demonstration :
The following code would create three variables - k, custid, demog. The variable 'k' constitutes values ranging from 1 through 20 millions and the variable 'custid' would have same values as 'k' (just added 1) and the variable 'demog' is a categorical variable having 4 levels.
data temp;
length demog $12.;
do k =1 to 20000000;
custid = k+1;
if mod(k,8)=0 then demog ='category i';
if mod(k,8)=1 then demog ='category ii';
if mod(k,8)=2 then demog ='category iii';
output;
end;
run;
Filtering without Index
We are simply subsetting rows by few values in the variable 'custid'.
data testing;NOTE: The data set WORK.TESTING has 4 observations and 3 variables.
set temp;
where custid in (5467620,225,2671899, 18000000);
run;
NOTE: DATA statement used (Total process time):
real time 3.47 seconds
cpu time 3.43 seconds
Filtering with Index
First, we are creating index with PROC DATASETS and subsetting data in the second section of the code.
First, we are creating index with PROC DATASETS and subsetting data in the second section of the code.
proc datasets library=work;modify temp;index create custid;quit;data testing2;set temp;where custid in (5467620,225,2671899, 18000000);run;
NOTE: The data set WORK.TESTING2 has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.09 seconds
cpu time 0.03 seconds
NOTE: DATA statement used (Total process time):
real time 0.09 seconds
cpu time 0.03 seconds
Result :
Does Index always speed up SAS code?
Answer is NO. It may even slow down the code if we have a variable that has a very few distinct/unique values. For example, there is a variable called 'Age Group' that contains only 5 distinct values ranging from 1 to 5. 1 refers to the smallest age-group (<18 years old) and 5 refers to the highest age-group (>55 years old). Suppose you need to search 2 in the variable 'AgeGroup'. If we perform indexing on the variable, it would run binary search algorithm which would calculate the middle value and compare it with the searching value. It works iteratively (repetitively). It would take more time than that of sequentially searching '2' in the variable. See the live example below -
Example : We are extracting 'category i' from the variable 'demog'.
Without Index
NOTE: The data set WORK.TESTING2 has 2500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 13.12 seconds
cpu time 13.07 seconds
With Index
NOTE: The data set WORK.TESTING has 2500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 19.99 seconds
cpu time 13.71 seconds
In this case, creating an index on a variable took more time than without having index on the variable.
Uses of IndexingCompare CPU time of both the codes and see Indexing resulted to more than 100 times faster code than without using Index.
Does Index always speed up SAS code?
Answer is NO. It may even slow down the code if we have a variable that has a very few distinct/unique values. For example, there is a variable called 'Age Group' that contains only 5 distinct values ranging from 1 to 5. 1 refers to the smallest age-group (<18 years old) and 5 refers to the highest age-group (>55 years old). Suppose you need to search 2 in the variable 'AgeGroup'. If we perform indexing on the variable, it would run binary search algorithm which would calculate the middle value and compare it with the searching value. It works iteratively (repetitively). It would take more time than that of sequentially searching '2' in the variable. See the live example below -
Example : We are extracting 'category i' from the variable 'demog'.
Without Index
data testing;
set temp;
where lowcase(demog) = 'category i';
run;
NOTE: The data set WORK.TESTING2 has 2500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 13.12 seconds
cpu time 13.07 seconds
With Index
proc datasets library=work;
modify temp;
index delete custid;
index create demog;
quit;
data testing2;
set temp;
where lowcase(demog) = 'category i';
run;
NOTE: The data set WORK.TESTING has 2500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 19.99 seconds
cpu time 13.71 seconds
In this case, creating an index on a variable took more time than without having index on the variable.
1. Filtering rows with WHERE statement results in better performance of SAS Code when index is already created on the variable.
Important Point : SAS by default checks whether to use an index or not when you use WHERE statement.
2. Data need not to be prior sorted before running BY processing if index is turned on. For example, two datasets need to be sorted before using MERGE statement. But if index is already created on the datasets, you need to sort the data sets before MERGING. See the example below -
Important Point : SAS by default checks whether to use an index or not when you use WHERE statement.
2. Data need not to be prior sorted before running BY processing if index is turned on. For example, two datasets need to be sorted before using MERGE statement. But if index is already created on the datasets, you need to sort the data sets before MERGING. See the example below -
data a_index (index=(id));
set a;
run;
data b_index (index=(id));
set b;
run;
data final;
merge a_index(in=a) b_index(in=b);
by id;
if a=b;
run;
CAUTION : If you are creating index just for merging, it's not a good idea as Index increase resources which means increasing CPU processing time.
Important Point : Merging and Indexing
If you are merging/joining a small table with the large indexed table, the indexing would result a good performance. By 'small' table, it means at most 15% of the large indexed table. If you are merging two large indexed table, it might reduce the performance as it takes resources to create an index.3. The KEY= option in the SET statements allows you to perform efficient merging.
Key Points
1. IF statement does not use an index. Whereas WHERE statement makes use of it.
2. You can run PROC CONTENTS to see the names of the variables which are used for indexing.
3. You can delete indexes by the following ways :
- PROC DATASETS : Use command 'index delete index-name;'
- PROC SQL : Use command 'drop index index-name from dataset-name;'
Endnotes
SAS indexing capabilities can increase the performance of your SAS code which leads to a significant time saving. But we also need to consider the points listed above wherein it can reduce the performance or might not improve it. It's important to remember the point that indexing increases in the size of the data and it takes time to create an Index. Hence, we should create an index only if the usage of the key variable on dataset is very high.
Just go to the end of this group of tutorials, and I have learnt so much!! Thank you so much for all of your hard work doing this :)
ReplyDeleteI have learnt so much knowledge reading listen data concepts.Thanks to listen data.
ReplyDeleteVery impressive. Information is nice & crisp.
ReplyDelete