SAS: Split Data into Training and Test Datasets

This tutorial explains the various methods to split your data into training and test (or validation) datasets in SAS.

Splitting data into training and test datasets is crucial for developing, evaluating, and improving predictive models. The training dataset is used to build and train your predictive model. After training, the model's performance needs to be evaluated on unseen data. This is where the test dataset comes in. It helps you identify and address issues such as overfitting, which occurs when a model performs exceptionally well on the training data but poorly on unseen (test) data.

Method 1: Simple Random Sampling using PROC SURVEYSELECT

By default, PROC SURVEYSELECT performs simple random sampling. The following code puts 70% of the data in the training dataset and the remaining 30% in the test dataset. It first draws a simple random sample from the sashelp.heart dataset at a 70% sampling rate. The outall option keeps every observation in the output dataset and adds a selected flag (1 for sampled observations, 0 for the rest). The data step then writes the selected observations to heart_train and the non-selected observations to heart_test.

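/* Flag 70% of the rows as selected; outall keeps the non-selected rows too */
proc surveyselect data=sashelp.heart rate=0.7 outall out=heart2 seed=1234;
run;

/* Route flagged rows to training and the rest to test */
data heart_train heart_test;
    set heart2;
    if selected = 1 then output heart_train;
    else output heart_test;
    drop selected;
run;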

seed=1234: This sets the random number seed for reproducibility. It ensures that the same split is generated in the random sampling every time you run the code.

To validate the split, you can run PROC FREQ to see the number of observations in each of the two datasets along with the distribution of the dependent variable.

proc freq data=heart_train;
table status;
run;

proc freq data=heart_test;
table status;
run;
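
As a compact alternative to the two PROC FREQ steps, the comparison can also be made in a single cross-tabulation. This is only a sketch, and it assumes the heart2 dataset created by the PROC SURVEYSELECT step above (with its selected flag) is still available. The nocol and nopercent options suppress the column and overall percentages so that only the row percentages remain, which show the distribution of status within each subset.

```sas
/* Assumes heart2 from the PROC SURVEYSELECT step above still exists */
proc freq data=heart2;
    /* Rows: selected (1 = training, 0 = test); columns: status */
    tables selected*status / nocol nopercent;
run;
```

Reading across each row gives the share of every status category in the training and test subsets side by side.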

If you compare the distribution (percent) of the dependent variable (status), you will find that it is not identical in the training and test datasets: the same status category accounts for 38.5% of the training dataset and 37.58% of the test dataset. The difference is not huge here. However, with imbalanced datasets, where one category is significantly more prevalent than the others, simple random sampling can leave the categories disproportionately represented in the two subsets. This can lead to biased model performance evaluation and potentially poor generalization to new data. Stratified sampling addresses this issue by keeping the proportion of each category in the training and test datasets close to that of the original dataset. We will cover this in the next section.

Method 2: Stratified Sampling using PROC SURVEYSELECT

The benefit of stratified sampling is that it keeps the distribution of your dependent variable essentially the same in the training and test datasets.

The following SAS code performs stratified random sampling using PROC SURVEYSELECT and then splits the data into training and test datasets. The statement strata status; specifies that the variable "status" is the stratification variable. This means that the sampling is performed separately within each unique value of "status". Make sure the input dataset is sorted by the strata variable before running PROC SURVEYSELECT.

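/* The input must be sorted by the strata variable */
proc sort data=sashelp.heart out=heart;
    by status;
run;

/* Sample 70% within each value of status */
proc surveyselect data=heart rate=0.7 outall out=heart2 seed=1234;
    strata status;
run;

/* Route flagged rows to training and the rest to test */
data heart_train heart_test;
    set heart2;
    if selected = 1 then output heart_train;
    else output heart_test;
    drop selected;
run;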

proc freq data=heart_train;
table status;
run;

proc freq data=heart_test;
table status;
run;

The distribution of status is now the same in the training and test datasets: 38.22% in both.
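
The same pattern extends to more than one stratification variable. The sketch below stratifies on both status and sex. It assumes that the sex variable in sashelp.heart is a suitable second stratum for your analysis, and the dataset names heart_sorted and heart_strat2 are hypothetical names chosen for this illustration.

```sas
/* Sort by every strata variable, in the order used in the STRATA statement */
proc sort data=sashelp.heart out=heart_sorted;
    by status sex;
run;

/* Sample 70% within each combination of status and sex */
proc surveyselect data=heart_sorted rate=0.7 outall out=heart_strat2 seed=1234;
    strata status sex;
run;

/* Check both variables: the row percentages should be close across selected = 0 and 1 */
proc freq data=heart_strat2;
    tables selected*status selected*sex / nocol nopercent;
run;
```

Keep in mind that each added strata variable makes the individual strata smaller, so this works best when every combination still contains a reasonable number of observations.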

Method 3: Simple Random Sampling using ranuni() Function

The following code shows how to split a dataset into training and testing datasets using the ranuni function. This function is used to generate a random number for each observation, which is then used to shuffle the dataset before the split.

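/* Attach a uniform random number to each observation */
data heart2;
    set sashelp.heart;
    n = ranuni(1234);
run;

/* Sorting by the random number shuffles the dataset */
proc sort data=heart2;
    by n;
run;

/* The first 70% of the shuffled rows go to training, the rest to test */
data heart_train heart_test;
    set heart2 nobs=nobs;
    if _n_ <= .7*nobs then output heart_train;
    else output heart_test;
    drop n;
run;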

The nobs= option on the set statement stores the total number of observations in the heart2 dataset in the variable nobs. The _n_ variable holds the current iteration number of the data step, which here matches the observation number. The condition if _n_ <= 0.7 * nobs determines whether an observation belongs to the training dataset (heart_train). If the condition is met, the observation is output to heart_train; otherwise, it is output to heart_test.
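
SAS documentation recommends the newer rand function over ranuni, so a variant of this method is sketched below. It is not the approach used above: instead of shuffling and cutting at exactly 70%, it assigns each observation to the training dataset with probability 0.7 in a single data step. The resulting split is therefore only approximately 70/30. The seed value 1234 is reused here purely for illustration.

```sas
data heart_train heart_test;
    set sashelp.heart;
    /* Seed the random number stream once, before the first call to rand */
    if _n_ = 1 then call streaminit(1234);
    /* Each observation lands in training with probability 0.7 */
    if rand('uniform') <= 0.7 then output heart_train;
    else output heart_test;
run;
```

This avoids the sort step, which can matter for large datasets, at the cost of not controlling the exact size of each subset.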

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
