This tutorial explains the various methods to split your data into training and test (or validation) datasets in SAS.
Splitting data into training and test datasets is crucial for developing, evaluating, and improving predictive models. The training dataset is used to build and train your predictive model. After training, the model's performance needs to be evaluated on unseen data, which is the role of the test dataset. Evaluating on the test dataset helps you identify and address issues such as overfitting, which occurs when a model performs exceptionally well on the training data but poorly on unseen (test) data.
Method 1: Simple Random Sampling using PROC SURVEYSELECT
By default, PROC SURVEYSELECT performs simple random sampling without replacement. The following code splits the sashelp.heart dataset so that 70% of the observations go into the training dataset and the remaining 30% go into the test dataset. The rate=0.7 option sets the 70% sampling rate, and the outall option keeps every observation in the output dataset heart2, adding a variable named Selected that is 1 for sampled observations and 0 for the rest. The DATA step then writes the selected observations to heart_train and the non-selected observations to heart_test.
/* Flag a 70% simple random sample; outall keeps all rows and adds Selected */
proc surveyselect data=sashelp.heart rate=0.7 outall out=heart2 seed=1234;
run;

/* Route flagged rows to the training set and the rest to the test set */
data heart_train heart_test;
    set heart2;
    if selected = 1 then output heart_train;
    else output heart_test;
    drop selected;
run;
seed=1234: This sets the random number seed for reproducibility. It ensures that the same split is generated in the random sampling every time you run the code.
To validate the split, you can run PROC FREQ to see the number of observations in the two datasets along with the distribution of the dependent variable.
proc freq data=heart_train;
    table status;
run;

proc freq data=heart_test;
    table status;
run;
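The same check can also be done in a single step by cross-tabulating the Selected flag against status before the flag is dropped. The following is a minimal sketch, assuming the heart2 dataset is created with the outall option as in the code above; the nocol and nopercent options only trim the table so that the row percentages (the distribution of status within each group) are easier to read.

```sas
/* Recreate the flagged dataset so this example runs on its own */
proc surveyselect data=sashelp.heart rate=0.7 outall out=heart2 seed=1234;
run;

/* Selected=1 is the training group, Selected=0 is the test group.
   Row percentages show the distribution of status within each group. */
proc freq data=heart2;
    tables selected*status / nocol nopercent;
run;
```

Comparing the two rows side by side makes any difference in the status distribution between the training and test groups easy to spot.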
If you compare the distribution (percent) of the dependent variable (status), you will find that it is not identical in the training and test datasets: the Dead category accounts for 38.5% of the training dataset and 37.58% of the test dataset. The difference is small here, but with imbalanced datasets, where one category is significantly more prevalent than the others, simple random sampling can produce a disproportionate representation of categories in the two subsets. This can lead to biased model performance evaluation and potentially poor generalization to new data.
Stratified sampling addresses this issue by maintaining the same proportion of categories in both the training and test datasets as in the original dataset. We will cover this in the next section.
Method 2: Stratified Sampling using PROC SURVEYSELECT
The benefit of stratified sampling is that the distribution of your dependent variable stays (almost exactly) the same in the training and test datasets as in the original data.
The following SAS code performs stratified random sampling with PROC SURVEYSELECT and then splits the data into training and test datasets. The statement strata status; specifies status as the stratification variable, which means the 70% sample is drawn separately within each unique value of status. The input dataset must be sorted by the strata variable before running PROC SURVEYSELECT, which is why the code starts with PROC SORT.
/* The input must be sorted by the strata variable */
proc sort data=sashelp.heart out=heart;
    by status;
run;

/* Draw a 70% sample within each value of status */
proc surveyselect data=heart rate=0.7 outall out=heart2 seed=1234;
    strata status;
run;

data heart_train heart_test;
    set heart2;
    if selected = 1 then output heart_train;
    else output heart_test;
    drop selected;
run;

/* Compare the distribution of status in the two datasets */
proc freq data=heart_train;
    table status;
run;

proc freq data=heart_test;
    table status;
run;
The distribution of status is now the same in the training and test datasets: the Dead category accounts for 38.22% of each.
Method 3: Simple Random Sampling using ranuni() Function
The following code shows how to split a dataset into training and test datasets using the ranuni function. The function generates a uniform random number between 0 and 1 for each observation, and sorting by that number shuffles the dataset before the split.
/* Attach a uniform random number to each observation (seed 1234) */
data heart2;
    set sashelp.heart;
    n = ranuni(1234);
run;

/* Sorting by the random number shuffles the rows */
proc sort data=heart2;
    by n;
run;

/* The first 70% of the shuffled rows become the training set */
data heart_train heart_test;
    set heart2 nobs=nobs;
    if _n_ <= .7*nobs then output heart_train;
    else output heart_test;
run;
The nobs= option on the SET statement stores the total number of observations in the heart2 dataset in the variable nobs. The _n_ variable represents the current iteration of the DATA step, which here corresponds to the observation number. The condition if _n_ <= 0.7 * nobs determines whether an observation belongs to the training dataset (heart_train). If the condition is met, the observation is output to heart_train; otherwise, it is output to heart_test.
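A variation on this method is sketched below. It is not the approach used above: instead of shuffling and cutting at exactly 70%, it assigns each observation to the training dataset with probability 0.7, so the split is only approximately 70/30 and no sort is needed. It uses the rand function with call streaminit, which SAS documents as the newer alternative to ranuni. The seed value and dataset names are simply carried over from the earlier examples.

```sas
data heart_train heart_test;
    set sashelp.heart;
    /* Seed the random number stream once, before the first call to rand */
    if _n_ = 1 then call streaminit(1234);
    /* Each row goes to training with probability 0.7, so the
       resulting split is approximately (not exactly) 70/30 */
    if rand('uniform') < 0.7 then output heart_train;
    else output heart_test;
run;
```

If you need an exact 70/30 split, the sort-based approach above or PROC SURVEYSELECT is the better fit.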