In the section shown below you describe p0 and p1 but do not reference them in the calculations. Are they supposed to be in the formulas? Thanks.

Correcting Confusion Matrix

Suppose π0 is the proportion of non-events before sampling and π1 is the proportion of events before sampling. ρ1 is the proportion of events after sampling, and ρ0 is the proportion of non-events after sampling.
True proportion of true positives = π1 * sensitivity.
True proportion of true negatives = π0 * specificity
True proportion of false positives = π0 * (1 - specificity)
True proportion of false negatives = π1 * (1 - sensitivity)
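
As a rough illustration of the four formulas above, here is a minimal Python sketch. The function name and the example values (sensitivity 0.80, specificity 0.90, π1 = 0.016) are made up for illustration and are not from the article. Note that ρ0 and ρ1 do not appear: sensitivity and specificity are each computed within one true class, so, assuming the sampling was done separately within events and non-events, they are unaffected by the class mix, and only π0 and π1 are needed to rescale.

```python
def corrected_confusion_matrix(sensitivity, specificity, pi1):
    """Rescale a confusion matrix to the pre-sampling population.

    sensitivity, specificity: measured on the oversampled data.
    pi1: proportion of events before sampling (pi0 = 1 - pi1).
    Returns the four cells as proportions of the true population.
    """
    pi0 = 1.0 - pi1
    return {
        "TP": pi1 * sensitivity,          # true positives
        "FN": pi1 * (1.0 - sensitivity),  # false negatives
        "FP": pi0 * (1.0 - specificity),  # false positives
        "TN": pi0 * specificity,          # true negatives
    }


# Hypothetical example values, chosen only to show the arithmetic.
cm = corrected_confusion_matrix(sensitivity=0.80, specificity=0.90, pi1=0.016)
for cell, proportion in cm.items():
    print(f"{cell}: {proportion:.4f}")
```

The four cells always sum to 1, which is a quick sanity check that the rescaled matrix describes the whole population.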

Hi Deepanshu,
Does this mean that you oversample AFTER you split your train and validation data?

Can't the weight option of proc logistic be used to handle such cases?

Nice work Deepanshu. I just had a small question. Could you please elaborate on why the beta coefficients of the covariates do not change after oversampling?

Hi Deepanshu, if I am using a sample of 150k from the base and my churn rate is 1%, so 1,500 churners (events), do I really need to oversample if I am testing around 30 variables and the final model has fewer than 20 variables? Also, as my probabilities are very low, my confusion matrix is super screwed at a 0.4 cutoff. How do I explain this?

Cheers Deepanshu

Yes, your understanding is correct. A low event rate does not matter if you have enough events relative to the number of variables. This rule applies only to logistic regression. It's not safe to generalize it to all algorithms.

Thanks, this is useful.
Another question: does the event rate matter if you have enough volume of events in the model? I am working on a churn model for telecom (as in the example you have given). The churn (event) rate is 0.7%, but I have around 10,000 events in around 1 million observations. I am testing around 20 variables in the model, and the final model has around 10 variables. My understanding is that if you have enough event volume for the number of independent variables, like the 10K in this case, a low event rate should not matter?

Yes, priorevent = 0.016 is correct. The idea of using a validation dataset is to check the model and the fitted equation derived from the training dataset against data outside training. You have built your model on the training data and now you are checking whether it works well on data it has not seen. If you oversample the validation data as well, it would NOT be a valid way to validate your model, because the real event rate you are trying to predict for the future population is 1.6%. Hope it helps!
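
To make the role of the prior concrete, here is a minimal Python sketch of the standard prior-correction formula for probabilities from a model fitted on oversampled data. It is assumed here, not verified against SAS's implementation, that this is the kind of adjustment priorevent= performs. The function name is hypothetical; pi1 is the event rate before sampling and rho1 the event rate after sampling.

```python
def adjust_probability(p_sample, pi1, rho1):
    """Map a probability from the oversampled model back to the
    original population using the prior event rates.

    p_sample: predicted event probability from the oversampled model.
    pi1: event proportion before sampling (e.g. 0.016).
    rho1: event proportion after sampling (e.g. 0.5 for a 50-50 sample).
    """
    pi0 = 1.0 - pi1
    rho0 = 1.0 - rho1
    event_part = p_sample * (pi1 / rho1)
    non_event_part = (1.0 - p_sample) * (pi0 / rho0)
    return event_part / (event_part + non_event_part)


# A score of 0.5 on a 50-50 sample maps back to the prior event rate.
print(round(adjust_probability(0.5, pi1=0.016, rho1=0.5), 4))
```

This also suggests why a cutoff such as 0.5 that is sensible on a balanced sample becomes far too high once probabilities are expressed on the original 1.6% scale.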

Hi, I have come across a similar problem where I have a 1.4% churn rate (event) for around 3 million obs. I have taken a 50-50 sample (all events and some non-events). So in this case is it correct to use priorevent=0.016 in the score statement (because my event rate was 1.6% before oversampling)? Another question: if I do oversampling on the training data and NOT on the validation data, wouldn't the event rate be too low in the validation dataset for SAS to do validation? Many thanks.

Very well explained. Thanks

Deepanshu it helped a lot!!