This article covers some of the SAS Statistical Business Analyst certification questions with detailed answers. This certification covers some of the most widely used statistical techniques such as ANOVA, linear and logistic regression.
Which SAS program will divide the original data set into 60% training and 40% validation data sets, stratified by county?
A. Training: 50% Validation: 0% Testing: 50%
B. Training: 100% Validation: 0% Testing: 0%
C. Training: 0% Validation: 100% Testing: 0%
D. Training: 50% Validation: 50% Testing: 0%
A marketing campaign will send brochures describing an expensive product to a set of customers.
The cost for mailing and production per customer is $50. The company makes $500 revenue for each sale. What is the profit matrix for a typical person in the population?
What is a drawback to performing data cleansing (imputation, transformations, etc.) on raw data
prior to partitioning the data for honest assessment as opposed to performing the data cleansing
after partitioning the data?
Explanation: As you move along the ROC curve, you get more true positive (Sensitivity) but also more false positive (1-Specificity). It also changes the probability cutoff for scoring as the idea is to maximize the difference between True Positive and False Positive.
Explanation : Multicollinearity implies high correlation between independent variables. High multicollinearity inflates standard error of parameter estimates and makes the interpretation of estimates incorrect.
Which of the following is an assumption of ANOVA?
A. No correlation between any one observation with another.
B. No correlation between independent and dependent variable
C. No correlation between independent variables
D. High correlation between any one observation with another.
Answer : A
You have 50 observations in ANOVA and you calculate the residuals. What will they sum to?
Answer: D
Explanation : Oversampling does not affect sensitivity or specificity measures. It affects Intercept of a model.
A. Option A
B. Option B
C. Option C
D. Option D
Answer: A
Explanation: Profit = Revenue - Cost
Question 15
How c statistics is calculated :
A. percent concordant + (1.5* percent tied)
B. percent concordant + (0.5 * percent tied)
C. percent discordant + (0.5 * percent tied)
D. percent discordant + (1.5* percent tied)
Answer : B
Explanation : c statistics is also called AUC (Area under curve). See the example below -
SAS Output
Percent Concordant 82.3
Percent Discordant 17.5
Percent Tied 0.2
c statistics 0.824 =(82.3/100) + (0.5 * (0.2/100))
To crack the exam, candidates should prepare the following topics. The weightage assigned to each topic is mentioned below :
- Analysis of Variance (ANOVA) - 10%
- Linear Regression - 20%
- Logistic Regression - 25%
- Preparing Inputs for Predictive Models - 20%
- Measuring Model Performance - 25%
There would be 60 multiple-choice questions a candidate has to answer in 2 hours. A candidate must achieve a minimum 68% marks to pass the exam
Question 1
Which of the following two sampling methods are acceptable while splitting data into multiple samples - training, validation and test samples?
A. Simple random sampling without replacement
B. Simple random sampling with replacement
C. Stratified random sampling without replacement
D. Sequential random sampling with replacement
Answer : A, C
Explanation : When we split our data into 3 parts - training, validation and test, we perform sampling without replacement. It means a row can be selected only one time which would either move to training, validation or test sample. In other words, same row can never be found in more than one sample. The opposite of this is sampling with replacement. Why not sampling with replacement? If we perform sampling with replacement, we would not be able to assess model performance correctly because same data points that were used to train model exists in validation or test datasets. The explanation of Stratified Sampling is provided in the next question.
Question 2
Answer : C
Explanation : It is required to sort the variable you want to use to stratify sample before running PROC SURVEYSELECT.
Stratified Sampling helps to keep the initial ratio of events to non-events in both the training and validation data sets. It is important in the case of rare-event model. In this case, we are keeping initial ratio of country variable in both the training and validation sample.
Question 3
In order to perform honest assessment on a predictive model, which is an acceptable division between training, validation, and testing data?
Explanation : It is required to sort the variable you want to use to stratify sample before running PROC SURVEYSELECT.
Stratified Sampling helps to keep the initial ratio of events to non-events in both the training and validation data sets. It is important in the case of rare-event model. In this case, we are keeping initial ratio of country variable in both the training and validation sample.
Question 3
A. Training: 50% Validation: 0% Testing: 50%
B. Training: 100% Validation: 0% Testing: 0%
C. Training: 0% Validation: 100% Testing: 0%
D. Training: 50% Validation: 50% Testing: 0%
Answer : D
Explanation : There is no fixed optimal splitting rule. Some researchers use splitting rule - 70% training and 30% validation. Some use 60% training-20% validation -20% test. It is important to note that 20 to 50% of data should be used as a validation set in order to measure model performance.
Question 4
The cost for mailing and production per customer is $50. The company makes $500 revenue for each sale. What is the profit matrix for a typical person in the population?
Profit Matrix |
Answer : C
Explanation : It is 450 because $500 revenue was generated and $50 mailing cost was incurred when purchase was made and mail was sent. So, profit = 500 - 50 =450. Profit matrix is used to choose optimal predicted probability cutoff. It is more used rather than sensitivity or specificity to decide the cutoff. The optimal cutoff maximizes the total expected profit.
Question 5
prior to partitioning the data for honest assessment as opposed to performing the data cleansing
after partitioning the data?
A. It violates assumptions of the model.
B. It requires extra computational effort and time.
C. It omits the training (and test) data sets from the benefits of the cleansing methods.
D. There is no ability to compare the effectiveness of different cleansing methods.
B. It requires extra computational effort and time.
C. It omits the training (and test) data sets from the benefits of the cleansing methods.
D. There is no ability to compare the effectiveness of different cleansing methods.
Answer : D
Explanation : If we perform data cleaning before splitting data into training and validation datasets, we would not be able to compare models based on different imputations / transformations methods.
Question 6
ROC Curve |
As you move along the ROC curve, what changes?
A. The priors in the population
B. The true negative rate in the population
C. The proportion of events in the training data
D. The probability cutoff for scoring
Answer: D
Explanation: As you move along the ROC curve, you get more true positive (Sensitivity) but also more false positive (1-Specificity). It also changes the probability cutoff for scoring as the idea is to maximize the difference between True Positive and False Positive.
Question 7
How multicollinearity can affect the regression model?
A. Inflate Standard Error of Estimates
B. Deflate Standard Error of Estimates
C Does not affect the model
D Help interpreting Estimates
Answer : A
Explanation : Multicollinearity implies high correlation between independent variables. High multicollinearity inflates standard error of parameter estimates and makes the interpretation of estimates incorrect.
Question 8
Which of the following is an assumption of ANOVA?
A. No correlation between any one observation with another.
B. No correlation between independent and dependent variable
C. No correlation between independent variables
D. High correlation between any one observation with another.
Answer : A
Explanation : The most important assumption of ANOVA is independent observations. It implies the response value of one observation does not influence the response value of another.
Question 9
You have 50 observations in ANOVA and you calculate the residuals. What will they sum to?
A. 50
B. 2500
C. 0
D. -50
Answer : C
Explanation : The residuals always sum to 0 no matter the number of observations in your dataset.
Question 10
If you want to compare the average monthly salary of males and females, which of the following two statistical method should you choose?
A. two sample t-test
B. one sample t-test
C. two way ANOVA
D. one way ANOVA
Answer : A, D
Explanation :
You can use one-way ANOVA and two-sample t-test because you are comparing two groups, males and females. You can use two-way ANOVA when you have more than one independent variable.
Question 11
What values are not affected by oversampling in a rare event model?
A. Predicted Probabilities
B. Intercept
B. Intercept
C. Negative Predicted Value
D. Sensitivity and Specificity
Answer: D
Explanation : Oversampling does not affect sensitivity or specificity measures. It affects Intercept of a model.
Question 12
An analyst has a sufficient volume of data to perform a 3-way partition of the data into training,
validation, and test sets to perform honest assessment during the model building process.
An analyst has a sufficient volume of data to perform a 3-way partition of the data into training,
validation, and test sets to perform honest assessment during the model building process.
What is the purpose of the test data set?
A. To provide a unbiased measure of assessment for the final model.
B. To compare models and select and fine-tune the final model.
C. To reduce total sample size to make computations more efficient.
D. To build the predictive models.
A. To provide a unbiased measure of assessment for the final model.
B. To compare models and select and fine-tune the final model.
C. To reduce total sample size to make computations more efficient.
D. To build the predictive models.
Answer: A
Explanation : The test data set is used to assess model without any biaseness.
Question 13
An analyst generates a model using the LOGISTIC procedure. They are now interested in getting
the sensitivity and specificity statistics on a validation data set for a variety of cutoff values.
Which statement and option combination will generate these statistics?
A. Scoredata=valid1 out=roc;
B. Scoredata=valid1 outroc=roc;
C. mode1resp(event= '1') = gender region/outroc=roc;
D. mode1resp(event"1") = gender region/ out=roc;
Answer: B
Explanation: In PROC LOGISTIC, the OUTROC= option tells SAS to generate data for the ROC curve to the SAS data set named roc.
Question 14
Assume a $10 cost for soliciting a non-responder and a $200 profit for soliciting a responder. The logistic regression model gives a probability score named P_R on a SAS data set called VALID. The VALID data set contains the responder variable Purch, a 1/0 variable coded as 1 for responder. Customers will be solicited when their probability score is more than 0.05.
Question 14
Assume a $10 cost for soliciting a non-responder and a $200 profit for soliciting a responder. The logistic regression model gives a probability score named P_R on a SAS data set called VALID. The VALID data set contains the responder variable Purch, a 1/0 variable coded as 1 for responder. Customers will be solicited when their probability score is more than 0.05.
Which SAS program computes the profit for each customer in the data set VALID?
SAS Certified Statistical Business Analyst Questions |
A. Option A
B. Option B
C. Option C
D. Option D
Answer: A
Explanation: Profit = Revenue - Cost
Question 15
How c statistics is calculated :
A. percent concordant + (1.5* percent tied)
B. percent concordant + (0.5 * percent tied)
C. percent discordant + (0.5 * percent tied)
D. percent discordant + (1.5* percent tied)
Answer : B
Explanation : c statistics is also called AUC (Area under curve). See the example below -
SAS Output
Percent Concordant 82.3
Percent Discordant 17.5
Percent Tied 0.2
c statistics 0.824 =(82.3/100) + (0.5 * (0.2/100))
These questions were very useful. Thank you.
ReplyDeleteGood post. Many thanks
ReplyDeletegood one.. Thank you so much
ReplyDeleteThis is really helpful.I didn't find explanation anywhere.Thanks much buddy!
ReplyDeleteVery informative. Really enjoyed it! Thank you!
ReplyDeleteThe following LOGISTIC procedure output analyzes the relationship between a binary response and an ordinal predictor variable, wrist_size Using reference cell coding, the analyst selects Large (L) as the reference level.
ReplyDeleteAnalysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 -1.0415 0.4749 4.8101 0.0283
Wrist_size M 1 1.1234 0.4989 5.0697 0.0243
Wrist_size S 1 1.6078 0.5478 8.6133 0.0033
What is the estimated log it for a person with large wrist size?
You can use calculator if needed.
Can you explain this?
Thanks
NS Khan, the question is asking for the estimated logit for Large Wrist size, which (as noted) is the reference level. Reference level is the Intercept in the logistic procedure output.
DeleteSo we can find the intercept in the Parameter column, and find it's logit estimated value in the Estimate column; and we find that the answer is -1.0415.
The question is testing:
1.) If you understand that the reference level = the intercept
2.) If you know in which output column to find the estimated logit.
Your content is highly valuable.
ReplyDeletevery nice info sir. good luck
ReplyDeletehun paigya kamzor m
ReplyDeleteInformative Content. Please Keep Sharing Thanks!
ReplyDelete