This article covers some of the SAS Statistical Business Analyst certification questions with detailed answers. This certification covers some of the most widely used statistical techniques such as ANOVA, linear and logistic regression.

To crack the exam, candidates should prepare the following topics. The weightage assigned to each topic is mentioned below :

- Analysis of Variance (ANOVA) - **10%**
- Linear Regression - **20%**
- Logistic Regression - **25%**
- Preparing Inputs for Predictive Models - **20%**
- Measuring Model Performance - **25%**

**The exam has 60 multiple-choice questions to be answered in 2 hours. A candidate must score at least 68% to pass the exam.**

**Question 1**

Which of the following **two** sampling methods are acceptable while splitting data into multiple samples - training, validation and test samples?

A. Simple random sampling without replacement

B. Simple random sampling with replacement

C. Stratified random sampling without replacement

D. Sequential random sampling with replacement

**Answer :** A, C

**Explanation :** When we split our data into three parts (training, validation and test), we perform **sampling without replacement**. Each row can be selected only once, so it goes to exactly one of the training, validation or test samples and never appears in more than one. The opposite of this is **sampling with replacement**. Why not sample with replacement? Because the same data points that were used to train the model could also end up in the validation or test datasets, and we would not be able to assess model performance correctly. **Stratified sampling** is explained in the next question.
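To make the idea of sampling without replacement concrete, here is a minimal Python sketch (an illustration only, not SAS code from the exam). The function name, the 60/20/20 fractions and the seed are assumptions chosen for the example. The rows are shuffled once and then sliced, so no row can land in two samples.

```python
import random

def split_without_replacement(rows, fractions=(0.6, 0.2, 0.2), seed=42):
    """Partition rows into training, validation and test samples.

    The rows are shuffled once and then sliced, so every row ends up
    in exactly one of the three samples.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

# Hypothetical data: 100 row ids.
train, valid, test = split_without_replacement(range(100))
print(len(train), len(valid), len(test))  # 60 20 20
```

Because the three lists are slices of one shuffled list, their intersection is always empty, which is exactly what honest assessment needs.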

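**Question 2**

Which SAS program will divide the original data set into 60% training and 40% validation data sets, stratified by county?

**Answer : C**

**Explanation :** The data set must be sorted by the variable you want to stratify on before running **PROC SURVEYSELECT** with a STRATA statement.

Stratified sampling keeps the distribution of the stratification variable the same in every partition. It is especially important for rare-event models, where it preserves the initial ratio of events to non-events in both the training and validation data sets. In this case, we are keeping the initial ratio of the **county variable** in **both the training and validation samples.**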
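The logic of a stratified 60/40 split can be sketched in a few lines of Python. This is only an illustration of the idea, not the SAS program from the answer options; the function name, the `county` values and the row counts are made up for the example.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, train_rate=0.6, seed=42):
    """Split rows into training/validation, sampling within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)  # group rows by the stratification variable
    train, valid = [], []
    for members in strata.values():
        rng.shuffle(members)  # random order inside the stratum
        cut = round(len(members) * train_rate)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Hypothetical data: 50 rows in county "A", 20 in "B", 10 in "C".
rows = [{"id": i, "county": c}
        for i, c in enumerate(["A"] * 50 + ["B"] * 20 + ["C"] * 10)]
train, valid = stratified_split(rows, key="county")

# Every county contributes 60% of its rows to the training sample.
train_counts = {c: sum(1 for r in train if r["county"] == c) for c in ("A", "B", "C")}
print(train_counts)  # {'A': 30, 'B': 12, 'C': 6}
```

Each county keeps the same share of rows in both partitions, which is what the STRATA statement achieves in PROC SURVEYSELECT.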

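**Question 3**

In order to perform honest assessment on a predictive model, which is an acceptable division between training, validation, and testing data?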

A. Training: 50% Validation: 0% Testing: 50%

B. Training: 100% Validation: 0% Testing: 0%

C. Training: 0% Validation: 100% Testing: 0%

D. Training: 50% Validation: 50% Testing: 0%

**Answer : D**

**Explanation :** There is no fixed optimal splitting rule. Some researchers use 70% training and 30% validation; others use 60% training, 20% validation and 20% test. It is important to note that 20 to 50% of the data should be used as a validation set in order to measure model performance. Option D is the only choice that has both a training and a validation partition.

**Question 4**

A marketing campaign will send brochures describing an expensive product to a set of customers.

The cost for mailing and production per customer is $50. The company makes $500 revenue for each sale. What is the profit matrix for a typical person in the population?

[Image: Profit Matrix]

**Answer : C**

**Explanation :** The profit is $450 because when the brochure is mailed and a purchase is made, $500 of revenue is generated and $50 of mailing cost is incurred. So, profit = 500 - 50 = 450. The profit matrix is used to choose the optimal predicted probability cutoff, and it is often preferred over sensitivity or specificity for that purpose. The optimal cutoff maximizes the total expected profit.
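The arithmetic behind the profit matrix, and the way it leads to a cutoff, can be sketched in Python as follows. This is a rough illustration: the dictionary layout and names are made up, and the zero profit for customers who are not mailed is an assumption, since the original matrix image is not available here.

```python
COST_PER_MAILING = 50    # mailing and production cost per customer
REVENUE_PER_SALE = 500   # revenue for each sale

# profit_matrix[decision][outcome]
profit_matrix = {
    "solicit": {
        "buy": REVENUE_PER_SALE - COST_PER_MAILING,  # 500 - 50 = 450
        "no_buy": -COST_PER_MAILING,                 # mailing cost is lost
    },
    # Assumption: customers who are not mailed bring neither cost nor revenue.
    "do_not_solicit": {"buy": 0, "no_buy": 0},
}

def expected_profit(p_buy, decision):
    """Expected profit of a decision for a customer who buys with probability p_buy."""
    row = profit_matrix[decision]
    return p_buy * row["buy"] + (1 - p_buy) * row["no_buy"]

# Soliciting pays off when p * 450 - (1 - p) * 50 > 0, i.e. when p > 50 / 500.
break_even_cutoff = COST_PER_MAILING / REVENUE_PER_SALE
print(break_even_cutoff)  # 0.1
```

Under these assumptions, mailing a customer is worthwhile only when the predicted purchase probability is above 0.1, which is how a profit matrix turns into a probability cutoff.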

**Question 5**

What is a drawback to performing data cleansing (imputation, transformations, etc.) on raw data prior to partitioning the data for honest assessment as opposed to performing the data cleansing after partitioning the data?

A. It violates assumptions of the model.

B. It requires extra computational effort and time.

C. It omits the training (and test) data sets from the benefits of the cleansing methods.

D. There is no ability to compare the effectiveness of different cleansing methods.

**Answer : D**

**Explanation :** If we perform data cleaning before splitting the data into training and validation datasets, we would not be able to compare models based on different imputation / transformation methods.

**Question 6**

[Image: ROC Curve]

As you move along the ROC curve, what changes?

A. The priors in the population

B. The true negative rate in the population

C. The proportion of events in the training data

D. The probability cutoff for scoring

**Answer: D**

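**Explanation:** Each point on the ROC curve corresponds to one probability cutoff. As you move along the curve toward lower cutoffs, you get more true positives (higher sensitivity) but also more false positives (higher 1 - specificity). So what changes is the probability cutoff for scoring; the idea is to pick a cutoff that maximizes the difference between the true positive rate and the false positive rate.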
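A small Python sketch shows how each cutoff produces one point on the ROC curve. The labels, scores and cutoffs are made-up values, and the function name is only for illustration.

```python
def roc_points(labels, scores, cutoffs):
    """Return (cutoff, false positive rate, true positive rate) per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in cutoffs:
        # A case is predicted as an event when its score reaches the cutoff.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
        points.append((cutoff, fp / negatives, tp / positives))
    return points

# Hypothetical validation data: 4 events and 4 non-events.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

points = roc_points(labels, scores, cutoffs=[0.75, 0.5, 0.25])
for cutoff, fpr, tpr in points:
    print(cutoff, fpr, tpr)
# 0.75 0.0 0.5
# 0.5 0.25 0.75
# 0.25 0.75 0.75
```

The data and the model stay the same in every row of the output; only the cutoff moves, and with it the position on the curve.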

**Question 7**

How can multicollinearity affect the regression model?

A. Inflate Standard Error of Estimates

B. Deflate Standard Error of Estimates

C. Does not affect the model

D. Help interpreting Estimates

**Answer : A**

**Explanation :** Multicollinearity implies high correlation between independent variables. High multicollinearity inflates the standard errors of the parameter estimates, which makes the estimates unstable and hard to interpret.
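One common way to quantify this inflation is the variance inflation factor (VIF). For the special case of a model with only two predictors, the VIF of each predictor is 1 / (1 - r²), where r is the correlation between them, and the standard error of the estimate grows by a factor of sqrt(VIF). The sketch below assumes that two-predictor case; the data values are made up.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictors.
x1 = [1, 2, 3, 4, 5, 6]
x2_weak = [3, 1, 4, 1, 5, 2]                   # weakly related to x1
x2_collinear = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]  # almost a copy of x1

vif_weak = vif_two_predictors(x1, x2_weak)
vif_collinear = vif_two_predictors(x1, x2_collinear)
print(round(vif_weak, 2), round(vif_collinear, 1))
# The standard error is inflated by sqrt(VIF).
print(round(math.sqrt(vif_collinear), 1))
```

With the nearly collinear predictor the VIF is far above the usual rule-of-thumb threshold of 10, while the weakly related predictor stays close to 1.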

**Question 8**

Which of the following is an assumption of ANOVA?

A. No correlation between any one observation with another.

B. No correlation between independent and dependent variable

C. No correlation between independent variables

D. High correlation between any one observation with another.

**Answer : A**

**Explanation :** The most important assumption of ANOVA is that the observations are independent. It implies that the response value of one observation does not influence the response value of another.

**Question 9**

You have 50 observations in ANOVA and you calculate the residuals. What will they sum to?

A. 50

B. 2500

C. 0

D. -50

**Answer : C**

**Explanation :** The residuals always sum to 0 no matter the number of observations in your dataset, because each residual is the deviation of an observation from its own group mean.
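This is easy to check numerically. The Python sketch below computes one-way ANOVA residuals as the difference between each observation and its group mean; the group labels and values are made up.

```python
from collections import defaultdict

def anova_residuals(groups, values):
    """Residual of each observation = value minus the mean of its group."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    means = {g: sum(vs) / len(vs) for g, vs in by_group.items()}
    return [v - means[g] for g, v in zip(groups, values)]

# Hypothetical data: three groups of unequal size.
groups = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
values = [12.0, 15.5, 11.0, 20.0, 23.5, 7.0, 9.5, 8.0, 6.5]

residuals = anova_residuals(groups, values)
print(abs(sum(residuals)) < 1e-9)  # True (zero up to floating-point rounding)
```

The residuals cancel out inside every group, so the total is zero whether there are 9 observations or 50.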

**Question 10**

If you want to compare the average monthly salary of males and females, which of the following **two** statistical methods should you choose?

A. two sample t-test

B. one sample t-test

C. two way ANOVA

D. one way ANOVA

**Answer : A, D**

**Explanation :** You can use either one-way ANOVA or the two-sample t-test because you are comparing the means of two independent groups, males and females. You would use two-way ANOVA when you have two independent variables (factors).
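With exactly two groups, the pooled (equal-variance) two-sample t-test and one-way ANOVA are the same test: the F statistic equals the square of the t statistic. The Python sketch below checks this on made-up salary figures; the function names are only for illustration.

```python
import math

def pooled_t(a, b):
    """Two-sample t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical monthly salaries.
males = [4200, 3900, 4500, 4100, 4300]
females = [4000, 3800, 4400, 3950, 4150]

t_stat = pooled_t(males, females)
f_stat = one_way_f([males, females])
print(round(t_stat ** 2, 4), round(f_stat, 4))  # both 0.9631
```

Both methods therefore lead to the same p-value and the same conclusion when there are only two groups.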

**Question 11**

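What values are not affected by oversampling in a rare event model?

A. Predicted Probabilities

B. Intercept

C. Negative Predictive Value

D. Sensitivity and Specificity

**Answer: D**

**Explanation :** Oversampling does not affect sensitivity or specificity measures, because they are computed separately within the events and within the non-events, so changing the mix of the two groups does not change them. It affects the intercept of a model, and therefore the predicted probabilities and the negative predictive value.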
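Because only the intercept is shifted, predictions from a model built on an oversampled data set can be corrected back to the population scale. The Python sketch below uses the standard prior correction; the event rates (50% in the sample, 2% in the population) and the function names are assumptions for the example.

```python
import math

def intercept_offset(sample_event_rate, population_event_rate):
    """Amount by which oversampling shifts the intercept (log-odds scale)."""
    rho1, pi1 = sample_event_rate, population_event_rate
    rho0, pi0 = 1 - rho1, 1 - pi1
    return math.log((rho1 * pi0) / (rho0 * pi1))

def adjust_probability(p_oversampled, sample_event_rate, population_event_rate):
    """Rescale a predicted probability from the oversampled to the population scale."""
    rho1, pi1 = sample_event_rate, population_event_rate
    rho0, pi0 = 1 - rho1, 1 - pi1
    numerator = p_oversampled * pi1 / rho1
    denominator = numerator + (1 - p_oversampled) * pi0 / rho0
    return numerator / denominator

# Hypothetical rates: events are 50% of the modeling sample but 2% of the population.
offset = intercept_offset(0.5, 0.02)
p_adjusted = adjust_probability(0.5, 0.5, 0.02)
print(round(offset, 3))      # 3.892, i.e. log(49)
print(round(p_adjusted, 4))  # 0.02
```

A score of 0.5 on the oversampled data corresponds to the base rate of 0.02 in the population, which shows how strongly the predicted probabilities depend on the sampling, unlike sensitivity and specificity.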

**Question 12**

An analyst has a sufficient volume of data to perform a 3-way partition of the data into training, validation, and test sets to perform honest assessment during the model building process.

What is the purpose of the test data set?

A. To provide an unbiased measure of assessment for the final model.

B. To compare models and select and fine-tune the final model.

C. To reduce total sample size to make computations more efficient.

D. To build the predictive models.

**Answer: A**

**Explanation :** The test data set is used to assess the final model without any bias, because it plays no part in building or selecting the model.

**Question 13**

An analyst generates a model using the LOGISTIC procedure. They are now interested in getting the sensitivity and specificity statistics on a validation data set for a variety of cutoff values.

Which statement and option combination will generate these statistics?

A. Score data=valid1 out=roc;

B. Score data=valid1 outroc=roc;

C. model resp(event='1') = gender region / outroc=roc;

D. model resp(event='1') = gender region / out=roc;

**Answer: B**

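**Explanation:** In PROC LOGISTIC, **the OUTROC= option on the SCORE statement writes the ROC curve data (sensitivity and 1 - specificity at each cutoff) for the scored validation data set to the SAS data set named roc**. The same option on the MODEL statement (option C) produces these statistics for the data used to fit the model instead.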

**Question 14**

Assume a $10 cost for soliciting a non-responder and a $200 profit for soliciting a responder. The logistic regression model gives a probability score named P_R on a SAS data set called VALID. The VALID data set contains the responder variable Purch, a 1/0 variable coded as 1 for responder. Customers will be solicited when their probability score is more than 0.05.

Which SAS program computes the profit for each customer in the data set VALID?

[Image: SAS Certified Statistical Business Analyst Questions]

A. Option A

B. Option B

C. Option C

D. Option D

**Answer: A**

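**Explanation:** Profit = Revenue - Cost. A customer is solicited only when P_R is more than 0.05. A solicited responder (Purch = 1) yields a profit of $200, and a solicited non-responder (Purch = 0) costs $10.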
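The decision rule in this question can be written out as a short Python sketch. It illustrates the logic only and is not the SAS program from the answer options. The field names P_R and Purch come from the question; the sample rows, and the zero profit for customers who are not solicited, are assumptions.

```python
def customer_profit(p_r, purch, cutoff=0.05, responder_profit=200, non_responder_cost=10):
    """Profit for one customer under the solicitation rule P_R > cutoff."""
    if p_r > cutoff:
        # Solicited: gain for a responder, loss for a non-responder.
        return responder_profit if purch == 1 else -non_responder_cost
    # Assumption: a customer who is not solicited contributes nothing.
    return 0

# Hypothetical rows standing in for the VALID data set.
valid = [
    {"P_R": 0.30, "Purch": 1},
    {"P_R": 0.10, "Purch": 0},
    {"P_R": 0.02, "Purch": 1},
    {"P_R": 0.05, "Purch": 0},  # exactly 0.05 is not "more than 0.05"
]

profits = [customer_profit(row["P_R"], row["Purch"]) for row in valid]
print(profits)       # [200, -10, 0, 0]
print(sum(profits))  # 190
```

Summing the per-customer profit over the validation data gives the total profit at this cutoff, which can then be compared across different cutoffs.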

**Question 15**

How is the c statistic calculated?

A. percent concordant + (1.5 * percent tied)

B. percent concordant + (0.5 * percent tied)

C. percent discordant + (0.5 * percent tied)

D. percent discordant + (1.5 * percent tied)

**Answer : B**

**Explanation :** The c statistic is also called AUC (area under the ROC curve). See the example below -

**SAS Output**

Percent Concordant 82.3

Percent Discordant 17.5

Percent Tied 0.2

c statistic 0.824 = (82.3/100) + (0.5 * (0.2/100))
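The concordant, discordant and tied percentages come from comparing every event with every non-event. A minimal Python sketch of that pairwise count, using made-up labels and scores, looks like this (the function names are only for illustration):

```python
def c_statistic(labels, scores):
    """c = (concordant + 0.5 * tied) / (number of event/non-event pairs)."""
    events = [s for y, s in zip(labels, scores) if y == 1]
    non_events = [s for y, s in zip(labels, scores) if y == 0]
    concordant = discordant = tied = 0
    for e in events:
        for ne in non_events:
            if e > ne:
                concordant += 1   # event scored higher than non-event
            elif e < ne:
                discordant += 1   # event scored lower than non-event
            else:
                tied += 1
    pairs = len(events) * len(non_events)
    return (concordant + 0.5 * tied) / pairs

def c_from_percentages(pct_concordant, pct_tied):
    """Same formula, starting from the percentages in the SAS output."""
    return (pct_concordant + 0.5 * pct_tied) / 100

# Hypothetical data: 3 events and 3 non-events give 9 pairs
# (7 concordant, 1 discordant, 1 tied).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.7, 0.3, 0.2]

c_value = c_statistic(labels, scores)
print(round(c_value, 4))                          # 0.8333
print(round(c_from_percentages(82.3, 0.2), 3))    # 0.824
```

The second call reproduces the 0.824 from the SAS output above.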
