How do I use test and train datasets in SPSS?
Answer
In ALL cases below, replace _______ with the name of your binary dependent variable. No other changes in code should be necessary.
1. Create Testing and Training Groups
a. Create a random sample indicator.
This will set a seed for the random number generator and create a new variable with random numbers between 0 and 1.
SET SEED=123456.
COMPUTE random_sample = RV.UNIFORM(0,1).
EXECUTE.
b. Select 30% of the cases.
This will create a variable, train_dv, that holds the original value of your dependent variable, then replace that value with a system-missing value (.) whenever the random number generated in the previous step is 0.3 or less. This should be about 30% of the cases. To hold out a different percentage of cases, change 0.3 to another value. The cases with the missing value become your "test" set, which is flagged with a 1 in the test variable.
COMPUTE train_dv = __________.
IF (random_sample <= 0.3) train_dv = $SYSMIS.
COMPUTE test = SYSMIS(train_dv).
EXECUTE.
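A quick check of the split can be useful before moving on. This is a minimal sketch that only tabulates the test variable created above; it is an optional extra, not part of the original steps.

```spss
* Optional check: roughly 30 percent of cases should have test = 1.
FREQUENCIES VARIABLES=test.
```

The share will rarely be exactly 30%, because each case is assigned at random.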
2. Prepare the logistic regression model
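You will rerun your logistic regression with train_dv as the dependent variable in place of the original one. Cases with a missing train_dv are left out of the estimation, so the model is fit on the training cases only. Copy the original code and make the following changes: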
- Change the dependent variable to train_dv in the first line.
- Save the model predictions by adding the line starting with /SAVE. If your code already has a /SAVE line, add PRED and PGROUP to it as shown.
LOGISTIC REGRESSION VARIABLES train_dv /SAVE=PRED PGROUP COOK ZRESID ....
Using the menu:
- In Analyze > Regression > Binary Logistic, set up the regression you are evaluating with the following additions:
- Put train_dv in the Dependent variable box.
- Click Save and check Probabilities and Group membership in the Predicted Values group.
The /SAVE option will create some new variables, including:
- PGR_1 : Predicted Group
- PRE_1 : Predicted Probability
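Since the later steps rely on predictions being present for the held-out cases, it may be worth confirming that before testing. The sketch below assumes the saved variable is named PRE_1 and that test was created in step 1; it is an optional check, not part of the original steps.

```spss
* Optional check: count and mean of saved probabilities in each group.
* Rows are test = 0 (training cases) and test = 1 (test cases).
MEANS TABLES=PRE_1 BY test
  /CELLS=COUNT MEAN.
```

If the count for test = 1 is much smaller than the number of test cases, some of them are probably missing values on one or more predictors.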
Testing Model Fit
Select the Testing Data
This will make SPSS use only those cases with a 1 in the test variable.
FILTER BY test.
EXECUTE.
Using the menu:
- Go to Data > Select Cases.
- Select Use filter variable.
- Put the test variable into the box.
- Keep Filter out unselected cases selected.
Classification Accuracy
- Compare these predictions with the actual outcomes in the test set.
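- Calculate the percentage of correctly classified cases: add the counts in the cells where the actual and predicted values match, then divide by the total number of test cases.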
CROSSTABS /TABLES=__________ BY PGR_1 /FORMAT=AVALUE TABLES /CELLS=COUNT ROW /COUNT ROUND CELL.
Using the menu:
- Go to Analyze > Descriptive Statistics > Crosstabs.
- Put the actual outcomes (original DV) in the Row box and the predicted outcome/group (PGR_1) in the Column box.
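As an alternative to adding up cells by hand, the overall accuracy can be read from a single frequency table. This is a minimal sketch: correct is a hypothetical helper variable that is not part of the original steps, and the blank must be replaced with your dependent variable as elsewhere in this answer.

```spss
* Assumption: correct is a hypothetical helper variable added for illustration.
* It equals 1 when the predicted group matches the actual outcome, otherwise 0.
COMPUTE correct = (__________ = PGR_1).
EXECUTE.
FREQUENCIES VARIABLES=correct.
```

With the filter on, the valid percent for the value 1 is the share of test cases classified correctly. If you create this variable, add correct to the DELETE VARIABLES list when you clean up.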
AUC (Area Under the Curve)
- See the AUC calculation in the output.
ROC PRE_1 BY __________(1) /PLOT=CURVE(REFERENCE) /PRINT=SE /CRITERIA=CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95).
Using the menu:
- Go to Analyze > Classify > ROC Curve.
- Put the predicted probabilities (PRE_1) into the Test Variable box.
- Put the actual outcomes (original DV) into the State Variable box.
- Type 1 in the Value of State Variable box.
- In the Display group, check With diagonal reference line and Standard error and confidence interval.
Hosmer–Lemeshow Test
- See the test result in the output.
LOGISTIC REGRESSION VARIABLES _______ /METHOD=ENTER PRE_1 /PRINT=GOODFIT.
Using the menu:
- Go to Analyze > Regression > Binary Logistic.
- Select the actual outcomes (original DV) as the dependent variable.
- Select the predicted probabilities (PRE_1) as the independent variable.
- Click Options and select Hosmer–Lemeshow goodness-of-fit.
Return to Normal
Turn off the filter:
FILTER OFF.
USE ALL.
EXECUTE.
If desired, remove the extra variables:
DELETE VARIABLES test train_dv random_sample PRE_1 PGR_1.