How do I use test and train datasets in SPSS?

Answer

In ALL cases below, replace _______ with the name of your binary dependent variable. No other changes in code should be necessary.

1. Create Testing and Training Groups

a. Create a random sample indicator.

This will set a seed for the random number generator and create a new variable with random numbers between 0 and 1.

SET SEED=123456. 
COMPUTE random_sample = RV.UNIFORM(0,1).
EXECUTE.

b. Select 30% of the cases.

This creates a variable that holds the original value of your dependent variable, then replaces that value with a system-missing value (.) if the random number generated in the previous step is less than or equal to 0.3. This should affect about 30% of the cases. To hold out a different percentage, change 0.3 to another value. The observations with the missing value become your "test" set, which is flagged by a 1 in the test variable.

COMPUTE train_dv = __________ .
IF (random_sample <= 0.3) train_dv = $SYSMIS.
COMPUTE test = SYSMIS(train_dv).
EXECUTE.
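Before moving on, it can help to confirm that the split came out roughly as intended. The lines below are a minimal sketch of one way to check, not part of the original procedure. They use only the test variable created above and the same _______ placeholder for your dependent variable.

```spss
* Share of cases in each group, where test = 1 is the held-out set and test = 0 is the training set.
FREQUENCIES VARIABLES=test.

* Compare the outcome distribution in the two groups using row percentages.
CROSSTABS
  /TABLES=test BY __________
  /CELLS=COUNT ROW.
```

Because the split is random, the test group will be close to 30% but rarely exactly 30%. Cases whose dependent variable was already missing will most likely also be flagged with test = 1, so check for those if your data have missing outcomes.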

2. Prepare the logistic regression model

You will be re-doing your logistic regression with train_dv as your dependent variable instead. Copy the original code and make the following changes:

Change the dependent variable to train_dv in the first line.
Save the model predictions by adding the line starting with /SAVE. If your code already has a /SAVE line, add PRED and PGROUP to it as shown.

LOGISTIC REGRESSION VARIABLES train_dv 
/SAVE=PRED PGROUP COOK ZRESID 
....

Using the menu:

In Analyze > Regression > Binary Logistic set up the regression you are evaluating with the following additions:
Put train_dv in the Dependent variable box.
Click Save and check Probabilities and Group membership in the Predicted Values group.

The /SAVE option will create some new variables, including:

PGR_1 : Predicted Group
PRE_1 : Predicted Probability
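For illustration only, a complete command might look like the sketch below. The predictors age and income are hypothetical stand-ins for whatever independent variables your original model uses. The CUT value is written out to show where the classification threshold for PGR_1 is set (0.5 is the default).

```spss
* Hypothetical example where age and income stand in for your own predictors.
LOGISTIC REGRESSION VARIABLES train_dv
  /METHOD=ENTER age income
  /SAVE=PRED PGROUP
  /CRITERIA=CUT(0.5).
```

If PRE_1 or PGR_1 already exist in the dataset from an earlier run, SPSS increments the suffix (PRE_2, PGR_2, and so on). In that case, use the new names in the steps that follow.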

Testing Model Fit

Select the Testing Data

This will make SPSS use only those cases with a 1 in the test variable.

FILTER BY test.
EXECUTE.

Using the menu:

Go to Data > Select Cases.
Select Use filter variable.
Put the test variable into the box.
Keep Filter out unselected cases selected.

Classification Accuracy

Compare these predictions with the actual outcomes in the test set.
Calculate the percentage of correctly classified cases.

CROSSTABS
  /TABLES=__________ BY PGR_1
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT ROW  
  /COUNT ROUND CELL.

Using the menu:

Go to Analyze > Descriptive Statistics > Crosstabs.
Put the actual outcomes (original DV) in the Row and the predicted outcome/group (PGR_1) in the Column.
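If you would rather have the overall accuracy as a single number than add up the diagonal of the crosstab, one possible approach is sketched below. The variable correct is a hypothetical helper that is not part of the original steps. The sketch assumes the filter on test is still active and that your dependent variable uses the same codes as PGR_1.

```spss
* Hypothetical helper variable that is 1 when the prediction matches the actual outcome and 0 otherwise.
COMPUTE correct = (__________ = PGR_1).
EXECUTE.

* With the filter on, the mean of correct is the proportion of test cases classified correctly.
DESCRIPTIVES VARIABLES=correct
  /STATISTICS=MEAN.
```

Multiply the mean by 100 to get the percentage of correctly classified cases. If you create correct, remember to remove it along with the other extra variables when you are done.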

AUC (Area Under the Curve)

The AUC is reported in the Area Under the Curve table of the output.

ROC PRE_1 BY __________(1)
  /PLOT=CURVE(REFERENCE)
  /PRINT=SE
  /CRITERIA=CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95).

Using the menu:

Go to Analyze > Classify > ROC Curve.
Put the predicted probabilities (PRE_1) into the Test Variable box.
Put the actual outcomes (original DV) into the State Variable box.
Type 1 in the Value of State Variable box.
In the Display group, check With diagonal reference line and Standard error and confidence interval.

Hosmer–Lemeshow Test

The test result is reported in the Hosmer and Lemeshow Test table of the output.

LOGISTIC REGRESSION VARIABLES _______
  /METHOD=ENTER PRE_1
  /PRINT=GOODFIT.

Using the menu:

Go to Analyze > Regression > Binary Logistic.
Select the actual outcomes (original DV) as the dependent variable.
Select the predicted probabilities (PRE_1) as the independent variable.
Click Options and select Hosmer–Lemeshow goodness-of-fit.

Return to Normal

Turn off the filter:

USE ALL.

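If desired, remove the extra variables. If you kept COOK and ZRESID on the /SAVE line, the variables COO_1 and ZRE_1 were also created and can be added to this list: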

DELETE VARIABLES test train_dv random_sample PRE_1 PGR_1.

Last Updated Apr 29, 2025
Answered By Debby
