Splines in Stata traj models

On my prior trajectories using Stata post, Nandita Krishnan asks if we can estimate trajectory groups using linear splines (instead of polynomials). I actually prefer restricted cubic splines over polynomials, so it wouldn’t bother me if people did these as a default replacement for the polynomial functions. Also see this paper by Francis et al. using b-splines.

In the Stata traj library, you can just use time varying covariates to swap out the higher order polynomials with the spline effects (whether linear, non-linear, cyclic, etc.) you want. Below is a simple example using linear splines.

Example Linear Spline

First as a note, I am not the author of the traj library, several people have commented about issues installing on my prior blog post. Please email questions to Bobby Jones and/or Dan Nagin. The only update to note, when I updated Stata to V17 and re-installed traj, I needed to specify a “https” site instead of a “http” one:

net from https://www.andrew.cmu.edu/user/bjones/traj
net install traj, force

That is about the only potential hint I can give if you are having troubles installing the add-on module. (You could download all the files yourself and install locally I imagine as well.)

Now here I make a simple simulated dataset, with a linear bend in one group and the other a simple line (with some error for both):

clear
set more off
set seed 10
set obs 2000
generate caseid = _n
generate group = ceil(caseid/1200) - 1

forval i = 1/10 {
  generate y_`i' = 5 + 0.5*`i' + rnormal(0,0.2) if group==0
  replace y_`i' = 2 + 0.5*`i' + rnormal(0,0.4) if group==1 & `i'<6
  replace y_`i' = 4.5 + -0.2*(`i'-5) + rnormal(0,0.4) if group==1 & `i'>=6
}

* Check out the trajectories
preserve
reshape long y_ , i(caseid) j(t)
graph twoway scatter y t, c(L) msize(tiny) mcolor(gray) lwidth(vthin) lcolor(gray)

So now we are going to make our spline variables. It is actually easier in long format than wide. Here I do a linear spline, although in Stata you can do restricted cubic splines as well (just google “mkspline Stata” to get the nice PDF docs). So I make the two spline variables, and then just reshape back into wide format:

mkspline s1_ 5 s2_=t, marginal
reshape wide y_ s1_ s2_, i(caseid) j(t)

And now we can estimate our trajectory model, including our additional s2_* linear spline as a time-varying covariate:

traj, var(y_*) indep(s1_*) model(cnorm) min(0) max(20) order(1 1) sigmabygroup tcov(s2_*)

And here is the output:

==== traj stata plugin ====  Jones BL  Nagin DS,  build: Mar 17 2021
 
2000 observations read.
2000 observations used in the trajectory model.
 
                        Maximum Likelihood Estimates
                        Model: Censored Normal (cnorm)

                                       Standard       T for H0:
 Group   Parameter        Estimate        Error     Parameter=0   Prob > |T|
 
 1       Intercept         1.99290      0.01432         139.183       0.0000
         Linear            0.50328      0.00395         127.479       0.0000
         s2_1             -0.70311      0.00629        -111.781       0.0000
 
 2       Intercept         4.98822      0.00581         859.016       0.0000
         Linear            0.50310      0.00160         314.224       0.0000
         s2_1             -0.00382      0.00255          -1.498       0.1341
 
 1       Sigma             0.40372      0.00319         126.466       0.0000
 2       Sigma             0.20053      0.00129         154.888       0.0000
 
  Group membership
 1       (%)              40.00001      1.09566          36.508       0.0000
 2       (%)              59.99999      1.09566          54.761       0.0000
 
 BIC= -3231.50 (N=20000)  BIC= -3221.14 (N=2000)  AIC= -3195.94  ll=  -3186.94

 Entropy = unknown function /log()

Not sure about the unknown function log error message. But otherwise is what we expected. Surprisingly to me trajplot works out of the box:

trajplot

I figured I would need to write matrix code to get this plot, but no issues! If you are wondering why is the estimated spline effect here ~-0.7 instead of -0.2 (how I simulated the data), it is because these are cumulative effects, not independent ones. To get the -0.2, you can do post-hoc Wald tests:

matrix list e(b)
lincom _b[s2_1G1] + _b[linear1]

Which gives as output:

. matrix list e(b)

e(b)[1,9]
       interc1     linear1      s2_1G1     interc2     linear2      s2_1G2      sigma1      sigma2      theta2
y1   1.9928963   .50327955  -.70311464   4.9882231   .50310322  -.00382206   .40372185   .20052795   .40546454

. 
. lincom _b[s2_1G1] + _b[linear1]

 ( 1)  linear1 + s2_1G1 = 0

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |  -.1998351    .003097   -64.52   0.000    -.2059051    -.193765
------------------------------------------------------------------------------

While this is for a linear spline, more complicated cubic splines (or whatever you want, you could do cyclic splines if you have repeating data for example) will just result in more variables going into the tcov() parameters.

Note a major limitation of this, you need to specify the spline knot locations upfront. The method does not do a search for the knots – here I needed to tell mkspline to set a knot location at t=5. I think this is an OK limitation, as specifying knots at regular locations along the domain of the data can approximate many non-linear functions, and it is not much different than assuming a specific number of polynomials (but will tend to behave better in the tails).

Alternate models search for the changepoints themselves, but that would be tough to extend to mixture models (quite a few unknown parameters), better to either have flexible fixed splines, or just go for a continuous multi-level model formulation instead of discrete mixture classes.

Graphing Spline Predictions in SPSS

I might have around 10 blog posts about using splines in regression models – and you are about to get another. Instead of modeling non-linear effects via polynomial terms (e.g. including x^2, x^3 in a model, etc.), splines are a much better default procedure IMO. For a more detailed mathy exposition on splines and a walkthrough of the functions, see my class notes.

So I had a few questions about applying splines in generalized linear models and including control variables in my prior post (on a macro to estimate the spline terms). These include can you use them in different types of generalized linear models (yes), can you include other covariates into the model (yes). For either of those cases, interpreting the splines are more difficult though. I am going to show an example here of how to do that.

Additionally I have had some recent critiques of my paper on CCTV decay effects. One is that the locations of the knots we chose in that paper is arbitrary. So while that is true, one of the reasons I really like splines is that they are pretty robust – you can mis-specify the knot locations, and if you have enough of them they will tend to fit quite a few non-linear functions. (Also a note on posting pre-prints, despite being rejected twice and under review for around 1.5 years, it has over 2k downloads and a handful of citations. The preprint has more downloads than my typical published papers do.)

So here I am going to illustrate these points using some simulated data according to a particular logistic regression equation. So I know the true effect, and will show how mis-located spline knots still recovers the true effect quite closely. This example is in SPSS, and uses my macro on estimating spline basis.

Generating Simulated Data

So first in SPSS, I define the location where I am going to save my files. Then I import my Spline macro.

* Example of splines for generalized linear models 
* and multiple variables.

DATASET CLOSE ALL.
OUTPUT CLOSE ALL.

* Spline Macro.
FILE HANDLE macroLoc /name = "C:\Users\andre\OneDrive\Desktop\Spline_SPSS_Example".
INSERT FILE = "macroLoc\MACRO_RCS.sps".

Second, I create a set of synthetic data, in which I have a linear changepoint effect at x = 0.42. Then I generate observations according to a particular logistic regression model, with not only the non-linear X effects, but also two covariates Z1 (a binary variable) and Z2 (a continuous variable).

*****************************************************.
* Synthetic data.
SET SEED = 10.
INPUT PROGRAM.
LOOP Id = 1 to 10000.
END CASE.
END LOOP.
END file.
END INPUT PROGRAM.
DATASET NAME Sim.

COMPUTE X = RV.UNIFORM(0,1).
COMPUTE #Change = 0.42.
DO IF X <= #Change.
  COMPUTE XDif = 0.
ELSE.
  COMPUTE XDif = X - #Change.
END IF.
COMPUTE Z1 = RV.BERNOULLI(0.5).
COMPUTE Z2 = RV.NORMAL(0,1).  

DEFINE !INVLOGIT (!POSITIONAL  !ENCLOSE("(",")") ) 
1/(1 + EXP(-!1))
!ENDDEFINE.

*This is a linear changepoint at 0.42, other variables are additive.
COMPUTE ylogit = 1.1 + -4.3*x + 2.4*xdif + -0.4*Z1 + 0.2*Z2.
COMPUTE yprob = !INVLOGIT(ylogit).
COMPUTE Y = RV.BERNOULLI(yprob).
*These are variables you won't have in practice.
ADD FILES FILE =* /DROP ylogit yprob XDif.
FORMATS Id (F9.0) Y Z1 (F1.0) X Z2 (F3.2).
EXECUTE.
*****************************************************.

Creating Spline Basis and Estimating a Model

Now like I said, the correct knot location is at x = 0.42. Here I generate a set of regular knots over the x input (which varies from 0 to 1), at not the exact true value for the knot.

!rcs x = X loc = [0.1 0.3 0.5 0.7 0.9].

Now if you look at your dataset, there are 3 new splinex? variables. (For restricted cubic splines, you get # of knots - 2 new variables, so with 5 knots you get 3 new variables here.)

We are then going to use those new variables in a logistic regression model. We are also going to save our model results to an xml file. This allows us to use that model to score a different dataset for predictions.

GENLIN Y (REFERENCE=0) WITH X splinex1 splinex2 splinex3 Z1 Z2 
  /MODEL X splinex1 splinex2 splinex3 Z1 Z2 
      INTERCEPT=YES DISTRIBUTION=BINOMIAL LINK=LOGIT
  /OUTFILE MODEL='macroLoc\LogitModel.xml'. 

And if we look at the coefficients, you will see that the coefficients look offhand very close to the true coefficients, minus splinex2 and splinex3. But we will show in a second that those effects should be of no real concern.

Generating New Data and Plotting Predictions

So you should do this in general with generalized linear models and/or non-linear effects, but to interpret spline effects you can’t really look at the coefficients and know what those mean. You need to make plots to understand what the non-linear effect looks like.

So here in SPSS, I create a new dataset, that has a set of regularly sampled locations along X, and then set the covariates Z1=1 and Z2=0. These set values you may choose to be at some average, such as mean, median, or mode depending on the type of covariate. So here since Z1 can only take on values of 0 and 1, it probably doesn’t make sense to choose 0.5 as the set value. Then I recreate my spline basis functions using the exact sample macro call I did earlier.

INPUT PROGRAM.
LOOP #xloc = 0 TO 300.
  COMPUTE X = #xloc/300.
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Fixed.
COMPUTE Z1 = 1.
COMPUTE Z2 = 0.
EXECUTE.
DATASET ACTIVATE Fixed.

*Redoing spline variables.
!rcs x = X loc = [0.1 0.3 0.5 0.7 0.9].

Now in SPSS, we score this dataset using our prior model xml file we saved. Here this generates the predicted probability from our logistic model.

MODEL HANDLE NAME=LogitModel FILE='macroLoc\LogitModel.xml'. 
COMPUTE PredPr = APPLYMODEL(LogitModel, 'PROBABILITY', 1).
EXECUTE.
MODEL CLOSE NAME=LogitModel.

And to illustrate how close our model is, I generate what the true predicted probability should be based on our simulated data.

*Lets also do a line for the true effect to show how well it fits.
COMPUTE #change = 0.42.
DO IF X <= #change.
  COMPUTE xdif = 0.
ELSE.
  COMPUTE xdif = (X - #change).
END IF.
EXECUTE.
COMPUTE ylogit = 1.1 + -4.3*x + 2.4*xdif + -0.4*Z1 + 0.2*Z2.
COMPUTE TruePr = !INVLOGIT(ylogit).
FORMATS TruePr PredPr X (F2.1).
EXECUTE.

And now we can put these all into one graph.

DATASET ACTIVATE Fixed.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X PredPr TruePr
  /FRAME INNER=YES
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: PredPr=col(source(s), name("PredPr"))
  DATA: TruePr=col(source(s), name("TruePr"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Prob"))
  SCALE: cat(aesthetic(aesthetic.shape), map(("PredPr",shape.solid),("TruePr",shape.dash)))
  ELEMENT: line(position(X*PredPr), shape("PredPr"))
  ELEMENT: line(position(X*TruePr), shape("TruePr")) 
END GPL.

So you can see that even though I did not choose the correct knot location, my predictions are nearly spot on with what the true probability should be.

Generating Predictions Over Varying Inputs

So in practice you can do more complicated models with these splines, such as allowing them to vary over different categories (e.g. interactions with other covariates). Or you may simply want to generate predicted plots such as above, but have a varying set of inputs. Here is an example of doing that; for Z1 we only have two options, but for Z2, since it is a continuous covariate we sample it at values of -2, -1, 0, 1, 2, and generate lines for each of those predictions.

*****************************************************.
* Can do the same thing, but vary Z1/Z2.

DATASET ACTIVATE Sim.
DATASET CLOSE Fixed.

INPUT PROGRAM.
LOOP #xloc = 0 TO 300.
  LOOP #z1 = 0 TO 1.
    LOOP #z2 = -2 TO 2.
      COMPUTE X = #xloc/300.
      COMPUTE Z1 = #z1.
      COMPUTE Z2 = #z2.
      END CASE.
    END LOOP.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Fixed.
EXECUTE.
DATASET ACTIVATE Fixed.

*Redoing spline variables.
!rcs x = X loc = [0.1 0.3 0.5 0.7 0.9].

MODEL HANDLE NAME=LogitModel FILE='macroLoc\LogitModel.xml'. 
COMPUTE PredPr = APPLYMODEL(LogitModel, 'PROBABILITY', 1).
EXECUTE.
MODEL CLOSE NAME=LogitModel.

FORMATS Z1 Z2 (F2.0) PredPr X (F2.1).
VALUE LABELS Z1
  0 'Z1 = 0'
  1 'Z1 = 1'.
EXECUTE.

*Now creating a graph of the predicted probabilities over various combos.
*Of input variables.
DATASET ACTIVATE Fixed.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X PredPr Z1 Z2
  /FRAME INNER=YES
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: PredPr=col(source(s), name("PredPr"))
  DATA: TruePr=col(source(s), name("TruePr"))
  DATA: Z1=col(source(s), name("Z1"), unit.category())
  DATA: Z2=col(source(s), name("Z2"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Predicted Probability"))
  GUIDE: axis(dim(3), opposite())
  GUIDE: legend(aesthetic(aesthetic.color), label("Z2"))
  SCALE: cat(aesthetic(aesthetic.color), map(("-2",color."8c510a"),("-1",color."d8b365"),
               ("0",color."f6e8c3"), ("1",color."80cdc1"), ("2",color."018571")))
  ELEMENT: line(position(X*PredPr*Z1), color(Z2))
END GPL.
*****************************************************.

So between all of these covariates, the form of the line does not change much (as intended, I simulated the data according to an additive model).

If you are interested in drawing more lines for Z2, you may want to use a continuous color scale instead of a categorical one (see here for a similar example).