For my advanced research design course this semester I have been providing code snippets in Stata and R. This is the first time I’ve really sat down and programmed extensively in Stata, and this is a followup to produce some of the same plots and model fit statistics for group based trajectory statistics as this post in R. The code and the simulated data I made to reproduce this analysis can be downloaded here.

First, for my own notes my version of Stata is on a server here at UT Dallas. So I cannot simply go

```
net from http://www.andrew.cmu.edu/user/bjones/traj
net install traj, force
```

to install the group based trajectory code. First I have this set of code in the header of all my do files now

```
*let stata know to search for a new location for stata plug ins
adopath + "C:\Users\axw161530\Documents\Stata_PlugIns"
*to install on your own in the lab it would be
net set ado "C:\Users\axw161530\Documents\Stata_PlugIns"
```

So after running that then I can install the `traj`

command, and Stata will know where to look for it!

Once that is taken care of after setting the working directory, I can simply load in the csv file. Here my first variable was read in as `Ã¯id`

instead of `id`

(I’m thinking because of the encoding in the csv file). So I rename that variable to `id`

.

```
*Load in csv file
import delimited GroupTraj_Sim.csv
*BOM mark makes the id variable weird
rename Ã¯id id
```

Second, the `traj`

model expects the data in wide format (which this data set already is), and has counts in `count_1`

, `count_2`

… `count_10`

. The `traj`

command also wants you to input a time variable though to, which I do not have in this file. So I create a set of `t_`

variables to mimic the counts, going from 1 to 10.

```
*Need to generate a set of time variables to pass to traj, just label 1 to 10
forval i = 1/10 {
generate t_`i' = `i'
}
```

Now we can estimate our group based models, and we get a pretty nice default plot.

```
traj, var(count_*) indep(t_*) model(zip) order(2 2 2) iorder(0)
trajplot
```

Now for absolute model fit statistics, there are the average posterior probabilities, the odds of correct classification, and the observed classification proportion versus the expected classification proportion. Here I made a program that I will surely be ashamed of later (I should not brutalize the data and do all the calculations in matrix), but it works and produces an ugly table to get us these stats.

```
*I made a function to print out summary stats
program summary_table_procTraj
preserve
*updating code to drop missing assigned observations
drop if missing(_traj_Group)
*now lets look at the average posterior probability
gen Mp = 0
foreach i of varlist _traj_ProbG* {
replace Mp = `i' if `i' > Mp
}
sort _traj_Group
*and the odds of correct classification
by _traj_Group: gen countG = _N
by _traj_Group: egen groupAPP = mean(Mp)
by _traj_Group: gen counter = _n
gen n = groupAPP/(1 - groupAPP)
gen p = countG/ _N
gen d = p/(1-p)
gen occ = n/d
*Estimated proportion for each group
scalar c = 0
gen TotProb = 0
foreach i of varlist _traj_ProbG* {
scalar c = c + 1
quietly summarize `i'
replace TotProb = r(sum)/ _N if _traj_Group == c
}
gen d_pp = TotProb/(1 - TotProb)
gen occ_pp = n/d_pp
*This displays the group number [_traj_~p],
*the count per group (based on the max post prob), [countG]
*the average posterior probability for each group, [groupAPP]
*the odds of correct classification (based on the max post prob group assignment), [occ]
*the odds of correct classification (based on the weighted post. prob), [occ_pp]
*and the observed probability of groups versus the probability [p]
*based on the posterior probabilities [TotProb]
list _traj_Group countG groupAPP occ occ_pp p TotProb if counter == 1
restore
end
```

This should work after any model as long as the naming conventions for the assigned groups are `_traj_Group`

and the posterior probabilities are in the variables `_traj_ProbG*`

. So when you run

`summary_table_procTraj`

You get this ugly table:

| _traj_~p countG groupAPP occ occ_pp p TotProb | |-----------------------------------------------------------------------| 1. | 1 103 .9379318 43.57342 43.60432 .2575 .2573645 | 104. | 2 136 .9607258 47.48513 48.30997 .34 .3361462 | 240. | 3 161 .9935605 229.0413 225.2792 .4025 .4064893 |

The `groupAPP`

are the average posterior probabilities – here you can see they are all quite high. `occ`

is the odds of correct classification, and again they are all quite high. Update: Jeff Ward stated that I should be using the weighted posterior proportions for the OCC calculation, not the proportions based on the max. post. probability (that I should be using `TotProb`

in the above table instead of `p`

). So I have updated to include an additional column, `occ_pp`

based on that suggestion. I will leave `occ`

in though just to keep a paper trail of my mistake.

`p`

is the proportion in each group based on the assignments for the maximum posterior probability, and the `TotProb`

are the expected number based on the sums of the posterior probabilities. `TotProb`

should be the same as in the Group Membership part at the bottom of the `traj`

model. They are close (to 5 decimals), but not exactly the same (and I do not know why that is the case).

Next, to generate a plot of the individual trajectories, I want to reshape the data to long format. I use `preserve`

in case I want to go back to wide format later and estimate more models. If you need to look to see how the `reshape`

command works, type `help reshape`

at the Stata prompt. (Ditto for help on all Stata commands.)

```
preserve
reshape long count_ t_, i(id)
```

To get the behavior I want in the plot, I use a scatter plot but have them connected via `c(L)`

. Then I create small multiples for each trajectory group using the `by()`

option. Before that I slightly jitter the count data, so the lines are not always perfectly overlapped. I make the lines thin and grey — I would also use transparency but Stata graphs do not support this.

```
gen count_jit = count_ + ( 0.2*runiform()-0.1 )
graph twoway scatter count_jit t_, c(L) by(_traj_Group) msize(tiny) mcolor(gray) lwidth(vthin) lcolor(gray)
```

I’m too lazy at the moment to clean up the axis titles and such, but I think this plot of the individual trajectories should always be done. See *Breaking Bad: Two Decades of Life-Course Data Analysis in Criminology, Developmental Psychology, and Beyond* (Erosheva et al., 2014).

While this fit looks good, this is not the correct number of groups given how I simulated the data. I will give those trying to find the right answer a few hints; none of the groups have a higher polynomial than 2, and there is a constant zero inflation for the entire sample, so `iorder(0)`

will be the correct specification for the zero inflation part. If you take a stab at it let me know, I will fill you in on how I generated the simulation.

## mario spiezio

/ April 19, 2018Dear Prof Wheeler, thanks a lot for the code. It is really helpful. I was wondering whether it is possible to use margins after after traj to compute average marginal effects of the covariates included in risk(). Thank you.

## apwheele

/ April 19, 2018I don’t think so, it appears you will need to calculate that effect yourself. I’m not sure exactly how you would do it with the mixture model.

## Dimi

/ May 7, 2018Dear Prof Wheeler,

Thank you for the code.

For my research, I am trying to use the traj plugin for a survival analysis. As I am not a statistician (I am a physician) you can imagine I need a little help.

My initial issue is that, although I do not have missing values in the long format, in the wide format (due to the fact that some patients died during the follow-up) I do. When I run the code, these patients with “missing values” are dropped from the analysis. Do you know how to overcome this problem?

All my best,

Dimi

## apwheele

/ August 13, 2018I am not sure the best way to deal with missing data in such models.

## lori

/ August 12, 2018can you covary outcome group when you have a risk variable?

## apwheele

/ August 13, 2018Not exactly sure what you mean — are you asking about a joint trajectory model with multiple outcomes and predicting probability of being in a particular trajectory?

Also to see various examples you can type

help traj

into Stata and it provides various examples.

## lori

/ August 13, 2018I have an assessment that is completed across 4 different time points – 12, 18, 24, and 36 months. I am putting scores on a different assessment completed at 6 months as a risk variable. Can I also covary outcome group along with this risk marker to control for diagnosis at 36 months – the outcome group)?

## apwheele

/ August 13, 2018I think that would just be something like

traj, var(y1-y4) model(logit) order(2 2 2 2) risk(pre)

if I am understanding correctly. [Sorry if I am not!]

## lori

/ August 13, 2018My formulas is traj, var(abc*) indep(time*) model(cnorm) min(53) max(142) order(3) risk(mu6cmss)

When i try to covary for outcome group at the last time point, I get:

. traj, var(abc*) indep(time*) model(cnorm) min(53) max(142) order(3) risk(mu6cmss) cov(dxgroup)

option cov() not allowed

r(198);

. traj, var(abc*) indep(time*) model(cnorm) min(53) max(142) order(3) risk(mu6cmss) tcov(dxgroup)

The number of variables in tcov1 and var1 must match or be a multiple.

r(198);

var has 4 time points, dxgroup has 1 time point

## apwheele

/ August 13, 2018tcov is for time-varying covariates. If it is not time varying, it can be used to predict what trajectory group a person in likely to be in, but not the shape of the trajectory over time. Not time-varying covariates should probably go in the “risk()” option in most cases then.

What exactly is “dxgroup”?

## lori

/ August 13, 2018dxgroup is the outcome group. We collect data at 12, 18, 24, and 36 months. Then at 36 months, we do a diagnostic check to see if any of our kids end up with a diagnosis of autism. I want to be able to add this into my formula to covariate out any effects of outcome group to check if my predictor variable (an assessment completed at 6 months of age) can predict trajectory membership above and beyond outcome group.

## apwheele

/ August 13, 2018Given that description your formula should be:

traj, var(abc*) indep(time*) model(cnorm) min(53) max(142) order(3) risk(mu6cmss dxgroup)

with only one group though this is probably not identified. So you will need to change it to something like:

“order(2 2)” for two trajectory groups or

“order(2 2 2)” for three trajectory groups etc.

(With only four time points I would not do a cubic, about the best you can do is a quadratic).

## lori

/ August 13, 2018If you have two risk variables, does it matter what order you put them in, because if i put the 6 month assessment first followed by dxgroup, i get different information.

## apwheele

/ August 13, 2018If you mean should

“risk(mu6cmss dxgroup)”

give a different result than

“risk(dxgroup mu6cmss)”

it should not. One caveat is that the labelling of the trajectory groups is arbitrary, so they could in theory converge to the same estimates, but different groups get different labels (and subsequently the coefficients predicting trajectory membership may then be altered).

Given that the mixture models have difficulty converging that could also be a culprit as well.

This is a bit awkward to give so many comments, just send me an email and it will be easier for me to respond.

## Yong Kyu Lee

/ August 16, 2018Thank you for the thorough review.

I had a great help from it.

I usually use SAS, so I am not familiar with STATA, but by some political reason I have to do GBTM in STATA.

And I am wondering a way to get a file, as in ‘SAS ods output’, of the result of this GBTM which is ID and designated group by posterior probability.

## apwheele

/ August 17, 2018You can just save the results to a csv file. After you run the traj command, something like:

********************

preserve

keep ID _traj*

outsheet using “TrajResults.csv:, replace comma

restore

********************

## Risha Gidwani-Marszowski

/ July 18, 2019Thanks so much for this very helpful walkthrough.

I am running the following code with the -traj- program in Stata looking at monthly cost trajectories. Here, t_1-t_12 represent 12 months:

*********************************************************************************************

traj, var(total_cost*) indep(t_*) model(cnorm) min(0) max(748938.3) order (2 2 2)

**********************************************************************************************

I get the following error:

“total_cost2 is not within min1 = 0 and max1 = 748938.3”

The problem is, the maximum value for total_cost2 is actually $748,938.3. So the error doesn’t make sense to me. If I increase the max to 748938.4, I get another error of “unrecognized command.”

Question: How can I avoid getting the top error?

I tried making the “min” and “max” values that exceed the range of the actual cost values in the entire 12-month dataset and that didn’t work.

I tried specifying min(0) and min1…min(12), with the values for min 1…12 being the actual maximum values in the dataset for those months, and got an error for max(7), suggesting I cannot specify past max(6).

## apwheele

/ July 22, 2019Sorry for the late approval — vacation last week! I am not sure about that error, the code looks correct to me.

If it is too large of numbers for whatever reason, you can divide by all the dependent variables by a constant (e.g. divide by 10,000). If that does not work I might send Bobby Jones an email with your data and a reproducible example and see if he can give any advice.

## A.Carlsen

/ September 11, 2019I have the exact same problem. Did you find a solution to the problem?

## Heine Strand

/ August 9, 2019Hi,

I wonder about the BIC and AIC stats in traj and how to decide the number of groups and slopes based on them. We run a model where we follow dementia patients’ cognition over 8 years. A model with two groups with zero slope provided this stats:

0,0: BIC= -3980.05 (N=1554) BIC= -3977.56 (N=449) AIC= -3969.35 ll= -3965.35

While the more reasonable model would be 2 groups with (1,1) or (2,2). For the former, the stats are:

1,1: BIC= -3679.76 (N=1554) BIC= -3676.04 (N=449) AIC= -3663.72 ll= -3657.72

As I understand, lower BIC and AIC values are preferable. Here the 0,0 model has lowest BIC (-3977.56) compared to the 1,1 model (-3676.04), and thereby the model to be preferred? This goes against our expectation of the progression, and the trajplot cleary suggest the 1,1 model to give a better fit. Should we pick the model with smallest absolute BIC/AIC? I see several authors have used this strategy, even if the literature seems to suggest that lowest values are best:

https://stats.stackexchange.com/questions/84076/negative-values-for-aic-in-general-mixed-model.

Can you help us?

## apwheele

/ August 9, 2019Yeah it is confusing, most people report AIC/BIC in positive numbers, so a lower value is better fit. Nagin and company always report it the opposite in their software with negative BIC/AIC values, so a higher value (closer to zero) is better.

I’ve never sat down and figured out why different folks do it differently!

So your perceptions match up with your results.

## Heine Strand

/ August 9, 2019Thanks, really reassuring! Saved my day 🙂

## Hyeonmi

/ September 18, 2019Thanks for your posting. For comparing the results, how can I get a Relative Entropy?

## apwheele

/ September 19, 2019I haven’t seen reference to that one in this context — that would be looking at say if you did a mixture of 3 groups vs a mixture of 4 groups? Or are you asking something different?

## Viviane

/ September 26, 2019Hi Andrew,

Thanks a lot for your helpful post!

I wonder if the do.file above to generate [groupAPP occ TotProb , etc] can be also applied straightly for logistic models (binary variables) – without any modification.

Many thanks in advance,

Viviane Straatmann

## apwheele

/ September 26, 2019It takes advantage of standardized variable names post the traj command. In particular _traj_ProbG* stores the posterior probabilities for each mixture, and _traj_Group stores a integer label for each group.

So if you say you did something like below I think it will work:

****************************

logit y x

predict _traj_ProbG1

gen _traj_ProbG2 = 1 – _traj_ProbG1

gen _traj_Group = (_traj_ProbG1 < 0.5) + 1

summary_table_procTraj

****************************

Where group 1 is the positive class, and group 2 is the 0 class. I haven't given any thought though as to whether this makes sense in this context.

## Viviane

/ September 27, 2019Thanks for replying!

I’m using the command below to generate the groups (i’m still working on decisions of trajectory shapes and number of groups):

traj, model(logit) var(poverty_5-poverty_18) indep(year_1-year_5) order(1 1 1)

From this I get the _traj_Group _traj_ProbG1 _traj_ProbG2 _traj_ProbG3, and then I am running your do.file (below), that seems to be working fine.

However, I wanna check with you if the calculation used in your program (tested in your example above with a zip model) also works in a logit model.

program summary_table_procTraj_VSS2

preserve

*now lets look at the average posterior probability

gen Mp = 0

foreach i of varlist _traj_ProbG* {

replace Mp = `i’ if `i’ > Mp

}

sort _traj_Group

*and the odds of correct classification

by _traj_Group: gen countG = _N

by _traj_Group: egen groupAPP = mean(Mp)

by _traj_Group: gen counter = _n

gen n = groupAPP/(1 – groupAPP)

gen p = countG/ _N

gen d = p/(1-p)

gen occ = n/d

*Estimated proportion for each group

scalar c = 0

gen TotProb = 0

foreach i of varlist _traj_ProbG* {

scalar c = c + 1

quietly summarize `i’

replace TotProb = r(sum)/ _N if _traj_Group == c

}

gen d_pp = TotProb/(1 – TotProb)

gen occ_pp = n/d_pp

*This displays the group number [_traj_~p],

*the count per group (based on the max post prob), [countG]

*the average posterior probability for each group, [groupAPP]

*the odds of correct classification (based on the max post prob group assignment), [occ]

*the odds of correct classification (based on the weighted post. prob), [occ_pp]

*and the observed probability of groups versus the probability [p]

*based on the posterior probabilities [TotProb]

list _traj_Group countG groupAPP occ occ_pp p TotProb if counter == 1

restore

end

summary_table_procTraj_VSS2

Thanks again for you support!

Kind regards,

Viviane

## apwheele

/ September 27, 2019Sorry I misinterpreted — yes it works the same no matter what the link function for the outcome you are using is. Those stats are all about the latent classes/mixtures, they aren’t tied directly the what the nature of the outcome is.

## Viviane

/ September 29, 2019Many thanks, Andrew! 🙂

## sylvia

/ November 7, 2019Dear Prof Wheeler,

Thanks for this post. It is very helpful.

I have a question about starting values for maximum likelihood estimation in stata?

Also, how do I estimate entropy?

Thanks.

## apwheele

/ November 7, 2019Sorry, won’t be able to help on that end so much. The help has an example of providing starting values. Had another person ask about Entropy — not sure exactly what that means in this context (so if I had an example could likely show how to program it, but need something to go off of).

## Priscila

/ November 18, 2019Dear Prof Wheeler. Thank you very much for this post. It was very useful!

I’m working on a 6 points time trajectories with missing data in 5 of them. I’ve heard that traj command can automatically account for the missings but after I ran the analysis (traj command plus your commands for absolute model fit), I realized that something must be wrong. The TotProb is totally different from the p, I´ve also realized that Totprob corresponds exactly to the proportions in the plot (graph), and the p values are identical to those I get from “tab traj_group”. My concern is that that the proportions plotted are totally different from those for traj_group. Do you have any idea why this is happening? Do I have to specify something relating to missings? Thank you very much!

## apwheele

/ November 18, 2019That isn’t per se a problem with missing data. “p” is the assignment based on the maximum posterior probability, whereas TotProb is based on the weights assigned for each group. Them not being close suggests a potential problem with absolute fit of the model.

I haven’t personally had a problem with that situation. It may be you have groups that are very similar, so are often near 50% in their group assignment. So in my table if the “GroupAPP” is very low that would also be a sign. So it may mean you need fewer groups I would guess.

## Fanz

/ November 26, 2019Hi Prof Wheeler,

I am using the GBTM for the analysis and I’ve found the optimal number of four trajectories, all with a three-order polynomial. I plotted the graph and I am wondering how I could smooth the graph.

Thanks.

## apwheele

/ November 26, 2019I don’t understand — the graphs should be pretty smooth already! (Four cubic functions)

## Priscila Weber

/ February 18, 2020Hi Professor Wheeler!

I want to use weighted multinomial logistic regression (weights based on posterior probabilities). I know it has been done with latent class analysis which gives you the probabilities at an individual level. Do you know how I can do it with GBTM? I haven’t found the commands in Stata, so I have no idea how to include the weights in the regression! Thank you very much for your attention and shared knowledge

## apwheele

/ February 18, 2020Not 100% sure what you mean — you mean you want to predict the probability of belonging to a latent class based on time-invariant characteristics? (Or something else?)

## Priscila Weber

/ February 19, 2020Somenthing Else! I already have my trajectories (wheezing trajectories) and also the posterior probabilities of assignment for each group. Now I want to investigate the associations of these trajectories with early risk factors using multinomial logistic regression. However I would like to weight these analyses with the posterior probabilities. It seems that researches are doing this as a way to “account” for uncertainty of assignment. Just to be more “accurate”. It seems that is possible to do it, cause I’ve read it in a paper, but I couldnt find a way to include the weights in the regression! Thank you so much!

## apwheele

/ February 22, 2020I am pretty sure you are talking about a model to say why a particular row belongs to a latent group. Check out the “Time-Stable Covariates” section in this document, https://ssrc.indiana.edu/doc/wimdocs/2013-03-29_nagin_trajectory_stata-plugin-info.pdf.

You put in your early risk factors into the “risk(x1 x2 )” part of the function.

## Martin CHETEU

/ April 1, 2020Hi professor I am not able to install traj in my Stata 13.here is the message I got when trying to install. I have tried to set the time out but the result is the same. need help plese

net install traj, force

connection timed out — see help r(2) for troubleshooting

could not load traj.pkg from http://www.andrew.cmu.edu/user/bjones/traj/

r(2);

. set timeout1 60

. net install traj, force

connection timed out — see help r(2) for troubleshooting

could not load traj.pkg from http://www.andrew.cmu.edu/user/bjones/tra

## apwheele

/ April 1, 2020See if you can download the files yourself first and then install locally. It could just be a server issue at the time you did it.

## Safiya Alshibani

/ June 18, 2020Thanks a lot for the example and codes,

I see that some have asked you about the “entropy”. According to the is paper it’s a suggested measure to assess the discrimination between the groups (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4254619/)

I tried to calculate the entropy for a model with 3 classes as follow (Excuse the primitive way of coding)

gen log_grob1 = ln(_traj_ProbG1)

gen log_grob2 = ln(_traj_ProbG2)

gen log_grob3 = ln(_traj_ProbG3)

gen total_row= (-1*(_traj_ProbG1*log_grob1 + ///

_traj_ProbG2*log_grob2 + ///

_traj_ProbG3*log_grob3)/ln(3))

bys _traj_Group : summ total_row

Q1: do you think my approach of calculating this “entropy” correct ?!

Q2: Also I’m not sure about the threshold, the paper mentioned above suggest a lower value of entropy is better, and other empirical papers suggest higher values are better (i.e. https://journals.sagepub.com/doi/full/10.1177/0081246317721600

https://psycnet.apa.org/record/2010-22993-004

https://link.springer.com/article/10.1007/s00420-017-1277-0)

## apwheele

/ June 18, 2020So just reading the pubmed paper, your gen total_row works out as the individual person measure of entropy. For that measure they do not do any group metrics, that one is just for individuals, so the by part doesn’t make sense.

Haven’t read the other papers. Lower values are definitely better, but the pubmed article gives examples where it doesn’t work out so well. (If folks are saying higher values are better, make sure they are calculating the negative in front of the entropy measure, if not then closer to 0 will be better.)

It would take more work to implement the pubmed article suggested hypothesis testing approach. So I won’t be taking that on anytime soon.

## Hilary Caldwell

/ July 14, 2020Hi! Thank you for this post, very helpful for getting my trajectories set up. I have two questions:

1) I’m having trouble with the summary stats. I get an error after entering this code:

program summary_table_procTraj

Am I missing something??? Do you enter this after running the model?

2) Do you have experience using an outcome with TRAJ? I’m looking for some help with this.

## apwheele

/ July 14, 2020Yeah you should enter it after running the model. It scoops up the variables that the traj model creates to estimate the summary stats.

Not sure what you mean by 2, you mean a time invariant variable predicting the probability of class assignment?

## Kate

/ December 2, 2020Hi Prof Wheeler,

Thanks so much for this extremely helpful post and code. I have the same question as others regarding relative entropy. Not sure if this description helps to explain what I mean. It is from the following article: Rens van de Schoot, Marit Sijbrandij, Sonja D. Winter, Sarah Depaoli & Jeroen K. Vermunt (2017) The GRoLTS-Checklist: Guidelines for Reporting on Latent Trajectory Studies, Structural Equation Modeling: A Multidisciplinary Journal, 24:3, 451-467, DOI:10.1080/10705511.2016.1247646

The authors suggest it should be a reporting requirement for trajectory studies.

“The relative entropy is also called a measure of “fuzziness” of the derived latent

classes (Jedidi, Ramaswamy, & DeSarbo, 1993; Ramaswamy, DeSarbo, Reibstein, & Robinson, 1993). The relative entropy takes on a value of 0 when all of the posterior probabilities are equal for each subject (i.e., all participants have posterior probabilities of .33 for each of three latent classes). When each participant perfectly fits in one latent class only, the relative entropy receives a maximum value of 1, which indicates that the latent classes are completely discrete partitions. Therefore, an entropy value that is too low is cause for concern, as it implies that people or cases were not well classified, or assigned to latent classes. Thus, as stated by Celeux and Soromenho (1996) the relative entropy can be regarded as a measure of the ability of the latent trajectory model to provide a relevant partition of the data; a nice explanation is provided by Greenbaum, Del Boca, Darkes, Wang, and Goldman (2005,p. 233).”

Also, a much more basic question: how could I produce a separate trajplot for each group, so the group trajectories are visualised in separate graphs rather than the combined default graph?

Thanks in advance!

## apwheele

/ December 2, 2020For the first question, I think this will work (I do not have easy access to Stata anymore to check!). This generates an entropy measure for each individual row in your dataset, not sure how the paper you mention suggests to report this (the mean entropy for the dataset? mean per assigned group?) And you would likely also want to note what the max entropy is, log(n) where n is the number of groups, so you may want to report a normalized entropy value.

##############

gen Ent = 0

foreach i of varlist _traj_ProbG* {

replace Ent = Ent + `i’*log(`i’)

}

replace Ent = -1*Ent

##############

For the question about the graphs, I think this will work:

##############

# check out what is available

preserve

ereturn list

# This gives you the data in the plot

matrix p = e(plot1)

drop _all

svmat p, names(col)

#Plot code here!!!

#This gives you the data in wide format

#So may need to reshape into long

# Now go back to your original data

restore

##############

## Siobhan Nowakowski

/ December 23, 2020Hello, I have a very basic question.

I am able to run my GBTM using the code you provided, thank you. However, when I attempt to add risk factors, I get an error message saying “incorrect # of start values. These should be X start vales”. I receive this error message even though I am using the start values provided in the log window.

Are you at all familiar with this issue and how to get around it? Thank you for any help that you can offer

## apwheele

/ December 23, 2020Not sure about that one. Is that an error when running my program, or when estimating the GBTM model to begin with? It sounds like the latter.

## Siobhan Nowakowski

/ December 23, 2020I use the Proc Traj macro in SAS. I only encounter this problem when adding risk factors after the initial model is estimated. If I run risk factors individually, there is no issue. I may have to do that if I am unable to sort out the problem.

## apwheele

/ December 24, 2020I won’t be able to help with that. Maybe emailing the person who created it (Bobby Jones/Dan Nagin) would be your best bet.

## Dongling Luo

/ February 22, 2021Hi Prof Wheeler,

Thanks so much for sharing this code. I am wondering whether there’s any restriction for the dependent variables: should the repeated measurements be conformed to normal distribution and do I need to log-transform these data before running GBTM?

Thanks in advance and looking forward to hearing from you soon!

## apwheele

/ February 22, 2021The library has examples for different link functions. So normal is one, but you could have 0/1 data, or Poisson count data, and then just use the particular link function that works the best for your data. (So it may make sense to log-transform, or may not if it is say counts of something.)

## Dongling Luo

/ February 22, 2021But it seems that the result is different if I log-transform the non-normal distribution data (continuous variable), using the “cnorm” function. I am not sure whether I should log-transform my data (non-normal distribution). Both trajectories seem reasonable. Or could I just simply compare the BIC/AIC to choose the best one? Really appreciate your help!

## apwheele

/ February 23, 2021It looks like AIC may be feasible, although you will need to transform the results of the logged model to be comparable to the original scale, https://stats.stackexchange.com/a/343176/1036.

## Jane Brathwaite

/ May 18, 2021Hello Prof Wheeler, thank you so much for creating the traj command for Stata. I am wondering, is there any way to know which participants (IDs) are sorted into the respective trajectory groups? I can’t seem to find this info in the output.

## apwheele

/ May 18, 2021After you run the command, there should be a variable, _traj_Group, that contains the group that has the highest posterior probability for a person to belong to.

FYI I am not the one who made the traj command itself, I have only blogged about some extra stuff. It was Dan Nagin/Bobby Jones who did the hard work, https://journals.sagepub.com/doi/abs/10.1177/0049124113503141.

## Jane Braithwaite

/ May 18, 2021Thank you so much! It worked!

## diego yacaman mendez

/ June 28, 2021Hi Andrew,

I noticed you implemented a measure of Entropy in the update of traj for Stata.

How is the entropy calculated? And how should it be interpreted? The lower the better?

Thanks and regards,

## apwheele

/ June 28, 2021I am not the person who wrote the original plug in, that is Dan Nagin and Bobby Jones, https://www.andrew.cmu.edu/user/bjones/. So please consult their documentation. But typically higher entropy means things are more random.

## Jesús

/ July 28, 2021Hello Prof Wheeler,

Your code was really useful for me. I’m a young PhD student starting modeling trajectories and I’m a bit lost, so I would like to ask you some basic questions. I have three time-points and a continuous variable (it is a metric that goes from 0 to 7).

1. I was using cnorm model. Is it correct or should I use another one?

2. I do not know how to decide the number of groups and the polynomial type for each group trajectory (“order”). How do you decide this? How do you know if you have to put the 0=intercept, 1=linear, 2=quadratic, 3=cubic?

Thanks in advance,

## apwheele

/ July 29, 2021For 1 cnorm may make sense/be ok. If it is say a Likert item that only takes on integers from 0-7, some type of ordinal model may be more appropriate.

For 2, with 3 time points the maximum polynomial you can fit is a quadratic. You cannot fit a cubic with only 3 time points.

## Nandita Krishnan

/ August 13, 2021Hello, I am using GBTM to estimate dual trajectories of e-cigarette and cigarette use for my dissertation. This is super helpful! In deciding on the final model, one criterion that Nagin recommends is “close correspondence between the estimated probability of group membership and the proportion assigned to that group based on the posterior probability of group membership” – based on the table you generated, would that simply entail comparing p with TotProb?

Also, Nagin recommends “observing reasonably tight confidence intervals around estimated

group membership probabilities”. How are confidence intervals for group membership probabilities calculated? Would it simply be group membership probability (as provided in the stata output after running the model) +/- 1.96 SE?

## apwheele

/ August 13, 2021For Q1 yes it is comparing p vs TotProb in these tables.

For Q2 about the CI in membership probabilities I have never seen a calculation of that directly. So maybe a mistake on Nagin’s part? Maybe he mean high posterior probabilities in at least one group? (I have no clue how to estimate individual CI’s, I suppose you could garner that if you have an equation for the membership probabilities, but if you don’t have an equation for that I don’t know.)

## Nandita Krishnan

/ August 15, 2021Thank you very much! Another thing I am unsure about is if it is possible to have an APP of 1? I am getting that for a few models and in that case, I am unable to obtain an estimate of occ (as the denominator is 0). Is this theoretically possible or an indication that the model is not viable?

## apwheele

/ August 15, 2021It probably does suggest your model is not identified and there is some perfect separation. I have seen this for the trajectory estimates, I have never seen this for the posterior probs though.

## Alissa Burnett

/ October 6, 2021I am suddenly having issues with the Traj plug in, Stata 16 and 17 are not recognising the traj command, I get the error message “unrecognized command”

Is anyone else having this issue / have a fix for it?

## Nandita Krishnan

/ October 6, 2021I use Stata 14 and I’ve been having issues too – but a different one. where it crashes each time I try to run the traj command. I’ve been able to run it on my university’s virtual computing platform though. Not sure if it’s a Mac issue (do you use a Mac?) or an issue with the traj plug in…

## Alissa Burnett

/ October 31, 2021No I’m using a PC and have now tried 3 different computers and it is not working. Reinstalling does not fix the issue.

## apwheele

/ October 7, 2021Sorry, won’t get time to check this out for quite a bit. I would just suggest reinstalling and see if that fixes the issue, https://www.andrew.cmu.edu/user/bjones/.

## Nandita Krishnan

/ October 27, 2021Does anyone know if it’s possible to model linear splines using the traj plug-in?

## apwheele

/ October 27, 2021Since the traj plug-in allows you to incorporate covariates, you could just set the polynomial order term to 0, and include whatever spline basis as covariates in wide format.

## Nandita Krishnan

/ October 27, 2021Thank you! Would I use time varying covariates tcov? And if my time variables are coded as t_0 to t_4, and I want the spline at t_1, would I code it as tcov(t_1)?

## apwheele

/ October 27, 2021Yes pretty much, (you should likely have the full set of basis functions though, so probably need covariates to t_0, t_1, etc. to make sense). Will try to slate a blog post to illustrate this sometime.

## Nandita Krishnan

/ October 27, 2021Will play around with it, thank you! And will keep an eye out for a blog post!

## Getu

/ November 25, 2021Hello Prof Wheeler,

Your code was really useful for me. I’m a young PhD student starting modeling trajectories and I’m a bit lost, so I would like to ask you some basic questions. I have thirteen time-points data collected monthly and a continuous variable (height-for-age Z score).

1. I do not know how to decide the number of groups??

2. … And the polynomial type for each group trajectory (“order”). How do you decide this? How do you know if you have to put the 0=intercept, 1=linear, 2=quadratic, 3=cubic?

For example if we have 3 class what is our ground to make the order (1 1 1) or (2 2 2) or (2 3 1) or (3 3 3) etc…

Thanks in advance,

## apwheele

/ November 25, 2021In terms of the number of groups, you typically run the models with multiple groups (e.g. 1, 2, 3, etc.) and use the fit AIC/BIC (plus those fit statistics in this post) to identify the best model.

In terms of the polynomial, that should be dictated by your knowledge of the system. Heights are monotonically increasing I presume, and over this short of a time range I believe a linear model will be fine. If you wanted to quadratic as an *approximation* to a non-linear function (although again a function that is only monotonically increasing probably makes more sense). I have a hard time imagining a cubic though with your description.

## Getu

/ November 26, 2021Thank you! really helpful.

## Jesús martínez Gómez

/ November 26, 2021Hello Getu,

I was in the same point as you several months ago. I highly recommend you the following book: Nagin, D. (2009). Group-based modeling of development. Harvard University Press.

It is really well explained and you would find all your answers. This page website is really helpful too. Thanks Prof Wheeler!

## apwheele

/ November 26, 2021Yes agree that book is really good for an introduction — much easier to understand everything than by reading academic papers.

## Getu Gizaw

/ November 26, 2021Dear Andrew Wheeler,

Thank you very much! is it freely accessible?

## apwheele

/ November 26, 2021Not that I know of — I borrowed it from the Uni library when I was learning this stuff.

## Getu Gizaw

/ November 26, 2021oops we from a developing country ( am from Ethiopia) do not have digital care. so kindly request you to attach to me.

## Getu

/ December 19, 2021Dear Prof Wheeler,

In follow up of my above question, I am still trying to model Group based trajectory and challenged with -BIC score continues to increase as more groups are added to the model- by using Jeffreys’s scale of evidence for Bayes factors ( using dis exp(BICi−BICj) command.

I also tried Judging Model Adequacy with the following diagnostic criteria:

1. Average Posterior Probability of Assignment

2. Odds of Correct Classification

3. Estimated Group Probabilities versus the Proportion of the Sample Assigned to the Group

4. Confidence Intervals for Group Membership Probabilities

Based on the above four criteria my model is adequate/ accurate starting from having two group and onward.

So, I need your support how to select my best model?

Thank you in advance!

## apwheele

/ December 19, 2021Generally you continue to add more groups until they are visibly indistinguishable from each other, or the model itself has difficulty converging.

## Getu

/ December 31, 2021Hi prof Apwheele,

Adding more groups still increases BIC ,which makes the model complex and difficulty in model interpretation. i am thinking of convergent problem. so how i can solve this issue in traj model?

thank you!

## Sarah B

/ February 23, 2022Thank you for your useful code. Unfortunately when I go to use it in STATA i get the following:

. net from http://www.andrew.cmu.edu/user/bjones/traj

file http://www.andrew.cmu.edu/user/bjones/traj/stata.toc not found

server says file temporarily redirected but did not give new location

http://www.andrew.cmu.edu/user/bjones/traj/ either

1) is not a valid URL, or

2) could not be contacted, or

3) is not a Stata download site (has no stata.toc file).

r(601);

Any ideas what is going wrong?

## apwheele

/ February 23, 2022Try https instead of http

## Sarah B

/ February 23, 2022Thanks you that’s worked 🙂

## Sarah B

/ February 28, 2022Hi,

So I’ve downloaded the traj plug ins but I’m now faced with a “unrecognized command r(199)” error when I try to run my data.

I tried inputting BJones test data and command based on the tutorial found here but I get the same error message.

Click to access lociganic-stata-tutoria.pdf

Any ideas what’s going wrong?

Thanks, Sarah

## apwheele

/ February 28, 2022Yeah not sure about that. After install I would do `help traj` like in the linked tutorial. If that pops up I would think the install is OK. If not then the install may not be working out for whatever reason.

## Caroline

/ March 31, 2022Hi Andrew,

Nagin mentions in his book that another fit statistic would be the Confidence Intervals for Group Membership Probabilities (so TotProb). Do you know how to add 98% CI’s to the variable?

Thank you for your help and this great article!

Caroline

## apwheele

/ March 31, 2022Not 100% sure the correct confidence interval you want. You could maybe do something like:

egen loc_prob =rowmax(_traj_ProbG*)

by _traj_Group: ci means loc_prob, level(98)

## Caroline

/ April 5, 2022Thank you!!

It’s the confidence interval or equally I suppose I could use the standard deviation of group membership probabilities (TotProb) for each group. Nagin says that if the intervals are narrow (less than 0.08 plus /minus TotProb) it’s the fourth diagnostic for class enumeration.

Would you still think that the above code gives me what I’m after?

Thanks so much in advance 🙂 it’s really appreciated!

## apwheele

/ April 5, 2022Yeah that is the normal based confidence interval given the max posteriori probability for each group. (It may make sense to do a transform since restricted to 0/1, calculate CI on transformed, and then turn the CI back into probabilities, but this is really just a diagnostic, not something being used for actual NHST I believe.)

Since the values are bounded in particular ways, I don’t think this will be very diagnostic above the other stats already presented. Also note the smaller groups can have wider CIs even if the posterior probs have a similar standard deviation.

## Raza

/ April 7, 2022Hi Andrew,

I need help in understanding why some trajectories are missing in _traj_Group?

I am using your plugin to run grouped based trajectoy models in Stata. I have 19 years of follow-up and I am using the following syntax and followed is the output of the syntax

traj if group==1, var( exposure_*) indep(t_*) model(zip) order(2 2 2 2 )

Here is an outpt

==== traj stata plugin ==== Jones BL Nagin DS, build: Jan 8 2018

60998 observations read.

51588 excluded by if condition.

9410 observations used in trajectory model.

Maximum Likelihood Estimates

Model: Zero Inflated Poisson (ZIP)

Standard T for H0:

Group Parameter Estimate Error Parameter=0 Prob > |T|

1 Intercept -0.10053 0.02925 -3.437 0.0006

Linear 0.05324 0.00808 6.585 0.0000

Quadratic 0.00140 0.00041 3.404 0.0007

2 Intercept 0.04437 0.02330 1.904 0.0569

Linear 0.00554 0.00573 0.966 0.3341

Quadratic -0.00010 0.00029 -0.343 0.7316

3 Intercept 0.04439 0.00954 4.654 0.0000

Linear 0.00553 0.00228 2.430 0.0151

Quadratic -0.00010 0.00011 -0.878 0.3799

4 Intercept -0.00685 0.02381 -0.288 0.7735

Linear 0.20418 0.00532 38.369 0.0000

Quadratic -0.00746 0.00028 -26.968 0.0000

Group membership

1 (%) 8.65836 0.44359 19.519 0.0000

2 (%) 12.49223 383.21558 0.033 0.9740

3 (%) 69.66792 383.21579 0.182 0.8557

4 (%) 9.18149 0.39125 23.467 0.0000

BIC=-178850.40 (N=155079) BIC=-178829.39 (N=9410) AIC=-178775.77 L=-178760.77

When I tab _traj_Group, it does not return values for trajectory group 2

_traj_Group | Freq. Percent Cum.

————+———————————–

1 | 708 7.52 7.52

3 | 7,855 83.48 91.00

4 | 847 9.00 100.00

————+———————————–

Total | 9,410 100.00

Can you tell what can possibly be wrong here?

## apwheele

/ April 7, 2022Well, for my code above it will not be quite right with the missing data, so you could do something like:

program summary_table_procTraj2

preserve

drop if missing(_traj_Group)

….

To fix my code above for your situation.

My guess is that trajectory 2 in your example has no high posterior probabilities, so that variable assigns none of the values to that group. Eyeballing the coefficients, it is very similar to traj group 3. This just suggests you don’t need 4 groups, and 3 will suffice.

My code table above will still not generate summaries for group2 in your scenario. I would have to do some more work, as it relies on the _traj_Group variable.

## Raza

/ April 7, 2022That’s correct!

Your response has helped me to understand why STATA is returning missing values.

Thanks a lot!

## Lukas

/ May 4, 2022Hi Professor Wheeler,

Thanks for the excellent post! I have a question and would be grateful if you could have a look when you get a chance.

Here is my output:

4300 observations read.

308 had no trajectory data.

3992 observations used in the trajectory model.

But the _traj_Group variable has 4300 nonmissing values (shouldn’t this be 3992?).

How can I determine the 308 observations, and I found there were less than 308 participants who only had 0 or 1 visit. Also, should I say the trajectory is modeled based on 3992 participants or all of the 4300 participants?

Regards,

Lukas

## apwheele

/ May 4, 2022If an observation is not used in the trajectory modelling process, I believe it is not assigned posterior probabilities. So maybe filtering the set post that will illustrate what the issues are. I am not sure offhand why it would be dropping cases.

## Lukas

/ May 4, 2022Thanks Professor Wheeler. Absolutely, it is very strange that all observations, including those not used in the model, were assigned posterior probabilities. As your suggested, I have double-checked the probabilities of these unused observations and found they have the same posterior probabilities (and very low). Although I’m not sure why this happens, the low posterior probabilities may be a good indicator to identify which observations were not used.

Regards,

Lukas

## JM

/ June 2, 2022Professor may I ask if there is a solution for me because I cannot even install the package. Please help me on it.

## apwheele

/ June 3, 2022See my note at the beginning of this post, https://andrewpwheeler.com/2021/11/29/splines-in-stata-traj-models/. Try using https, or downloading the data yourself and pointing your Stata install to the local files.