All posts in category ggplot2

Age-Period-Cohort graphs for suicide and drug overdoses

When I still taught advanced research methods for PhD students, I debated on having a section on age-period-cohort (APC) analysis. Part of the reason I did not bother with that though is there were no good open source datasets (that I was aware of). A former student asking about APC analysis, as well as a recent NBER working paper on suicide rates (Marcotte & Hansen, 2023) brought it to mind again.

I initially had plans to do more modelling examples, but I decided on just showing off the graphs I generated. The graphs themselves I believe are quite informative.

So I went and downloaded mortality rates USA mortality rates for suicides and drug overdoses, spanning 1999-2022 for suicide and 1999-2021 for drug. Here is the data and R code to recreate these graphs in the post to follow along.

To follow along here in brief, we have a dataset of death and population counts, broken down by year and age:

# Age-Period-Cohort plots
library(ggplot2)

# Read in data
suicide <- read.csv('Suicides.csv')

# Calculate Rate & Cohort
suicide$Cohort <- suicide$Year - suicide$Age
suicide$Rate <- (suicide$Deaths/suicide$Population)*100000

# Suicide only 11-84
suicide <- suicide[suicide$Age >= 11,]
head(suicide)

And this produces the output:

> head(suicide)
   Age Year Deaths Population Cohort      Rate
16  11 1999     22    4036182   1988 0.5450696
17  11 2000     24    4115093   1989 0.5832189
18  11 2001     24    4344913   1990 0.5523701
19  11 2002     22    4295720   1991 0.5121377
20  11 2003     15    4254047   1992 0.3526054
21  11 2004     18    4207721   1993 0.4277850

A few notes here. 1) I limited the CDC Vital stats data to 1999, because in the Wonder dataset pre-1999 you can’t get individual year-age breakdowns, you need to do 5 year age bins. This can cause issues where you need to age-adjust within those bins (Gelman & Auerbach, 2016), that should be less of a problem with single year breakdowns. So I would go back further were it not for that. 2) When breaking down to individual years, the total count of suicides per age bracket is quite small. Initially I was skeptical of Marcotte & Hansen’s (2023) claims of LGBTQ subgroups potentially accounting for increased trends among young people (I just thought that group was too small for that to make sense), but looking at the counts I don’t think that is the case.

When I think about age-period-cohort analysis, my mind goes age effects > period effects > cohort effects. I think people often mix up cohort effects with things that are actually age effects. (And also generation labels are not real.) In criminology, the age-crime-curve was established back in the 1800’s by Quetelet.

So I focus on graphing the age curve, and look at deviations from that to try to visually identify period effects or cohort effects. Here is the plot to look at each of the age curves, broken down by year.

ap <- ggplot(data=suicide, aes(x = Age, y = Rate, color=Year, group=Year)) + 
             geom_line() +
             scale_colour_distiller(palette = "PuOr") +
             scale_x_continuous(breaks=seq(10,80,10)) +
             scale_y_continuous(breaks=seq(0,30,5)) + 
             labs(x='Age',y=NULL,title='Suicide Rate per 100,000',caption="USA Rates via CDC Wonder")
ap

When using diverging color ramps to visualize a continuous variable, you get a washed out effect in the middle. So I am not sure the best color ramp here, but it does provide a nice delineation and gradual progression from the curve in the early 2000’s compared to the suicide curve in 2022. (Also spot the one outlier year, it is age 75 for the “provisional” 2022 counts. I leave it in as it is a good showcase for how plots can help spot bad data.)

The blog the graph will be tinier, open it up in a new tab on your desktop computer to get a good look at the full size image.

Here looking at the graph you can see two things other researchers looking at similar data have discussed. In the early 2000’s, you had a gradual increase from 20’s to the peak suicide rate at mid 40’s. More recent data has shifted that peak to later ages, more like peak 55. Case & Deaton (2015) discussing deaths of despair (of which suicide is a part) focussed on this shift, and noted that females in this age category increased at a higher rate relative to males.

Marcotte & Hansen (2023) focus on the younger ages. So in the year 2000, the age-suicide curve was a gradual incline from ages early 20’s until the peak. Newer cohorts though show steeper inclines in the earlier ages, so the trend from ages 20-60 is flatter than before.

Period effects in these charts will look like the entire curve is the same shape, and it is just shifted up and down. (It may be better to graph these as log rates, but keeping on the linear scale for simplicity.) We have a bit of a shape change though, so these don’t rule out cohort effects.

Here is the same plot, but grouping by cohorts instead of years. So the age-suicide curve is indexed to the birth year for an individual:

cp <- ggplot(data=suicide, aes(x = Age, y = Rate, color=Cohort, group=Cohort)) + 
             geom_line() +
             scale_colour_distiller(palette = "Spectral") +
             scale_x_continuous(breaks=seq(10,80,10)) +
             scale_y_continuous(breaks=seq(0,30,5)) +
             labs(x='Age',y=NULL,title='Suicide Rate per 100,000',caption="USA Rates via CDC Wonder")
cp

My initial cheeky thought (not that there aren’t enough ways to do APC models already), was to use mixture models to identify discrete cohorts. Something along the lines of this in the R flexmix package (note this does not converge):

library(flexmix)
knot_loc <- c(20,35,50,65) # for ages
model <- stepFlexmix(cbind(Deaths, Population - Deaths) ~ bs(Age, knot_loc) | Cohort, 
                     model = FLXMRglm(family = "binomial", fixed = ~Year),
                     data = suicide, k = 3)

But there is an issue with this when looking at the cohort plot, you have missing data for cohorts – to do this you would need to observe the entire age-curve for a cohort. There may be a way to estimate this using latent class models in Stata (and fixing some of the unidentified spline coefficients to a fixed value), but to me just looking at the graphs I think is all I really care about. You could maybe say the orange cohorts in the late 90’s are splitting off, but I think that is consistent with period effects. (And is also a trick of the colors I used in the plot.)

You could do mixtures for the year plots, see some of the work by Elana Erosheva (Erosheva et al., 2014), but that again just isn’t how I think about APC analysis.

Doing this same exercise for drug overdoses rates, (which I not can overlap with suicide – you can commit suicide via intentionlly taking too many drugs) we can clearly see the dramatic rise in recent years. We can also see the same trends in earlier ages now being peak, but also increases and shifts to older ages.

The cohort plot here looks like a Spinosaurus crest:

Which I believe is more consistent with (very strong) period effects, not so much cohort effects. Drug overdoses are increasing across both younger and older cohorts.

Nerd Notes

These datasets don’t have covariates, which to use the APC method in Spelman (2022) you would need those (it uses covariates to estimate period effects). I am not so sure that is the best approach to APC decomposition, but it is horses for courses.

What I wish is that the CDC distributed the vital statistics data at the micro level (where each row is a death, with all of the covariates), along with a matching variable dataset of the micro level American Community Survey and the weights. That doesn’t solve the APC issue with identifying the different effects, but makes it easier to do more complicated modelling, e.g. I could fit models or generate graphs for age-gender differences more easily, decompose different death types, etc.

Final nerd note is about forecasting mortality trends. While I am familiar with the PCA-functional data approach advocated by Rob Hyndman (Hyndman & Ullah, 2007), I don’t think that will do very well with this data. I am wondering if doing some type of multi-level GAM model, and doing short term extrapolation of the period effect (check out Gavin Simpson’s posts on multi-level smooths, 1, 2, 3).

So maybe something like:

library(mgcv)
smooth_model <- gam(cbind(Deaths, Population - Deaths) ~ s(Year) + s(Age,by=Cohort), 
                    family = binomial("logit"),
                    data = suicide)

Or maybe just use s(Age,Year) and not worry about the cohort effect. Caveat emptor about this model, this is just my musings, I have not in-depth studied it to make sure it behaves well (although a quick check R does not complain when fitting it).

References

Case, A., & Deaton, A. (2015). Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century. Proceedings of the National Academy of Sciences, 112(49), 15078-15083.
Erosheva, E. A., Matsueda, R. L., & Telesca, D. (2014). Breaking bad: Two decades of life-course data analysis in criminology, developmental psychology, and beyond. Annual Review of Statistics, 1, 301-332.
Gelman, A., & Auerbach, J. (2016). Age-aggregation bias in mortality trends. Proceedings of the National Academy of Sciences, 113(7), E816-E817.
Hyndman, R. J., & Ullah, M. S. (2007). Robust forecasting of mortality and fertility rates: A functional data approach. Computational Statistics & Data Analysis, 51(10), 4942-4956.
Marcotte, D. E., & Hansen, B. (2023). The Re-Emerging Suicide Crisis in the US: Patterns, Causes and Solutions (No. w31242). National Bureau of Economic Research.
Spelman, W. (2022). How cohorts changed crime rates, 1980–2016. Journal of Quantitative Criminology, 38(3), 637-671.

The spatial dispersion of NYC shootings in 2020

If you had asked me at the start of widespread Covid lockdown measures what the effect would be on crime, I am pretty sure I would have guessed it will make crime go down. Fewer people out and about causes fewer interactions that can lead to a crime. That isn’t how it has shaped up though, quite a few places have seen increases in serious violent crime. One of the most dramatic examples of this is that shootings in NYC doubled from 900 in 2019 to over 1800 in 2020. I am going to show how to generate this chart later via some R code, but it is easier to show than to say. NYPD’s open data on shootings (historical, current) go back to 2006.

I know I am critical on this site of folks overinterpreting crime increases, for example going from 20 to 35 is pretty weak evidence of an increase given the inherent variance for low count Poisson data (a Poisson e-test has a p-value of 0.04 in that case). But going from 900 to 1800 is a much clearer signal.

Jerry Ratcliffe recently posted an R library to do his crime dispersion analysis, so I figured this would be an excellent example use case. The idea behind this analysis is spatial – we know there is a crime increase, but did the increase happen everywhere, or did it just happen in a few locations. Here I am going to use the NYPD shooting data aggregated at the precinct level to test this.

As another note, while I often use micro-spatial units of analysis in my work, this method, along with others (such as the sppt test), are just not going to work out for very low count, very tiny spatial units of analysis. I would suggest offhand to only do this analysis if the spatial units of analysis under study have an average of at least 10 crimes per area in the pre time period. Which is right about on the mark for the precinct analysis in NYC.

Here is the data and R code to follow along, below I will give a walkthrough.

Crime increase dispersion analysis in R

So first as some front matter, I load in my libraries (Jerry’s crimedispersion you can install from github via devtools, see his page for an example), and the function I define here I’ve gone over in a prior blog post of mine as well.

###############################
library(ggplot2)
library(crimedispersion)

# Increase contours, see https://andrewpwheeler.com/2020/02/21/some-additional-plots-to-go-with-crime-increase-dispersion/
make_cont <- function(pre_crime,post_crime,levels=c(-3,0,3),lr=10,hr=max(pre_crime)*1.05,steps=1000){
    #calculating the overall crime increase
    ov_inc <- sum(post_crime)/sum(pre_crime)
    #Making the sequence on the square root scale
    gr <- seq(sqrt(lr),sqrt(hr),length.out=steps)^2
    cont_data <- expand.grid(gr,levels)
    names(cont_data) <- c('x','levels')
    cont_data$inc <- cont_data$x*ov_inc
    cont_data$lines <- cont_data$inc + cont_data$levels*sqrt(cont_data$inc)
    return(as.data.frame(cont_data))
}

my_dir <- 'D:\\Dropbox\\Dropbox\\Documents\\BLOG\\NYPD_ShootingIncrease\\Analysis'
setwd(my_dir)
###############################

Now we are ready to import our data and stack them into a new data frame. (These are individual incident level shootings, not aggregated. If I ever get around to it I will do an analysis of fatality and distance to emergency rooms like I did with the Philly data.)

###############################
# Get the NYPD data and stack it
# From https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Year-To-Date-/5ucz-vwe8
# And https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Historic-/833y-fsy8
# On 2/1/2021
old <- read.csv('NYPD_Shooting_Incident_Data__Historic_.csv', stringsAsFactors=FALSE)
new <- read.csv('NYPD_Shooting_Incident_Data__Year_To_Date_.csv', stringsAsFactors=FALSE)

# Just one column off
print( cbind(names(old), names(new)) )
names(new) <- names(old)
shooting <- rbind(old,new)
###############################

Now we just want to do aggregate counts of these shootings per year and per precinct. So first I substring out the year, then use table to get aggregate counts in R, then make my nice time series graph using ggplot.

###############################
# Create the current year and aggregate
shooting$Year <- substr(shooting$OCCUR_DATE, 7, 10)
year_stats <- as.data.frame(table(shooting$Year))
year_stats$Year <- as.numeric(as.character(year_stats$Var1))
year_plot <- ggplot(data=year_stats, aes(x=Year,y=Freq)) + 
             geom_line(size=1) + geom_point(shape=21, colour='white', fill='black', size=4) +
             scale_y_continuous(breaks=seq(900,2100,by=100)) +
             scale_x_continuous(breaks=2006:2020) +
             theme(axis.title.x=element_blank(), axis.title.y=element_blank(),
                   panel.grid.minor = element_blank()) + 
             ggtitle("NYPD Shootings per Year")

year_plot
# Not quite the same as Petes, https://copinthehood.com/shooting-in-nyc-2020/
###############################

Part of the reason I do this is not because I don’t trust Pete’s analysis, but because I don’t want to embed pictures from someone elses website! So wanted to recreate the time series graph myself. So next up we need to do the same aggregating, but not for the whole city, but by each precinct. You can use the same table method again, but simply pass in additional columns. That gets you the data in long format, so then I reshape it to wide for later analysis (so each row is a single precinct and each column is a yearly count of shootings). (Note there have been some splits in precincts over the years IIRC, I don’t worry about that here, will cause it to be 0,0 in the 2019/2020 data I look at.)

###############################
#Now aggregating to year and precinct
counts <- as.data.frame(table(shooting$Year, shooting$PRECINCT))
names(counts) <- c('Year','PCT','Count')
# Reshape long to wide
count_wide <-  reshape(counts, idvar = "PCT", timevar = "Year", direction = "wide")
###############################

And now we can give Jerry’s package a test run, where you just pass it your variable names.

# Jerrys function for crime increase dispersion
output <- crimedispersion(count_wide, 'PCT', 'Count.2019', 'Count.2020')
output

The way to understand this is in a hypothetical world in which we could reduce shootings in one precinct at a time, we would need to reduce shootings in 57 of the 77 precincts to reduce 2020 shootings to 2019 levels. So this suggests very widespread increases, it isn’t just concentrated among a few precincts.

Another graph I have suggested to explore this, while taking into account the typical variance with Poisson count data, is to plot the pre crime counts on the X axis, and the post crime counts on the Y axis.

###############################
# My example contour with labels
cont_lev <- make_cont(count_wide$Count.2019, count_wide$Count.2020, lr=5)

eq_plot <- ggplot() + 
           geom_line(data=cont_lev, color="darkgrey", linetype=2, 
                     aes(x=x,y=lines,group=levels)) +
           geom_point(data=count_wide, shape = 21, colour = "black", fill = "grey", size=2.5, 
                      alpha=0.8, aes(x=Count.2019,y=Count.2020)) +
           scale_y_continuous(breaks=seq(0,140,by=10) +
           scale_x_continuous(breaks=seq(0,70,by=5)) +
           coord_cartesian(ylim = c(0, 140)) +
           xlab("2019 Shootings Per Precinct") + ylab("2020 Shootings")
eq_plot
###############################

The contour lines show the hypothesis that crime increased (by around 100% here). So if a point is near the middle line, it follows that doubled mark almost exactly. The upper/lower lines indicate the typical variance, which is a very good fit to the data here you can see. Very few points are outside the boundaries.

Both of these analyses point to the fact that shooting increases were widespread across NYC precincts. Pretty much everywhere doubled in the number of shootings, it is just some places had a larger baseline to double than others (and the data has some noise, you can pick out some places that did not increase if you cherry pick the data).

And as a final R note, if you want to save these graphs as a nice high resolution PNG, here is an example with Jerry’s dispersion object:

# Saving dispersion plot as a high res PNG
png(file = "ODI.png", bg = "transparent", height=5, width=9, units="in", res=1000, type="cairo")
output #this is the object from Jerrys crimedispersion() function earlier
dev.off()

Going forward I am wondering if there is a good way to do spatial monitoring for crime data like this, like some sort of control chart that takes into account both space and time. So isn’t retrospective a year later recap, but in near real time identify spatial increases.

Other References of Interest

Justin Nix & company have a few blog posts looking at NYC data as well. In the first they talk about the variance in cities, many are up but several are down as well in violence. A later post though updated with the clear increase in shootings in NYC.
There are too many papers at this point for me to do a bibliography of all the Covid and crime updates, but two open examples are Matt Ashby did a paper on several US cities, and Campedelli et al have an analysis of Chicago. Each show variance again, so no universal up or down in trends, but various examples of increases or decreases both between cities and between different crime types within a city.

3 Comments

by Andy Wheeler on February 2, 2021 • Permalink

Posted in Crime Analysis, Criminal Justice, ggplot2, R

Tagged NYC, Poisson, shootings

Posted by Andy Wheeler on February 2, 2021

https://andrewpwheeler.com/2021/02/02/the-spatial-dispersion-of-nyc-shootings-in-2020/

Making aoristic density maps in R

I saw Jerry the other day made/updated an R package to do aoristic analysis. A nice part of this is that it returns the weights breakdown for individual cases, which you can then make maps of. My goto hot spot map for data visualization, kernel density maps, are a bit tough to work with weighted data though in R (tough is maybe not the right word, to use ggplot it takes a bit of work leveraging other packages). So here are some notes on that.

I have provided the data/code here. It is burglaries in Dallas, specifically I filter out just for business burglaries.

R Code Snippet

First, for my front end I load the libraries I will be using, and change the working directory to where my data is located.

############################
library(aoristic) #aoristic analysis 
library(rgdal)    #importing spatial data
library(spatstat) #weighted kde
library(raster)   #manipulate raster object
library(ggplot2)  #for contour graphs
library(sf)       #easier to plot sf objects

my_dir <- "D:\\Dropbox\\Dropbox\\Documents\\BLOG\\aoristic_maps_R\\data_analysis"
setwd(my_dir)
############################

Next I just have one user defined function, this takes an input polygon (the polygon that defines the borders of Dallas here), and returns a raster grid covering the bounding box. It also have an extra data field, to say whether the grid cell is inside/outside of the boundary. (This is mostly convenient when creating an RTM style dataset to make all the features conform to the same grid cells.)

###########################
#Data Manipulation Functions

#B is border, g is size of grid cell on one side
BaseRaster <- function(b,g){
    base_raster <- raster(ext = extent(b), res=g)
    projection(base_raster) <- crs(b)
    mask_raster <- rasterize(b, base_raster, getCover=TRUE) #percentage of cover, 0 is outside
    return(mask_raster)
}
###########################

The next part I grab the datasets I will be using, a boundary file for Dallas (in which I chopped off the Lochs, so will not be doing an analysis of boat house burglaries today), and then the crime data. R I believe you always have to convert date-times when reading from a CSV (it never smartly infers that a column is date/time). And then I do some other data fiddling – Jerry has a nice function to check and make sure the date/times are all in order, and then I get rid of points outside of Dallas using the sp over function. Finally the dataset is for both residential/commercial, but I just look at the commercial burglaries here.

###########################
#Get the datasets

#Geo data
boundary <- readOGR(dsn="Dallas_MainArea_Proj.shp",layer="Dallas_MainArea_Proj")
base_Dallas <- BaseRaster(b=boundary,g=200) 
base_df <- as.data.frame(base_Dallas,long=TRUE,xy=TRUE)

#Crime Data
crime_dat <- read.csv('Burglary_Dallas.csv', stringsAsFactors=FALSE)
#prepping time fields
crime_dat$Beg <- as.POSIXct(crime_dat$StartingDateTime, format="%m/%d/%Y %H:%M:%OS")
crime_dat$End <- as.POSIXct(crime_dat$EndingDateTime, format="%m/%d/%Y %H:%M:%OS")

#cleaning up data
aor_check <- aoristic.datacheck(crime_dat, 'XCoordinate', 'YCoordinate', 'Beg', 'End')
coordinates(crime_dat) <- crime_dat[,c('XCoordinate', 'YCoordinate')]
crs(crime_dat) <- crs(boundary)
over_check <- over(crime_dat, boundary)
keep_rows <- (aor_check$aoristic_datacheck == 0) & (!is.na(over_check$city))
crime_dat_clean <- crime_dat[keep_rows,]

#only look at business burgs to make it go abit faster
busi_burgs <- crime_dat_clean[ crime_dat_clean$UCROffense == 'BURGLARY-BUSINESS', ]
###########################

The next part preps the aoristic weights. First, the aoristic.df function is from Jerry’s aoristic package. It returns the weights broken down by 168 hours per day of the week. Here I then just collapse across the weekdays into the same hour, which to do that is simple, just add up the weights.

After that it is some more geographic data munging using the spatstat package to do the heavy lifting for the weighted kernel density estimate, and then stuffing the result back into another data frame. My bandwidth here, 3000 feet, is a bit large but makes nicer looking maps. If you do this smaller you will have a more bumpy and localized hot spots in the kernel density estimate.

###########################
#aoristic weights

#This takes like a minute
res_weights <- aoristic.df(busi_burgs@data, 'XCoordinate', 'YCoordinate', 'Beg', 'End')

#Binning into same hourly bins
for (i in 1:24){
    cols <- (0:6*24)+i+5
    lab <- paste0("Hour",i)
    res_weights[,c(lab)] <- rowSums(res_weights[,cols])
}

#Prepping the spatstat junk I need
peval <- rasterToPoints(base_Dallas)[,1:2]
spWin <- as.owin(as.data.frame(peval))
sp_ppp <- as.ppp(res_weights[,c('x_lon','y_lat')],W=spWin) #spp point pattern object

#Creating a dataframe with all of the weighted KDE
Hour_Labs <- paste0("Hour",1:24)

for (h in Hour_Labs){
  sp_den <- density.ppp(sp_ppp,weights=res_weights[,c(h)],
                        sigma=3000,
                        edge=FALSE,warnings=FALSE)
  sp_dat <- as.data.frame(sp_den)
  kd_raster <- rasterFromXYZ(sp_dat,res=res(base_Dallas),crs=crs(base_Dallas))
  base_df[,c(h)] <- as.data.frame(kd_raster,long=TRUE)$value
}
###########################

If you are following along, you may be wondering why all the hassle? It is partly because I want to use ggplot to make maps, but for its geom_contour it does not except weights, so I need to do the data manipulation myself to supply ggplot the weighted data in the proper format.

First I turn my Dallas boundary into a simple feature sf object, then I create my filled contour graph, supplying the regular grid X/Y and the Z values for the first Hour of the day (so between midnight and 1 am).

###########################
#now making contour graphs

dallas_sf <- st_as_sf(boundary)

#A plot for one hour of the day
hour1 <- ggplot() + 
  geom_contour_filled(data=base_df, aes(x, y, z = Hour1), bins=9) +
  geom_sf(data=dallas_sf, fill=NA, color='black') +
  scale_fill_brewer(palette="Greens") +
  ggtitle('       Hour [0-1)') + 
  theme_void() + theme(legend.position = "none")
hour1

png('Hour1.png', height=5, width=5, units="in", res=1000, type="cairo") 
hour1
dev.off()
###########################

Nice right! I have in the code my attempt to make a super snazzy small multiple plot, but that was not working out so well for me. But you can then go ahead and make up other slices if you want. Here is an example of taking an extended lunchtime time period.

###########################
#Plot for the afternoon time period
base_df$Afternoon <- rowSums(base_df[,paste0("Hour",10:17)])

afternoon <- ggplot() + 
  geom_contour_filled(data=base_df, aes(x, y, z = Afternoon), bins=9) +
  geom_sf(data=dallas_sf, fill=NA, color='black') +
  scale_fill_brewer(palette="Greens") +
  ggtitle('       Hour [9:00-17:00)') + 
  theme_void() + theme(legend.position = "none")
afternoon
###########################

So you can see that the patterns only slightly changed compared to the midnight prior graph.

Note that these plots will have different breaks, but you could set them to be equal by simply specifying a breaks argument in the geom_contour_filled call.

I will leave it up so someone who is more adept at R code than me to make a cool animated viz over time from this. But that is a way to mash up the temporal weights in a map.

Making a hexbin map in ggplot

In a recent working paper I made a hexbin map all in R. (Gio did most of the hard work of data munging and modeling though!) Figured I would detail the process here for some notes. Hexagon binning is purportedly better than regular squares (to avoid artifacts of runs in discretized data). But the reason I use them in this circumstance is mostly just an aesthetic preference.

Two tricky parts to this: 1) making the north arrow and scale bar, and 2) figuring out the dimensions to make regular hexagons. As an illustration I use the shooting victim data from Philly (see the working paper for all the details) full data and code to replicate here. I will walk through a bit of it though.

Data Prep

First to start out, I just use these three libraries, and set the working directory to where my data is.

library(ggplot2)
library(rgdal)
library(proj4)
setwd('C:\\Users\\axw161530\\Dropbox\\Documents\\BLOG\\HexagonMap_ggplot\\Analysis')

Now I read in the Philly shooting data, and then an outline of the city that is projected. Note I read in the shapefile data using rgdal, which imports the projection info. I need that to be able to convert the latitude/longitude spherical coordinates in the shooting data to a local projection. (Unless you are making a webmap, you pretty much always want to use some type of local projection, and not spherical coordinates.)

#Read in the shooting data
shoot <- read.csv('shootings.csv')
#Get rid of missing
shoot <- shoot[!is.na(shoot$lng),c('lng','lat')]
#Read in the Philly outline
PhilBound <- readOGR(dsn="City_Limits_Proj.shp",layer="City_Limits_Proj")
#Project the Shooting data
phill_pj <- proj4string(PhilBound)
XYMeters <- proj4::project(as.matrix(shoot[,c('lng','lat')]), proj=phill_pj)
shoot$x <- XYMeters[,1]
shoot$y <- XYMeters[,2]

Making a Basemap

It is a bit of work to make a nice basemap in R and ggplot, but once that upfront work is done then it is really easy to make more maps. To start, the GISTools package has a set of functions to get a north arrow and scale bar, but I have had trouble with them. The ggsn package imports the north arrow as a bitmap instead of vector, and I also had a difficult time with its scale bar function. (I have not figured out the cartography package either, I can’t keep up with all the mapping stuff in R!) So long story short, this is my solution to adding a north arrow and scale bar, but I admit better solutions probably exist.

So basically I just build my own polygons and labels to add into the map where I want. Code is motivated based on the functions in GISTools.

#creating north arrow and scale bar, motivation from GISTools package
arrow_data <- function(xb, yb, len) {
  s <- len
  arrow.x = c(0,0.5,1,0.5,0) - 0.5
  arrow.y = c(0,1.7  ,0,0.5,0)
  adata <- data.frame(aX = xb + arrow.x * s, aY = yb + arrow.y * s)
  return(adata)
}

scale_data <- function(llx,lly,len,height){
  box1 <- data.frame(x = c(llx,llx+len,llx+len,llx,llx),
                     y = c(lly,lly,lly+height,lly+height,lly))
  box2 <- data.frame(x = c(llx-len,llx,llx,llx-len,llx-len),
                     y = c(lly,lly,lly+height,lly+height,lly))
  return(list(box1,box2))
}

x_cent <- 830000
len_bar <- 3000
offset_scaleNum <- 64300
arrow <- arrow_data(xb=x_cent,yb=67300,len=2500)
scale_bxs <- scale_data(llx=x_cent,lly=65000,len=len_bar,height=750)

lab_data <- data.frame(x=c(x_cent, x_cent-len_bar, x_cent, x_cent+len_bar, x_cent),
                       y=c( 72300, offset_scaleNum, offset_scaleNum, offset_scaleNum, 66500),
                       lab=c("N","0","3","6","Kilometers"))

This is about the best I have been able to automate the creation of the north arrow and scale bar polygons, while still having flexibility where to place the labels. But now we have all of the ingredients necessary to make our basemap. Make sure to use coord_fixed() for maps! Also for background maps I typically like making the outline thicker, and then have borders for smaller polygons lighter and thinner to create a hierarchy. (If you don’t want the background map to have any color, use fill=NA.)

base_map <- ggplot() + 
            geom_polygon(data=PhilBound,size=1.5,color='black', fill='darkgrey', aes(x=long,y=lat)) +
            geom_polygon(data=arrow, fill='black', aes(x=aX, y=aY)) +
            geom_polygon(data=scale_bxs[[1]], fill='grey', color='black', aes(x=x, y = y)) + 
            geom_polygon(data=scale_bxs[[2]], fill='white', color='black', aes(x=x, y = y)) + 
            geom_text(data=lab_data, size=4, aes(x=x,y=y,label=lab)) +
            coord_fixed() + theme_void()

#Check it out           
base_map

This is what it looks like on my windows machine in RStudio — it ends up looking alittle different when I export the figure straight to PNG though. Will get to that in a minute.

Making a hexagon map

Now you have your basemap you can superimpose whatever other data you want. Here I wanted to visualize the spatial distribution of shootings in Philly. One option is a kernel density map. I tend to like aggregated count maps though better for an overview, since I don’t care so much for drilling down and identifying very specific hot spots. And the counts are easier to understand than densities.

In geom_hex you can supply a vertical and horizontal parameter to control the size of the hexagon — supplying the same for each does not create a regular hexagon though. The way the hexagon is oriented in geom_hex the vertical parameter is vertex to vertex, whereas the horizontal parameter is side to side.

Here are three helper functions. First, wd_hex gives you a horizontal width length given the vertical parameter. So if you wanted your hexagon to be vertex to vertex to be 1000 meters (so a side is 500 meters), wd_hex(1000) returns just over 866. Second, if for your map you wanted to convert the numbers to densities per unit area, you can use hex_area to figure out the size of your hexagon. Going again with our 1000 meters vertex to vertex hexagon, we have a total of hex_area(1000/2) is just under 650,000 square meters (or about 0.65 square kilometers).

For maps though, I think it makes the most sense to set the hexagon to a particular area. So hex_dim does that. If you want to set your hexagons to a square kilometer, given our projected data is in meters, we would then just do hex_dim(1000^2), which with rounding gives us vert/horz measures of about (1241,1075) to supply to geom_hex.

#ggplot geom_hex you need to supply height and width
#if you want a regular hexagon though, these
#are not equal given the default way geom_hex draws them
#https://www.varsitytutors.com/high_school_math-help/how-to-find-the-area-of-a-hexagon

#get width given height
wd_hex <- function(height){
  tri_side <- height/2
  sma_side <- height/4
  width <- 2*sqrt(tri_side^2 - sma_side^2)
  return(width)
}

#now to figure out the area if you want
#side is simply height/2 in geom_hex
hex_area <- function(side){
  area <- 6 * (  (sqrt(3)*side^2)/4 )
  return(area)
}

#So if you want your hexagon to have a regular area need the inverse function
#Gives height and width if you want a specific area
hex_dim <- function(area){
  num <- 4*area
  den <- 6*sqrt(3)
  vert <- 2*sqrt(num/den)
  horz <- wd_hex(height)
  return(c(vert,horz))
}

my_dims <- hex_dim(1000^2)   #making it a square kilometer
sqrt(hex_area(my_dims[1]/2)) #check to make sure it is square km
#my_dims also checks out with https://hexagoncalculator.apphb.com/

Now onto the good stuff. I tend to think discrete bins make nicer looking maps than continuous fills. So through some trial/error you can figure out the best way to make those via cut. Also I make the outlines for the hexagons thin and white, and make the hexagons semi-transparent. So you can see the outline for the city. I like how by default areas with no shootings are not given any hexagon.

lev_cnt <- seq(0,225,25)
shoot_count <- base_map + 
               geom_hex(data=shoot, color='white', alpha=0.85, size=0.1, binwidth=my_dims, 
                        aes(x=x,y=y,fill=cut(..count..,lev_cnt))) + 
               scale_fill_brewer(name="Count Shootings", palette="OrRd")

We have come so far, now to automate exporting the figure to a PNG file. I’ve had trouble getting journals recently to not bungle vector figures that I forward them, so I am just like going with high res PNG to avoid that hassle. If you render the figure and use the GUI to export to PNG, it won’t be as high resolution, so you can often easily see aliasing pixels (e.g. the pixels in the North Arrow for the earlier base map image).

png('Philly_ShootCount.png', height=5, width=5, units="in", res=1000, type="cairo") 
shoot_count
dev.off()

Note the font size/location in the exported PNG are often not quite exactly as they are when rendered in the RGUI window or RStudio on my windows machine. So make sure to check the PNG file.

2 Comments

by Andy Wheeler on August 7, 2019 • Permalink

Posted in Crime Mapping, Data Visualization, ggplot2, R

Tagged hexagon, shootings

Posted by Andy Wheeler on August 7, 2019

https://andrewpwheeler.com/2019/08/07/making-a-hexbin-map-in-ggplot/

Making interactive plots with R and Plotly

I wrote a small op-ed based on the homicide studies work I recently published about interpreting crime trends. Unfortunately that op-ed was not picked up by anyone (I missed the timing abit, maybe next year when the UCR stats come out I can just update the numbers and make the same point). I’ve posted that op-ed here, and I wanted to make a quick blog post detailing how I made the interactive graphs in that post using R and the Plotly library. All the data and code to replicate this can be downloaded from here.

Unfortunately with my free wordpress blog I cannot embed the actual interactive graphics, but I will provide links to online versions at my UT Dallas page that work and show a screenshot of each. So first, lets load all of the libraries that you will need, as well as set the working directory. (Of course change it to where you have your data saved on your local machine.)

#########################################################
#Making a shiny app for homicide rate chart
library(shiny)
library(ggplot2)
library(plotly)
library(htmlwidgets)
library(scales)

mydir <- "C:\\Users\\axw161530\\Box Sync\\Projects\\HomicideGraphs\\Analysis\\Analysis" 
setwd(mydir)
#########################################################

Now I just read in the data. I have two datasets, the funnel rates just has additional columns to draw the funnel graphs already created. (See here or here, or the data in the original Homicide Studies paper linked at the top, on how to construct these.)

############################################################
#Get the data 

FunnRates <- read.csv(file="FunnelData.csv",header=TRUE)
summary(FunnRates)
FunnRates$Population <- FunnRates$Pop1 #These are just to make nicer labels 
FunnRates$HomicideRate <- FunnRates$HomRate

IntRates <- read.csv(file="IntGraph.csv",header=TRUE)
summary(IntRates)
############################################################

Funnel Chart for One Year

First, plotly makes it dead easy to take graphs you created via ggplot and turn them into an interactive graph. So here is a link to the interactive chart, and below is a screenshot.

To walk through the code, first you make your (almost) plane Jane ggplot object. Here I name it p. You will get an error for an “unknown aesthetics: text”, but this will be used by plotly to create tooltips. Then you use the ggplotly function to turn the original ggplot graph p into an interactive graph. By default the plotly object has more stuff in the tooltip than I want, which you can basically just go into the innards of the plotly object and strip out. Then the final part is just setting the margins to be alittle larger than default, as the axis labels were otherwise slightly cut-off.

############################################################
#Make the funnel chart
year_sel <- 2015
p <- ggplot(data = FunnRates[FunnRates$Year == year_sel,]) + geom_point(aes(x=Population, y=HomicideRate, text=NiceLab), pch=21) +
     geom_line(aes(x=Population,y=LowLoc99)) + geom_line(aes(x=Population,y=HighLoc99)) + 
     labs(title = paste0("Homicide Rates per 100,000 in ",year_sel)) + 
     scale_x_log10("Population", limits=c(10000,10000000), breaks=c(10^4,10^5,10^6,10^7), labels=comma) + 
     scale_y_continuous("Homicide Rate", breaks=seq(0,110,10)) + 
     theme_bw() #+ theme(text = element_text(size=20), axis.title.y=element_text(margin=margin(0,10,0,0)))

pl <- ggplotly(p, tooltip = c("HomicideRate","text"))
#pl <- plotly_build(p, width=1000, height=900)
#See https://stackoverflow.com/questions/45801389/disable-hover-information-for-a-specific-layer-geom-of-plotly
pl$x$data[[2]]$hoverinfo <- "none"
pl$x$data[[3]]$hoverinfo <- "none"
pl % layout(margin = list(l = 75, b = 65))
############################################################

After this point you can just type pl into the console and it will open up an interactive window. Or you can use the saveWidget function from the htmlwidgets package, something like saveWidget(as_widget(pl), "FunnelChart_2015.html", selfcontained=TRUE) to save the graph to an html file.

Now there are a couple of things. You can edit various parts of the graph, such as its size and label text size, but depending on your application these might not be a good idea. If you need to take into account smaller screens, I think it is best to use some of the defaults, as they adjust per the screen that is in use. For the size of the graph if you are embedding it in a webpage using iframe’s you can set the size at that point. If you look at my linked op-ed you can see I make the funnel chart taller than wider — that is through the iframe specs.

Funnel Chart over Time

Ok, now onto the fun stuff. So we have a funnel chart for one year, but I have homicide years from 1965 through 2015. Can we examine those over time. Plotly has an easy to use additional argument to ggplot graphs, named Frame, that allows you to add a slider to the interactive chart for animation. The additional argument ids links one object over time, ala Hans Rosling bubble chart over time. Here is a link to the interactive version, and below is a screen shot:

############################################################
#Making the funnel chart where you can select the year
py <- ggplot(data = FunnRates) + geom_point(aes(x=Population, y=HomicideRate, text=NiceLab, frame=Year,ids=ORI), pch=21) +
      geom_line(aes(x=Population,y=LowLoc99,frame=Year)) + geom_line(aes(x=Population,y=HighLoc99,frame=Year)) + 
      labs(title = paste0("Homicide Rates per 100,000")) + 
      scale_x_log10("Population", limits=c(10000,10000000), breaks=c(10^4,10^5,10^6,10^7), labels=comma) + 
      scale_y_continuous("Homicide Rate", breaks=seq(0,110,10), limits=c(0,110)) + 
      theme_bw() #+ theme(text = element_text(size=20), axis.title.y=element_text(margin=margin(0,10,0,0)))

ply % animation_opts(0, redraw=FALSE)
ply$x$data[[2]]$hoverinfo <- "none"
ply$x$data[[3]]$hoverinfo <- "none"
saveWidget(as_widget(ply), "FunnelChart_YearSelection.html", selfcontained=FALSE)
############################################################

The way I created the data it does not make sense to do a smooth animation for the funnel line, so this just flashes to each new year (via the animation_opts spec). (I could make the data so it would look nicer in an animation, but will wait for someone to pick up the op-ed before I bother too much more with this.) But it accomplishes via the slider the ability for you to pick which year you want.

Fan Chart Just One City

Next we are onto the fan charts for each individual city with the prediction intervals. Again you can just create this simple chart in ggplot, and then use plotly to make a version with tooltips. Here is a link to an interactive version, and below is a screenshot.

###################################################
#Making the fan graph for New Orleans
titleLab <- unique(IntRates[,c("ORI","NiceLab","AgencyName","State")])
p2 <- ggplot(data=IntRates[IntRates$ORI == "LANPD00",], aes(x=Year, y=HomRate)) + 
     geom_ribbon(aes(ymin=LowB, ymax=HighB), alpha=0.2) +
     geom_ribbon(aes(ymin=LagLow25, ymax=LagHigh25), alpha=0.5) +
     geom_point(shape=21, color="white", fill="red", size=2) +
     labs(x = "Year", y="Homicide Rate per 100,000") +
     #scale_x_continuous(breaks=seq(1960,2015,by=5)) + 
     ggtitle(paste0("Prediction Intervals for ",titleLab[titleLab$ORI == "LANPD00",c("NiceLab")])) +
     theme_bw() #+ theme(text = element_text(size=20), axis.title.y=element_text(margin=margin(0,10,0,0)))
#p2
pl2 <- ggplotly(p2, tooltip = c("Year","HomRate"), dynamicTicks=TRUE)
pl2$x$data[[1]]$hoverinfo <- "none"
pl2$x$data[[2]]$hoverinfo <- "none"
pl2 % layout(margin = list(l = 100, b = 65))
#pl2
saveWidget(as_widget(pl2), "FanChart_NewOrleans.html", selfcontained=FALSE)
###################################################

Note when you save the widget to selfcontained=FALSE, it hosts several parts of the data into separate folders. I always presumed this was more efficient than making one huge html file, but I don’t know for sure.

Unfortunately the frame type animation does not make as much sense here. It would be hard for someone to find a particular city of interest in that slider (as a note though the slider can have nominal data, if I only had a few cities it would work out ok, with a few hundred it will not though). So feature request if anyone from plotly is listening — please have a dropdown type option for ggplot graphs! In the meantime though there is an alternative using a tradition plot_ly type chart. ~~Here is that interactive fan chart with a police agency dropdown, and below is a screenshot~~. Note there was problems with this, will need to update based on updates to plotly and ggplot make this easier I believe.

###################################################
#Making the fan graph where you can select the city of interest
#Need to have a dropdown for the city

titleLab <- unique(IntRates[,c("ORI","NiceLab","State")])
nORI <- length(titleLab[,1])
choiceP <- vector("list",nORI)
for (i in 1:nORI){
choiceP[[i]] <- list(method="restyle", args=list("transforms[0].value", unique(IntRates$NiceLab)[i]), label=titleLab[i,c("NiceLab")])
}

trans <- list(list(type='filter',target=~NiceLab, operation="=", value=unique(IntRates$NiceLab)[1]))
textLab <- ~paste("Homicide Rate:",HomRate,'$ Year:',Year,'$ Homicides:',Homicide,'$ Population:',Pop1,'$ Agency Name:',NiceLab) #Lets try with the default plotly #See https://community.plot.ly/t/need-help-on-using-dropdown-to-filter/6596 ply4 %          plot_ly(x= ~Year,y= ~HighB, type='scatter', mode='lines', line=list(color='transparent'), showlegend=FALSE, name="90%", hoverinfo="none", transforms=trans) %>%
        add_trace(y=~LowB,  type='scatter', mode='lines', line=list(color='transparent'), showlegend=FALSE, name='10%', hoverinfo="none", transforms=trans,
          fill = 'tonexty', fillcolor='rgba(105,105,105,0.3)') %>%
        add_trace(x=~Year,y=~HomRate, text=~NiceLab, type='scatter', mode='markers', marker = list(size=10, color = 'rgba(255, 182, 193, .9)', line = list(color = 'rgba(152, 0, 0, .8)', width = 1)),
          hoverinfo='text', text=textLab, transforms=trans) %>%
        layout(title = "Homicide Rates and 80% Prediction Intervals by Police Department",
          xaxis = list(title="Year"),
          yaxis = list(title="Homicide Rate per 100,000"),
          updatemenus=list(list(type='dropdown',active=0,buttons=choiceP)))
                               
saveWidget(as_widget(ply4), "FanChart_Dropdown.html", selfcontained=FALSE)
###################################################

So in short plotly makes it super-easy to make interactive graphs with tooltips. Long term goal I would like to make a visual supplement to the traditional UCR report (I find the complaint of what tables to include to miss the point — there are much better ways to show the information that worrying about the specific tables). So if you would like to work on that with me always feel free to get in touch!

2 Comments

by Andy Wheeler on March 25, 2018 • Permalink

Posted in Crime Analysis, Data Visualization, ggplot2, R

Tagged interactive, plotly, tool-tips

Posted by Andy Wheeler on March 25, 2018

https://andrewpwheeler.com/2018/03/25/making-interactive-plots-with-r-and-plotly/

Some plots to go with group based trajectory models in R

On my prior post on estimating group based trajectory models in R using the crimCV package I received a comment asking about how to plot the trajectories. The crimCV model object has a base plot object, but here I show how to extract those model predictions as well as some other functions. Many of these plots are illustrated in my paper for crime trajectories at micro places in Albany (forthcoming in the Journal of Quantitative Criminology). First we are going to load the crimCV and the ggplot2 package, and then I have a set of four helper functions which I will describe in more detail in a minute. So run this R code first.

library(crimCV)
library(ggplot2)

long_traj <- function(model,data){
  df <- data.frame(data)
  vars <- names(df)
  prob <- model['gwt'] #posterior probabilities
  df$GMax <- apply(prob$gwt,1,which.max) #which group # is the max
  df$PMax <- apply(prob$gwt,1,max)       #probability in max group
  df$Ord <- 1:dim(df)[1]                 #Order of the original data
  prob <- data.frame(prob$gwt)
  names(prob) <- paste0("G",1:dim(prob)[2]) #Group probabilities are G1, G2, etc.
  longD <- reshape(data.frame(df,prob), varying = vars, v.names = "y", 
                   timevar = "x", times = 1:length(vars), 
                   direction = "long") #Reshape to long format, time is x, y is original count data
  return(longD)                        #GMax is the classified group, PMax is the probability in that group
}

weighted_means <- function(model,long_data){
  G_names <- paste0("G",1:model$ng)
  G <- long_data[,G_names]
  W <- G*long_data$y                                    #Multiple weights by original count var
  Agg <- aggregate(W,by=list(x=long_data$x),FUN="sum")  #then sum those products
  mass <- colSums(model$gwt)                            #to get average divide by total mass of the weight
  for (i in 1:model$ng){
    Agg[,i+1] <- Agg[,i+1]/mass[i]
  }
  long_weight <- reshape(Agg, varying=G_names, v.names="w_mean",
                         timevar = "Group", times = 1:model$ng, 
                         direction = "long")           #reshape to long
  return(long_weight)
}
  
pred_means <- function(model){
    prob <- model$prob               #these are the model predicted means
    Xb <- model$X %*% model$beta     #see getAnywhere(plot.dmZIPt), near copy
    lambda <- exp(Xb)                #just returns data frame in long format
    p <- exp(-model$tau * t(Xb))
    p <- t(p)
    p <- p/(1 + p)
    mu <- (1 - p) * lambda
    t <- 1:nrow(mu)
    myDF <- data.frame(x=t,mu)
    long_pred <- reshape(myDF, varying=paste0("X",1:model$ng), v.names="pred_mean",
                         timevar = "Group", times = 1:model$ng, direction = "long")
    return(long_pred)
}

#Note, if you estimate a ZIP model instead of the ZIP-tau model
#use this function instead of pred_means
pred_means_Nt <- function(model){
    prob <- model$prob               #these are the model predicted means
    Xb <- model$X %*% model$beta     #see getAnywhere(plot.dmZIP), near copy
    lambda <- exp(Xb)                #just returns data frame in long format
	Zg <- model$Z %*% model$gamma
    p <- exp(Zg)
    p <- p/(1 + p)
    mu <- (1 - p) * lambda
    t <- 1:nrow(mu)
    myDF <- data.frame(x=t,mu)
    long_pred <- reshape(myDF, varying=paste0("X",1:model$ng), v.names="pred_mean",
                         timevar = "Group", times = 1:model$ng, direction = "long")
    return(long_pred)
}

occ <- function(long_data){
 subdata <- subset(long_data,x==1)
 agg <- aggregate(subdata$PMax,by=list(group=subdata$GMax),FUN="mean")
 names(agg)[2] <- "AvePP" #average posterior probabilites
 agg$Freq <- as.data.frame(table(subdata$GMax))[,2]
 n <- agg$AvePP/(1 - agg$AvePP)
 p <- agg$Freq/sum(agg$Freq)
 d <- p/(1-p)
 agg$OCC <- n/d #odds of correct classification
 agg$ClassProp <- p #observed classification proportion
 #predicted classification proportion
 agg$PredProp <- colSums(as.matrix(subdata[,grep("^[G][0-9]", names(subdata), value=TRUE)]))/sum(agg$Freq) 
 #Jeff Ward said I should be using PredProb instead of Class prop for OCC
 agg$occ_pp <- n/ (agg$PredProp/(1-agg$PredProp))
 return(agg)
}

Now we can just use the data in the crimCV package to run through an example of a few different types of plots. First lets load in the TO1adj data, estimate the group based model, and make our base plot.

data(TO1adj)
out1 <-crimCV(TO1adj,4,dpolyp=2,init=5)
plot(out1)

Now most effort seems to be spent on using model selection criteria to pick the number of groups, what may be called relative model comparisons. Once you pick the number of groups though, you should still be concerned with how well the model replicates the data at hand, e.g. absolute model comparisons. The graphs that follow help assess this. First we will use our helper functions to make three new objects. The first function, long_traj, takes the original model object, out1, as well as the original matrix data used to estimate the model, TO1adj. The second function, weighted_means, takes the original model object and then the newly created long_data longD. The third function, pred_means, just takes the model output and generates a data frame in wide format for plotting (it is the same underlying code for plotting the model).

longD <- long_traj(model=out1,data=TO1adj)
x <- weighted_means(model=out1,long_data=longD)
pred <- pred_means(model=out1)

We can subsequently use the long data longD to plot the individual trajectories faceted by their assigned groups. I have an answer on cross validated that shows how effective this small multiple design idea can be to help disentangle complicated plots.

#plot of individual trajectories in small multiples by group
p <- ggplot(data=longD, aes(x=x,y=y,group=Ord)) + geom_line(alpha = 0.1) + facet_wrap(~GMax)
p

Plotting the individual trajectories can show how well they fit the predicted model, as well as if there are any outliers. You could get more fancy with jittering (helpful since there is so much overlap in the low counts) but just plotting with a high transparency helps quite abit. This second graph plots the predicted means along with the weighted means. What the weighted_means function does is use the posterior probabilities of groups, and then calculates the observed group averages per time point using the posterior probabilities as the weights.

#plot of predicted values + weighted means
p2 <- ggplot() + geom_line(data=pred, aes(x=x,y=pred_mean,col=as.factor(Group))) + 
                 geom_line(data=x, aes(x=x,y=w_mean,col=as.factor(Group))) + 
                geom_point(data=x, aes(x=x,y=w_mean,col=as.factor(Group)))
p2

Here you can see that the estimated trajectories are not a very good fit to the data. Pretty much eash series has a peak before the predicted curve, and all of the series except for 2 don’t look like very good candidates for polynomial curves.

It ends up that often the weighted means are very nearly equivalent to the unweighted means (just aggregating means based on the classified group). In this example the predicted values are a colored line, the weighted means are a colored line with superimposed points, and the non-weighted means are just a black line.

#predictions, weighted means, and non-weighted means
nonw_means <- aggregate(longD$y,by=list(Group=longD$GMax,x=longD$x),FUN="mean")
names(nonw_means)[3] <- "y"

p3 <- p2 + geom_line(data=nonw_means, aes(x=x,y=y), col='black') + facet_wrap(~Group)
p3

You can see the non-weighted means are almost exactly the same as the weighted ones. For group 3 you typically need to go to the hundredths to see a difference.

#check out how close
nonw_means[nonw_means$Group==3,'y'] -  x[x$Group==3,'w_mean']

You can subsequently superimpose the predicted group means over the individual trajectories as well.

#superimpose predicted over ind trajectories
pred$GMax <- pred$Group
p4 <- ggplot() + geom_line(data=pred, aes(x=x,y=pred_mean), col='red') + 
                 geom_line(data=longD, aes(x=x,y=y,group=Ord), alpha = 0.1) + facet_wrap(~GMax)
p4

Two types of absolute fit measures I’ve seen advocated in the past are the average maximum posterior probability per group and the odds of correct classification. The occ function calculates these numbers given two vectors (one of the max probabilities and the other of the group classifications). We can get this info from our long data by just selecting a subset from one time period. Here the output at the console shows that we have quite large average posterior probabilities as well as high odds of correct classification. (Also updated to included the observed classified proportions and the predicted proportions based on the posterior probabilities. Again, these all show very good model fit.) Update: Jeff Ward sent me a note saying I should be using the predicted proportion in each group for the occ calculation, not the assigned proportion based on the max. post. prob. So I have updated to include the occ_pp column for this, but left the old occ column in as a paper trail of my mistake.

occ(longD)
#  group     AvePP Freq        OCC  ClassProp   PredProp     occ_pp
#1     1 0.9880945   23 1281.00444 0.06084656 0.06298397 1234.71607
#2     2 0.9522450   35  195.41430 0.09259259 0.09005342  201.48650
#3     3 0.9567524   94   66.83877 0.24867725 0.24936266   66.59424
#4     4 0.9844708  226   42.63727 0.59788360 0.59759995   42.68760

A plot to accompany this though is a jittered dot plot showing the maximum posterior probability per group. You can here that groups 3 and 4 are more fuzzy, whereas 1 and 2 mostly have very high probabilities of group assignment.

#plot of maximum posterior probabilities
subD <- longD[x==1,]
p5 <- ggplot(data=subD, aes(x=as.factor(GMax),y=PMax)) + geom_point(position = "jitter", alpha = 0.2)
p5

Remember that these latent class models are fuzzy classifiers. That is each point has a probability of belonging to each group. A scatterplot matrix of the individual probabilities will show how well the groups are separated. Perfect separation into groups will result in points hugging along the border of the graph, and points in the middle suggest ambiguity in the class assignment. You can see here that each group closer in number has more probability swapping between them.

#scatterplot matrix
library(GGally)
sm <- ggpairs(data=subD, columns=4:7)
sm

And the last time series plot I have used previously is a stacked area chart.

#stacked area chart
nonw_sum <- aggregate(longD$y,by=list(Group=longD$GMax,x=longD$x),FUN="sum")
names(nonw_sum)[3] <- "y"
p6 <- ggplot(data=nonw_sum, aes(x=x,y=y,fill=as.factor(Group))) + geom_area(position='stack')
p6

I will have to put another post in the queue to talk about the spatial point pattern tests I used in that trajectory paper for the future as well.

36 Comments

by Andy Wheeler on September 29, 2015 • Permalink

Posted in Data Visualization, ggplot2, R

Tagged ggplot2, r, trajectory

Posted by Andy Wheeler on September 29, 2015

https://andrewpwheeler.com/2015/09/29/some-plots-to-go-with-group-based-trajectory-models-in-r/

Custom square root scale (with negative values) in ggplot2 (R)

My prior rootogram post Jon Peck made the astute comment that rootograms typically are plotted on a square root scale. (Which should of been obvious to me given the name!) The reason for a square root scale for rootograms is visualization purposes, the square root scale gives more weight to values nearby 0 and shrinks values farther away from 0.

SPSS can not have negative values on a square root scale, but you can make a custom scale using ggplot2 and the scales package in R for this purpose. Here I just mainly replicated this short post by Paul Hiemstra.

So in R, first we load the scales and the ggplot2 package, and then create our custom scale function. Obviously the square root of a negative value is not defined for real numbers, so what we do is make a custom square root function. The function simply takes the square root of the absolute value, and then multiplies by the sign of the original value. This function I name S_sqrt (for signed square root). We also make its inverse function, which is named IS_sqrt. Finally I make a third function, S_sqrt_trans, which is the one used by the scales package.

library(scales)
library(ggplot2)

S_sqrt <- function(x){sign(x)*sqrt(abs(x))}
IS_sqrt <- function(x){x^2*sign(x)}
S_sqrt_trans <- function() trans_new("S_sqrt",S_sqrt,IS_sqrt)

Here is a quick example data set in R to work with.

#rootogram example, see http://stats.stackexchange.com/q/140473/1036
MyText <- textConnection("
Dist Val1 Val2
1 0.03 0.04
2 0.12 0.15
3 0.45 0.50
4 0.30 0.24 
5 0.09 0.04 
6 0.05 0.02
7 0.01 0.01
")
MyData <- read.table(MyText,header=TRUE)
MyData$Hang <- MyData$Val1 - MyData$Val2

And now we can make our plots in ggplot2. First the linear scale, and second update our plot to the custom square root scale.

p <- ggplot(data=MyData, aes(x = as.factor(Dist), ymin=Hang, ymax=Val1)) + 
     geom_hline(aes(yintercept=0)) + geom_linerange(size=5) + theme_bw()
p

p2 <- p + scale_y_continuous(trans="S_sqrt",breaks=seq(-0.1,0.5,0.05), name="Density")
p2

1 Comment

by Andy Wheeler on July 31, 2015 • Permalink

Posted in Data Visualization, ggplot2, R

Tagged data visualization, ggplot2, r

Posted by Andy Wheeler on July 31, 2015

https://andrewpwheeler.com/2015/07/31/custom-square-root-scale-with-negative-values-in-ggplot2-r/

Search for:
Recent Posts
Categories
Categories
Site RSS Feeds
- RSS - Posts
- RSS - Comments
Follow Blog via Email

Enter your email address to follow this blog and receive notifications of new posts by email.

Email Address:

Join 390 other subscribers
aoristic big-data cartography census choropleth citeulike consulting cost-benefit courses crime-mapping crime-trends Crime Analysis Criminal Justice data-manipulation data visualization deep-learning ESRI excel flow-data folium geocoding github google-streetview-api grammar of graphics group-based-trajectory gun-violence healthcare homicide-rates hot spots hypothesis-testing linear programming LLM logistic-regression machine-learning MACRO mapping matplotlib meta network NetworkX officer-involved-shooting open-science paper Papers peer-review Poisson prediction Predictive-Policing preprint presentation Python Python-programability pytorch quasi-experiment r recidivism regression resources scholarly scraping seaborn shootings simulation small-multiples social-media social-networking SPSS stackexchange Stata statistics survey time-series uncertainty wdd web-scraping
Top Posts & Pages
Stack Exchange

All posts in category ggplot2

Nerd Notes

References

Crime increase dispersion analysis in R

Other References of Interest

R Code Snippet

Data Prep

Making a Basemap

Making a hexagon map

Funnel Chart for One Year

Funnel Chart over Time

Fan Chart Just One City

Fan Chart with Dropdown Selection

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange