All posts tagged data visualization

Some notes on self-publishing a tech book

So my book, Data Science for Crime Analysis with Python, is finally out for purchase on my Crime De-Coder website. Folks anywhere in the world can purchase a paperback or epub copy of the book. You can see this post on Crime De-Coder for a preview of the first two chapters, but I wanted to share some of my notes on self publishing. It was some work, but in retrospect it was worth it. Prior books I have been involved with (Wheeler 2017; Wheeler et al. 2021) plus my peer review experience I knew I did not need help copy-editing, so the notes are mostly about creating the physical book and logistics of selling it.

Academics may wish to go with a publisher for prestige reasons (I get it, I was once a professor as well). But it is quite nice once you have done the legwork to publish it yourself. You have control of pricing, and if you want to make money you can, or have it cheap/free for students.

Here I will detail some of the set up of compiling the book, and then the bit of work to distribute it.

Compiling the documents

So the way I compiled the book is via Quarto. I posted my config notes on how to get the book contents to look how I wanted on GitHub. Quarto is meant to run code at the same time (so works nicely for a learning to code book). But even if I just wanted a more typical science/tech book with text/images/equations, I would personally use Quarto since I am familiar with the set up at this point. (If you do not need to run dynamic code you could do it in Pandoc directly, not sure if there is a way to translate a Quarto yaml config to the equivalent Pandoc code it turns into.)

One thing that I think will interest many individuals – you write in plain text markdown. So my writing looks like:

# Chapter Heading

blah, blah blah

## Subheading

Cool stuff here ....

In a series of text files for each chapter of the book. And then I tell Quarto quarter render, and it turns my writing in those text files into both an Epub and a PDF (and other formats if you cared, such as word or html). You can set up the configuration for the book to be different for the different formats (for example I use different fonts in the PDF vs the epub, nice fonts in one look quite bad in the other). See the _quarto.yml file for the set up, in particular config options that are different for both PDF and Epub.

One thing is that ebooks are hard to format nicely – if I had a book I wanted to redo to be an epub, I would translate it to markdown. There are services online that will translate, they will do a bad job though with scientific texts with many figures (and surely will not help you choose nice fonts). So just learn markdown to translate. Folks who write in one format and save to the other (either Epub/HTML to PDF, or PDF to Epub/HTML) are doing it wrong and the translated format will look very bad. Most advice online is for people who have just books with just text, so science people with figures (and footnotes, citations, hyperlinks, equations, etc.) it is almost all bad advice.

So even for qualitative people, learning how to write in markdown to self-publish is a good skill to learn in my opinion.

Setting up the online store

For awhile I have been confused how SaaS companies offer payment plans. (Many websites just seem to copy from generic node templates.) Looking at the Stripe API it just seems over the top for me to script up all of my own solution to integrate Stripe directly. If I wanted to do a subscription I may need to figure that out, but it ended up being for my Hostinger website I can set up a sub-page that is WordPress (even though the entire website is not), and turn on WooCommerce for that sub-page.

WooCommerce ends up being easy, and you can set up the store to host web-assets to download on demand (so when you purchase it generates a unique URL that obfuscates where the digital asset is saved). No programming involved to set up my webstore, it was all just point and click to set things up one time and not that much work in the end.

I am not sure about setting up any DRM for the epub (so in reality people will purchase epub and share it illegally). I don’t know of a way to prevent this without using Amazon+Kindle to distribute the book. But the print book should be OK. (If there were a way for me to donate a single epub copy to all libraries in the US I would totally do that.)

I originally planned on having it on Amazon, but the low margins on both plus the formatting of their idiosyncratic kindle book format (as far as I can tell, I cannot really choose my fonts) made me decide against doing either the print or ebook on Amazon.

Print on Demand using LuLu

For print on demand, I use LuLu.com. They have a nice feature to integrate with WooCommerce, the only thing I wish shipping was dynamically calculated. (I need to make a flat shipping rate for different areas around the globe the way it is set up now, slightly annoying and will change the profit margins depending on area.)

LuLu is a few more dollars to print than Amazon, but it is worth it for my circumstance I believe. Now if I had a book I expected to get many “random Amazon search buys” I could see wanting it on Amazon. I expect more sales will be via personal advertising (like here on the blog, social media, or other crime analyst events). My Crime De-Coder site (and this blog) will likely be quite high in google searches for some of the keywords fairly quickly, so who knows, maybe just having on personal site is just as many sales.

LuLu does has an option to turn on distribution to other wholesalers (like Barnes & Noble and Amazon) – have not turned that on but maybe I will in the future.

LuLu has a pricing calculator to see how much to print on their website. Paperback and basically the cheapest color option for letter sized paper (which is quite large) is just over $17 for my 310 page book (Amazon was just over $15). For folks if you are less image heavy and more text, you could get away with a smaller size book (and maybe black/white) and I suspect will be much cheaper. LuLu’s printing of this book is higher quality compared to Amazon as well (better printing of the colors and nicer stock for the paperback cover).

Another nice thing about print on demand is I can go in and edit/update the book as I see fit. No need to worry about new versions. Not sure what that exactly means for citing the work (I could always go and change it), you can’t have a static version of record and an easy way to update at the same time.

Other Random Book Stuff

I purchased ISBNs on Bowker, something like 10 ISBNs for $200. (You want a unique ISBN for each type of the book, so you may want three in the end if you have epub/paperback/hardback.) Amazon and LuLu though have options to have them give you an ISBN though, so that may have not been necessary. I set the imprint to be my LLC though in Bowker, so CRIME De-Coder is the publisher.

You don’t technically need an ISBN at all, but it is a simple thing, and there may be ways for me to donate to libraries in the future. (If a University picks it up as a class text, I have been at places you need at least one copy for rent at the Uni library.)

I have not created an index – I may have a go at feeding my book through LLMs and seeing if I can auto-generate a nice index. (I just need a list of key words, after that can just go and find-replace the relevent text in the book to fill in so it auto-compiles an index.) I am not sure that is really necessary though for a how-to book, you should just look at the table of contents to see the individual (fairly small) sections. For epub you can just doing a direct text search, so not sure if people use an index at all in epubs.

Personal Goals

So I debated on releasing the book open source, I do want to try and see if I can make some money though. I don’t have this expectation, but there is potential to get some “data science” spillover, and if that is the case sales could in theory be quite high. (I was surprised in searching the “data science python” market on Amazon, it is definitely not saturated.) Personally I will consider at least 100 sales to be my floor for success. That is if I can sell at least 100 copies, I will consider writing more books. If I can’t sell 100 copies I have a hard time justifying the effort – it would just be too few of people buying the book to have the types of positive spillovers I want.

To make back money relative to the amount of work I put in, I would need more than 1000 sales (which I think is unrealistic). I think 500 sales is about best case, guesstimating the size of the crime analyst community that may be interested plus some additional sales for grad students. 1000 sales it would need to be in the multiple professors using it for a class book over several years. (Which if you are a professor and interested in this for a class let me know, I will give your class a discount.)

Another common way for individuals to make money off of books is not for sales, but to have training’s oriented around the book. I am hoping to do more of that for crime analysts directly in the future, but those opportunities I presume will be correlated with total sales.

I do enjoy writing, but I am busy, so cannot just say “I am going to drop 200 hours writing a book”. I would like to write additional python topics oriented towards crime analysts/criminology grad students like:

GIS analysis in python
Regression
Machine Learning & Optimization
Statistics for Crime Analysis
More advanced project management in python

Having figured out much of this grunt work definitely makes me more motivated, but ultimately in the end need to have a certain level of sales to justify the effort. So please if you like the blog pick up a copy and tell a friend you like my work!

References

Wheeler, A.P. (2024) Data Science for Crime Analysis with Python. CRIME De-Coder. ISBN 979-8-9903770-1-1
Wheeler, A.P. (2017) Geospatial analytics. In SPSS® Statistics for Data Analysis and Visualization. Eds. McCormick, K. & Salcedo J., with contributions by Peck J. & Wheeler, A.P. Wiley. ISBN 978-1119003557
Wheeler, A.P., Herrmann C.R., & Block R.L. (2021). Micro-place homicide patterns in Chicago. Springer. ISBN 978-3-030-61445-4

6 Comments

by Andy Wheeler on July 2, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, Personal Productivity, Python, writing

Tagged data visualization, ebook

Posted by Andy Wheeler on July 2, 2024

https://andrewpwheeler.com/2024/07/02/some-notes-on-self-publishing-a-tech-book/

Monitoring homicide trends paper published

My paper, Monitoring Volatile Homicide Trends Across U.S. Cities (with coauthor Tom Kovandzic) has just been published online in Homicide Studies. Unfortunately, Homicide Studies does not give me a link to share a free PDF like other publishers, but you can either grab the pre-print on SSRN or always just email me for a copy of the paper.

They made me convert all of the charts to grey scale :(. Here is an example of the funnel chart for homicide rates in 2015.

And here are example fan charts I generated for a few different cities.

As always if you have feedback or suggestions let me know! I posted all of the code to replicate the analysis at this link. The prediction intervals can definately be improved both in coverage and in making their length smaller, so I hope to see other researchers tackling this as well.

Leave a comment

by Andy Wheeler on November 20, 2017 • Permalink

Posted in Crime Analysis, Criminal Justice, Data Visualization, Papers, scholarly

Tagged data visualization, homicide-rates, Papers, scholarly

Posted by Andy Wheeler on November 20, 2017

https://andrewpwheeler.com/2017/11/20/monitoring-homicide-trends-paper-published/

Presentation at ASC: Crime Data Visualization for the Future

At the upcoming American Society of Criminology conference in Philadelphia I will be presenting a talk, Crime Data Visualization for the Future. Here is the abstract:

Open data is a necessary but not sufficient condition for data to be transparent. Understanding how to reduce complicated information into informative displays is an important step for those wishing to understand crime and how the criminal justice system works. I focus the talk on using simple tables and graphs to present complicated information using various examples in criminal justice. Also I describe ways to effectively evaluate the size of effects in regression models, and make black box machine learning models more interpretable.

But I have written a paper to go with the talk as well. You can download that paper here. As always, if you have feedback/suggestions let me know.

Here are some example graphs of plotting the predictions from a random forest model predicting when restaurants in Chicago will fail their inspections.

I present on Wednesday 11/15 at 11 am. You can see the full session here. Here is a quick rundown of the other papers — Marcus was the one who put together the panel.

A Future Proposal for the Model Crime Report – Marcus Felson
Crime Data Warehouses and the future of Big Data in Criminology – Martin Andresen
Can We Unify Criminal Justice Data, Like the Dutch and the Nordics? – Michael Mueller-Smith

So it should be a great set of talks.

I also signed up to present a poster, Mapping Attitudes Towards the Police at Micro Places. This is work with Albany Finn Institute folks, including Jasmine Silver, Sarah McLean, and Rob Worden. Hopefully I will have a paper to share about that soon, but for a teaser on that here is an example map from that work, showing hot spots of dissatisfaction with the police estimated via inverse distance weighting. Update: for those interested, see here for the paper and here for the poster. Stop on by Thursday to check it out!

And here is the abstract:

We demonstrate the utility of mapping community satisfaction with the police at micro places using data from citizen surveys conducted in 2001, 2009 and 2014 in one city. In each survey, respondents provided the nearest intersection to their address. We use inverse distance weighting to map a smooth surface of satisfaction with police over the entire city, which shows broader neighborhood patterns of satisfaction as well as small area hot spots of dissatisfaction. Our results show that hot spots of dissatisfaction with police do not conform to census tract boundaries, but rather align closely with hot spots of crime and police activity. Models predicting satisfaction with police show that local counts of violent crime are the strongest predictors of attitudes towards police, even above individual level predictors of race and age.

Leave a comment

by Andy Wheeler on November 3, 2017 • Permalink

Posted in Crime Mapping, Data Visualization, Papers, scholarly

Tagged data visualization, presentation, scholarly

Posted by Andy Wheeler on November 3, 2017

https://andrewpwheeler.com/2017/11/03/presentation-at-asc-crime-data-visualization-for-the-future/

Talk on Scholars Day – Crime in Space and Time

I will be giving a talk tomorrow (10/21/17) at Scholars Day here at UT Dallas (where we get visits from prospective students). Here is the synopsis of my talk:

Synopsis: In this lecture, Dr. Andrew Wheeler will discuss his research on the spatial and temporal patterns of crime. He will discuss whether recent homicide trends are atypical given historical data and if you can predict which neighborhoods in Dallas have the most crime. He will also discuss what to expect from an education in criminology and the social sciences in general.

I will be at JSOM 2.106 from 11 to 11:45. Here is a bit of a sneak peak. (You will also get some Han’s Rosling style animated charts of homicide trends!)

I will also discuss some of my general pro-tips for incoming college students. I will expand that into a short post next week, but if you want that advice a few days ahead come to my talk!

Leave a comment

by Andy Wheeler on October 20, 2017 • Permalink

Posted in Data Visualization, scholarly

Tagged data visualization, scholarly, talk

Posted by Andy Wheeler on October 20, 2017

https://andrewpwheeler.com/2017/10/20/talk-on-scholars-day-crime-in-space-and-time/

New working paper – Monitoring volatile homicide trends across U.S. cities

I have a new working paper out — Monitoring volatile homicide trends across U.S. cities, with one of my colleagues Tomislav Kovandzic. You can grab the pre-print on SSRN, and the paper has links to code to replicate the charts and models in the paper.

Here I look at homicide rates in U.S. cities and use funnel charts and fan charts to show the typical volatility in homicide rates between cities and within cities over time. As I’ve written previously, I think much of the media narrative around homicide increases are hyperbolic and often cherry pick reasons why they think homicides are going up.

I’ve shown examples of funnel charts on this blog before, so I will use a different image as the tease. To generate the prediction intervals for fan charts I estimate binomial random effect models. Below is an example for New Orleans (homicide rate per 100,000 population):

As always, if you have feedback feel free to send me an email.

1 Comment

by Andy Wheeler on May 30, 2017 • Permalink

Posted in Data Visualization, scholarly

Tagged data visualization, homicide-rates, paper, scholarly

Posted by Andy Wheeler on May 30, 2017

https://andrewpwheeler.com/2017/05/30/new-working-paper-monitoring-volatile-homicide-trends-across-u-s-cities/

Much ado about nothing: Overinterpreting volatility in homicide rates

I’m not much of a macro criminologist, but being asked questions by my dad (about Richard Rosenfeld and the Ferguson effect) and the dentist yesterday (asking about some of Trumps comments about rising crime trends) has prompted me to jump into it and give my opinion. Long story short — many sources I believe are overinterpreting short term fluctuations as more meaningful than they are.

First I will tackle national crime rates. So if you have happened to walk by a TV playing CNN the past few days, you may have heard Donald Trump being criticized for his statements on crime rates. This is partially a conflation with the difference between overall levels of crime versus changes in crime over time. Basically crime is currently low compared to historical patterns, but homicide rates have been rising in the past two years. This is easier to show in a chart than to explain in words. So here is the national estimated homicide rate per 100,000 individuals since 1960.¹

2016 is not official and is still an estimate, but basically the pattern is this – crime has been falling generally across the country since the early 1990’s. Crime rates in just the past few years have finally dropped below levels in the 1960’s, but for the past two years homicides have been increasing. So some have pointed to the increase in the past two years and have claimed the sky is falling. To say this they say the rate of change is the largest in past 40 years. There are better charts to show rates of change (a semi-log chart), but the overall look is basically the same.

You have to really squint to see that change from 2014 to 2015 is a larger jump than any of the changes over the entire period, so arguments based on the size of recent changes in the homicide rate are hyperbole (either on a linear scale or a logarithmic scale). And even if you take the recent increases over the past two years as evidence of a more general rising trend, for a broader term pattern we still have homicide rates close to a low point in the past 50 years.

For a bit of general advice — any source that gives you a percent change you always want to see the base numbers and any longer term historical trends. Any media source that cites recent increases in homicides without providing this graph of long term historical crime trends is simply misleading. I’ve seen this done in many places, see this example from the New York Times or this recent note from the Economist. So this isn’t something specific to the President.

Now, macro criminologists don’t really have any better track record explaining these patterns than macro economists have in explaining economic trends. Basically we have a bunch of patch work theories that make sense for parts of the trend, but not the entire time frame. Changes in routine activities in 1960’s, increases in incarceration, the decline of crack use, ease of calling 911 with cell-phones, lead use, abortion (just to name a few). And academics come up with new theories all the time, the most recent being the Ferguson effect — which is simply another term for de-policing.

Now a bit on trends for specific cities. How this ties in with the national trend is that some articles have been pointing out that some cities have seen increases and some have not. That is fine to point out (albeit trivial), but then the articles frequently go on generate stories about why crime is rising in those specific places. Those on the left cite civil unrest and police brutality as possible reasons (Milwaukee, St. Louis, Chicago, Baltimore), while those on the right cite the deleterious effects of police departments not being as proactive (stops in Chicago, arrests in Baltimore).

While any of these explanations may turn out reasonable in the end, I’m pretty sure most of these articles severely underappreciate the volatility in homicide rates. Take an example with St. Louis, with a city population of just over 300,000. A homicide rate of 50 individuals per 100,000 means a total of 150 murders. A homicide rate of 40 per 100,000 means 120 murders. So we are only talking about a change of 30 murders overall. Fluctuations of around 10 in the murder rate would not be unexpected for a city with a population of 300,000 individuals. The confidence interval for a rate of 150 murders per 300,000 individuals is 126 to 176 murders.²

Even that though understates the typical volatility in homicide rates. As basically that assumes the proportion does not change over time. In reality crime statistics are more bursty, and show wilder fluctuations in different places.³ To show this for many cities, I use the data from the Economist article mentioned earlier, and create a motion chart of the changes in homicide rates over time. The idea behind this chart is a funnel chart. Cities with lower populations will show higher variance, and subsequently those dots on the left hand side of the chart will jump around alot more. The population figures are current and not varying, so the dots just move up and down on the Y axis.

For best viewing, make the X axis on the log scale, and size the points according to the population of the city. If you are at a desktop computer, you can open up a bigger version of the chart here.

Selecting individual points and then letting the animation run though illustrates the typical variability of crime over time. Here is the trace of St. Louis over the 36 year period.

New Orleans is another good example, we have fluctuations from under 30 to over 90 in the time period.

And here is Chicago, which shows less fluctuation than the smaller cities (as expected) but still has a range of homicide rates around 20 over the time period.

Howard Wainer has previously pointed this relationship out, and called it The Most Dangerous Equation. Basically, if you look you will be able to find some upward crime trends, especially in smaller cities. You need to look at it in the long term though and understand typical fluctuations to make a reasonable decision as to whether crime is increasing or if it is just typical year to year variation. The majority of news articles on the topic and just chock full of post hoc ergo propter hoc for particular cherry picked cites, and they often don’t make sense in explaining crime patterns over the past decade in those particular cities, let alone make sense for different cities experience similar conditions but not having rising homicide rates.

For my notes about data sources, generally the data have come from the FBI UCR data tool (for the 1960 through 2014 data). 2015 data have come from the FBI web page for the 2015 UCR report. The 2016 projections come from this Economist article as well as the 50 cities data for the google motion chart.↩
Calculated in R via (binom.test(150,300000)$conf.int[1:2])*300000. This is the exact Clopper-Pearson confidence interval.↩
So even though this 538 article does a better job of acknowledging volatility, whatever test they use to determine statistically significant increases is likely to have too many false positives.↩

3 Comments

by Andy Wheeler on February 9, 2017 • Permalink

Posted in Criminal Justice, Data Visualization, scholarly

Tagged crime-trends, data visualization, scholarly

Posted by Andy Wheeler on February 9, 2017

https://andrewpwheeler.com/2017/02/09/much-ado-about-nothing-overinterpreting-volatility-in-homicide-rates/

Side by side charts in SPSS

One aspect of SPSS charts that you need to use syntax for is to create side-by-side charts. Here I will illustrate a frequent use case, time series charts with different Y axes. You can download the data and code to follow along here. This is data for Buffalo, NY on reported crimes from the UCR data tool.

So after you have downloaded the csv file with the UCR crime rates in Buffalo and have imported the data into SPSS, you can make a graph of violent crime rates over time.

*Making a chart of the violent crime rate.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year ViolentCrimerate MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: ViolentCrimerate=col(source(s), name("ViolentCrimerate"))
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Violent Crime Rate per 100,000"))
  ELEMENT: line(position(Year*ViolentCrimerate))
  ELEMENT: point(position(Year*ViolentCrimerate), color.interior(color.black), color.exterior(color.white), size(size."7"))
END GPL.

I like to superimpose the points on simple line charts, to reinforce where the year observations are. Here we can see that there is a big drop post 1995 for the following four years (something that would be hard to say exactly without the points). Part of the story of Buffalo though is the general decline in population (similar to most of the rust belt part of the nation since the 70’s).

*Make a chart of the population decline.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Population MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Population=col(source(s), name("Population"))
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Population"))
  ELEMENT: line(position(Year*Population))
  ELEMENT: point(position(Year*Population), color.interior(color.black), color.exterior(color.white), size(size."7"))
END GPL.

Now we want to place these two charts over top of one another. So check out the syntax below, in particular to GRAPH: begin statements.

*Now put the two together.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Population ViolentCrimerate
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL 
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Population=col(source(s), name("Population"))
  DATA: ViolentCrimerate=col(source(s), name("ViolentCrimerate"))
  GRAPH: begin(origin(14%,12%), scale(85%, 60%))
  GUIDE: axis(dim(1), label("Year"), opposite())
  GUIDE: axis(dim(2), label("Violent Crime Rate per 100,000"))
  ELEMENT: line(position(Year*ViolentCrimerate))
  ELEMENT: point(position(Year*ViolentCrimerate), color.interior(color.black), color.exterior(color.white), size(size."7"))
  GRAPH: end()
  GRAPH: begin(origin(14%, 75%), scale(85%, 20%)) 
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Population"))
  ELEMENT: line(position(Year*Population))
  ELEMENT: point(position(Year*Population), color.interior(color.black), color.exterior(color.white), size(size."7"))  
  GRAPH: end()
END GPL.

In a nutshell, the graph begin statements allow you to chunk up the graph space to make different/arbitrary plots. The percentages start in the top right, so for the first violent crime rate graph, the origin is listed as 14% and 12%. This means the graph starts 14% to the right in the overall chart space, and 12% down. These paddings are needed to make room for the axis labels. Then for the scale part, it lists it as 85% and 60%. The 85% means take up 85% of the X range in the chart, but only 60% of the Y range in the chart. So this shows how to make the violent crime chart take a bigger proportion. of the overall chart space than the population chart. You can technically do charts with varying axes in SPSS without this, but you would have to make the panels take up an equal amount of space. This way you can make the panels whatever proportion you want.

For Buffalo the big drop in 1996 is largely due to a very large reduction in aggravated assaults (from over 3,000 in 1995 to under 1,600 in 1996). So here I superimpose a bar to viz. the proportion of all violent crimes. This wouldn’t be my first choice of how to show this, but I think it is a good illustration of how to superimpose and/or stack additional charts using this same technique in SPSS.

*Also superimposing a stacked bar chart on the total types of crimes in the background.
COMPUTE PercentAssault = (Aggravatedassault/ViolentCrimeTotal)*100.
FORMATS PercentAssault (F2.0).
EXECUTE.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Population ViolentCrimerate PercentAssault
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL 
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Population=col(source(s), name("Population"))
  DATA: ViolentCrimerate=col(source(s), name("ViolentCrimerate"))
  DATA: PercentAssault=col(source(s), name("PercentAssault"))
  GRAPH: begin(origin(14%,12%), scale(75%, 60%))
  GUIDE: axis(dim(1), label("Year"), opposite())
  GUIDE: axis(dim(2), label("Violent Crime Rate per 100,000"))
  ELEMENT: line(position(Year*ViolentCrimerate))
  ELEMENT: point(position(Year*ViolentCrimerate), color.interior(color.black), color.exterior(color.white), size(size."7"))
  GRAPH: end()
  GRAPH: begin(origin(14%, 75%), scale(75%, 20%)) 
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Population"))
  ELEMENT: line(position(Year*Population))
  ELEMENT: point(position(Year*Population), color.interior(color.black), color.exterior(color.white), size(size."7"))  
  GRAPH: end()
  GRAPH: begin(origin(14%, 12%), scale(75%, 60%)) 
  SCALE: linear(dim(2), min(0), max(60))
  GUIDE: axis(dim(1), null())
  GUIDE: axis(dim(2), label("Percent Assault"), opposite(), color(color.red), delta(10))
  ELEMENT: bar(position(Year*PercentAssault), color.interior(color.red), transparency.interior(transparency."0.7"), transparency.exterior(transparency."1.0"), size(size."5"))
  GRAPH: end()
END GPL.

While doing multiple time series charts is a common use, you can basically use your imagination about what you want to accomplish with this. Another common example is to put border histograms on scatterplot (which the GPL reference guide has an example of). Here is an example I posted recently to Nabble that has the number of individuals at risk in a Kaplan-Meier plot.

Leave a comment

by Andy Wheeler on February 7, 2017 • Permalink

Posted in Data Visualization, SPSS

Tagged data visualization, grammar of graphics, SPSS

Posted by Andy Wheeler on February 7, 2017

https://andrewpwheeler.com/2017/02/07/side-by-side-charts-in-spss/

Heatmaps in SPSS

Heatmap is a visualization term that gets used in a few different circumstances, but here I mean a regular grid in which you use color to indicate particular values. Here is an example from Nathan Yau via FlowingData:

They are often not the best visualization to use to evaluate general patterns, but they offer a mix of zooming into specific individuals, as well as to identify overall trends. In particular I like using them to look at missing data patterns in surveys in SPSS, which I will show an example of in this blog post. Here I am going to use a community survey for Dallas in 2016. The original data can be found here, and the original survey questions can be found here. I’ve saved that survey as an SPSS file you can access at this link. (The full code in one sps syntax file is here.)

So first I am going to open up the data file from online, and name the dataset DallasSurv16.

*Grab the data from online.
SPSSINC GETURI DATA
URI="https://dl.dropbox.com/s/5e07yi9hd0u5opk/Surv2016.sav?dl=0"
FILETYPE=SAV DATASET=DallasSurv16.

Here I am going to illustrate making a heatmap with the questions asking about fear of crime and victimization, the Q6 questions. First I am going to make a copy of the original dataset, as we will be making some changes to the data. I do this via the DATASET COPY function, and follow it up by activating that new dataset. Then I do a frequency to check out the set of Q6 items.

DATASET COPY HeatMap.
DATASET ACTIVATE HeatMap.
FREQ Q6_1Inyourneighborhoodduringthe TO Q69Fromfire.

From the survey instrument, the nine Q6 items have values of 1 through 5, and then a "Don’t Know" category labeled as 9. All of the items also have system missing values. First we are going to recode the system missing items to a value of 8, and then we are going to sort the dataset by those questions.

RECODE Q6_1Inyourneighborhoodduringthe TO Q69Fromfire (SYSMIS = 8)(ELSE = COPY).
SORT CASES BY Q6_1Inyourneighborhoodduringthe TO Q69Fromfire.

You will see the effect of the sorting the cases in a bit for our graph. But the idea about how to make the heatmap in the grammar of graphics is that in your data you have a variable that specifies the X axis, a variable for the Y axis, and then a variable for the color in your heatmap. To get that set up, we need to go from our nine separate Q6 variables to one variable. We do this in SPSS by using VARSTOCASES to reshape the data.

VARSTOCASES /MAKE Q6 FROM Q6_1Inyourneighborhoodduringthe TO Q69Fromfire /INDEX = QType.

So now every person who answered the survey has 9 different rows in the dataset instead of one. The original answers to the questions are placed in the new Q6 variable, and the QType variable is a number of 1 to 9. So now individual people will go on the Y axis, and each question will go on the X axis. But before we make the chart, we will add the meta-data in SPSS to our new Q6 and QType variables.

VALUE LABELS QType
  1 'In your neigh. During Day'
  2 'In your neigh. At Night'
  3 'Downtown during day'
  4 'Downtown at night'
  5 'Parks during day'
  6 'Parks at Night'
  7 'From violent crime'
  8 'From property crime'
  9 'From fire'
.
VALUE LABELS Q6
 8 "Missing" 
 9 "Don't Know"
 1 'Very Unsafe'
 2 'Unsafe'
 3 'Neither safe or unsafe'
 4 'Safe'
 5 'Very Safe'
.
FORMATS Q6 QType (F1.0).

Now we are ready for our GGRAPH statement. It is pretty gruesome but just bare with me for a second.

TEMPORARY.
SELECT IF DISTRICT = 1.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=QType ID Q6
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(800px,2000px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: QType=col(source(s), name("QType"), unit.category())
  DATA: ID=col(source(s), name("ID"), unit.category())
  DATA: Q6=col(source(s), name("Q6"), unit.category())
  GUIDE: axis(dim(1), opposite())
  GUIDE: axis(dim(2), null())
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1", color.darkred),("2", color.red),("3", color.lightgrey), 
            ("4", color.lightblue), ("5", color.darkblue), ("9", color.white), ("8", color.white)))
  SCALE: cat(dim(2), sort.data(), reverse())
  ELEMENT: polygon(position(QType*ID), color.interior(Q6), color.exterior(color.grey), transparency.exterior(transparency."0.7"))
  PAGE: end()
END GPL.
EXECUTE.

And this produces the chart,

So to start, normally I would use the chart builder dialog to make the skeleton for the GGRAPH code and update that. Here if you make a scatterplot in the chart dialog and assign the color it gets you most of the way there. But I will walk through some of the other steps.

TEMPORARY. and then SELECT IF – these two steps are to only draw a heatmap for survey responses for the around 100 individuals from council district 1. Subsequently the EXECUTE. command at the end makes it so the TEMPORARY command is over.
Then for in the inline GPL code, PAGE: begin(scale(800px,2000px)) changes the chart dimensions to taller and skinnier than the default chart size in SPSS. Also note you need a corresponding PAGE: end() command when you use a PAGE: begin() command.
GUIDE: axis(dim(1), opposite()) draws the labels for the X axis on the top of the graph, instead of the bottom.
GUIDE: axis(dim(2), null()) prevents drawing the Y axis, which just uses the survey id to displace survey responses
SCALE: cat(aesthetic maps different colors to each different survey response. Feeling safe are given blues, and not safe are given red colors. I gave neutral grey and missing white as well.
SCALE: cat(dim(2), sort.data(), reverse()), this tells SPSS to draw the Y axis in the order in which the data are already sorted. Because I sorted the Q6 variables before I did the VARSTOCASES this sorts the responses with the most fear to the top.
The ELEMENT: polygon( statement just draws the squares, and then specifies to color the interior of the squares according to the Q6 variable. I given the outline of the squares a grey color, but white works nice as well. (Black is a bit overpowering.)

So now you have the idea. But like I said this can be hard to identify overall patterns sometimes. So sometimes I like to limit the responses in the graph. Here I make a heatmap of the full dataset (over 1,500 responses), but just look at the different types of missing data. Red is system missing in the original dataset, and Black is the survey filled in "Don’t Know".

*Missing data representation.
TEMPORARY.
SELECT IF (Q6 = 9 OR Q6 = 8).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=QType ID Q6 MISSING = VARIABLEWISE
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(800px,2000px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: QType=col(source(s), name("QType"), unit.category())
  DATA: ID=col(source(s), name("ID"), unit.category())
  DATA: Q6=col(source(s), name("Q6"), unit.category())
  GUIDE: axis(dim(1), opposite())
  GUIDE: axis(dim(2), null())
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1", color.darkred),("2", color.red),("3", color.lightgrey), 
            ("4", color.lightblue), ("5", color.darkblue), ("9", color.black), ("8", color.red)))
  ELEMENT: polygon(position(QType*ID), color.interior(Q6), color.exterior(color.grey), transparency.exterior(transparency."0.7"))
  PAGE: end()
END GPL.
EXECUTE.

You can see the system missing across all 6 questions happens very rarely, I only see three cases, but there are a ton of "Don’t Know" responses. Another way to simplify the data is to use small multiples for each type of response. Here is the first graph, but using a panel for each of the individual survey responses. See the COORD: rect(dim(1,2), wrap()) and then the ELEMENT statement for the updates. As well as making the size of the chart shorter and fatter, and not drawing the legend.

*Small multiple.
TEMPORARY.
SELECT IF DISTRICT = 1.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=QType ID Q6
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(1000px,1000px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: QType=col(source(s), name("QType"), unit.category())
  DATA: ID=col(source(s), name("ID"), unit.category())
  DATA: Q6=col(source(s), name("Q6"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1), opposite())
  GUIDE: axis(dim(2), null())
  GUIDE: legend(aesthetic(aesthetic.color.interior), null())
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1", color.darkred),("2", color.red),("3", color.lightgrey), 
            ("4", color.lightblue), ("5", color.darkblue), ("9", color.white), ("8", color.white)))
  SCALE: cat(dim(2), sort.data(), reverse())
  ELEMENT: polygon(position(QType*ID*Q6), color.interior(Q6), color.exterior(color.grey), transparency.exterior(transparency."0.7"))
  PAGE: end()
END GPL.
EXECUTE.

You technically do not need to reshape the data using VARSTOCASES at first to make these heatmaps (there is an equivalent VARSTOCASES command within GGRAPH you could use), but this way is simpler in my opinion. (I could not figure out a way to use multiple response sets to make these non-aggregated charts, so if you can figure that out let me know!)

The idea of a heatmap can be extended to much larger grids — basically any raster graphic can be thought of as a heatmap. But for SPSS you probably do not want to make heatmaps that are very dense. The reason being SPSS always makes its charts in vector format, you cannot tell it to just make a certain chart a raster. So a very dense heatmap will take along time to render. But I like to use them in some situations as I have shown here with smaller N data in SPSS.

Also off-topic, but I may be working on a cook-book with examples for SPSS graphics. If I have not already made a blog post let me know what you would examples you would like to see!

2 Comments

by Andy Wheeler on January 11, 2017 • Permalink

Posted in Data Visualization, SPSS

Tagged data visualization, heatmap, SPSS, survey

Posted by Andy Wheeler on January 11, 2017

https://andrewpwheeler.com/2017/01/11/heatmaps-in-spss/

Group based trajectory models in Stata – some graphs and fit statistics

For my advanced research design course this semester I have been providing code snippets in Stata and R. This is the first time I’ve really sat down and programmed extensively in Stata, and this is a followup to produce some of the same plots and model fit statistics for group based trajectory statistics as this post in R. The code and the simulated data I made to reproduce this analysis can be downloaded here.

First, for my own notes my version of Stata is on a server here at UT Dallas. So I cannot simply go

net from http://www.andrew.cmu.edu/user/bjones/traj
net install traj, force

to install the group based trajectory code. First I have this set of code in the header of all my do files now

*let stata know to search for a new location for stata plug ins
adopath + "C:\Users\axw161530\Documents\Stata_PlugIns"
*to install on your own in the lab it would be
net set ado "C:\Users\axw161530\Documents\Stata_PlugIns"

So after running that then I can install the traj command, and Stata will know where to look for it!

Once that is taken care of after setting the working directory, I can simply load in the csv file. Here my first variable was read in as Ã¯id instead of id (I’m thinking because of the encoding in the csv file). So I rename that variable to id.

*Load in csv file
import delimited GroupTraj_Sim.csv
*BOM mark makes the id variable weird
rename Ã¯id id

Second, the traj model expects the data in wide format (which this data set already is), and has counts in count_1, count_2 … count_10. The traj command also wants you to input a time variable though to, which I do not have in this file. So I create a set of t_ variables to mimic the counts, going from 1 to 10.

*Need to generate a set of time variables to pass to traj, just label 1 to 10
forval i = 1/10 { 
  generate t_`i' = `i'
}

Now we can estimate our group based models, and we get a pretty nice default plot.

traj, var(count_*) indep(t_*) model(zip) order(2 2 2) iorder(0)
trajplot

Now for absolute model fit statistics, there are the average posterior probabilities, the odds of correct classification, and the observed classification proportion versus the expected classification proportion. Here I made a program that I will surely be ashamed of later (I should not brutalize the data and do all the calculations in matrix), but it works and produces an ugly table to get us these stats.

*I made a function to print out summary stats
program summary_table_procTraj
    preserve
    *updating code to drop missing assigned observations
    drop if missing(_traj_Group)
    *now lets look at the average posterior probability
	gen Mp = 0
	foreach i of varlist _traj_ProbG* {
	    replace Mp = `i' if `i' > Mp 
	}
    sort _traj_Group
    *and the odds of correct classification
    by _traj_Group: gen countG = _N
    by _traj_Group: egen groupAPP = mean(Mp)
    by _traj_Group: gen counter = _n
    gen n = groupAPP/(1 - groupAPP)
    gen p = countG/ _N
    gen d = p/(1-p)
    gen occ = n/d
    *Estimated proportion for each group
    scalar c = 0
    gen TotProb = 0
    foreach i of varlist _traj_ProbG* {
       scalar c = c + 1
       quietly summarize `i'
       replace TotProb = r(sum)/ _N if _traj_Group == c 
    }
	gen d_pp = TotProb/(1 - TotProb)
	gen occ_pp = n/d_pp
    *This displays the group number [_traj_~p], 
    *the count per group (based on the max post prob), [countG]
    *the average posterior probability for each group, [groupAPP]
    *the odds of correct classification (based on the max post prob group assignment), [occ] 
    *the odds of correct classification (based on the weighted post. prob), [occ_pp]
    *and the observed probability of groups versus the probability [p]
    *based on the posterior probabilities [TotProb]
    list _traj_Group countG groupAPP occ occ_pp p TotProb if counter == 1
    restore
end

This should work after any model as long as the naming conventions for the assigned groups are _traj_Group and the posterior probabilities are in the variables _traj_ProbG*. So when you run

summary_table_procTraj

You get this ugly table:

     | _traj_~p   countG   groupAPP        occ     occ_pp       p    TotProb |
     |-----------------------------------------------------------------------|
  1. |        1      103   .9379318   43.57342   43.60432   .2575   .2573645 |
104. |        2      136   .9607258   47.48513   48.30997     .34   .3361462 |
240. |        3      161   .9935605   229.0413   225.2792   .4025   .4064893 |

The groupAPP are the average posterior probabilities – here you can see they are all quite high. occ is the odds of correct classification, and again they are all quite high. Update: Jeff Ward stated that I should be using the weighted posterior proportions for the OCC calculation, not the proportions based on the max. post. probability (that I should be using TotProb in the above table instead of p). So I have updated to include an additional column, occ_pp based on that suggestion. I will leave occ in though just to keep a paper trail of my mistake.

p is the proportion in each group based on the assignments for the maximum posterior probability, and the TotProb are the expected number based on the sums of the posterior probabilities. TotProb should be the same as in the Group Membership part at the bottom of the traj model. They are close (to 5 decimals), but not exactly the same (and I do not know why that is the case).

Next, to generate a plot of the individual trajectories, I want to reshape the data to long format. I use preserve in case I want to go back to wide format later and estimate more models. If you need to look to see how the reshape command works, type help reshape at the Stata prompt. (Ditto for help on all Stata commands.)

preserve
reshape long count_ t_, i(id)

To get the behavior I want in the plot, I use a scatter plot but have them connected via c(L). Then I create small multiples for each trajectory group using the by() option. Before that I slightly jitter the count data, so the lines are not always perfectly overlapped. I make the lines thin and grey — I would also use transparency but Stata graphs do not support this.

gen count_jit = count_ + ( 0.2*runiform()-0.1 )
graph twoway scatter count_jit t_, c(L) by(_traj_Group) msize(tiny) mcolor(gray) lwidth(vthin) lcolor(gray)

I’m too lazy at the moment to clean up the axis titles and such, but I think this plot of the individual trajectories should always be done. See Breaking Bad: Two Decades of Life-Course Data Analysis in Criminology, Developmental Psychology, and Beyond (Erosheva et al., 2014).

While this fit looks good, this is not the correct number of groups given how I simulated the data. I will give those trying to find the right answer a few hints; none of the groups have a higher polynomial than 2, and there is a constant zero inflation for the entire sample, so iorder(0) will be the correct specification for the zero inflation part. If you take a stab at it let me know, I will fill you in on how I generated the simulation.

198 Comments

by Andy Wheeler on October 6, 2016 • Permalink

Posted in Data Visualization, Stata

Tagged data visualization, group-based-trajectory, panel-data, Stata

Posted by Andy Wheeler on October 6, 2016

https://andrewpwheeler.com/2016/10/06/group-based-trajectory-models-in-stata-some-graphs-and-fit-statistics/

Plotting panel data with many lines in SPSS

A quick blog post – so you all are not continually assaulted by my mug shot on the front page of the blog!

Panel data is complicated. When conducting univariate time series analysis, pretty much everyone plots the series. I presume people do not do this often for panel data because the charts tend to be more messy and less informative. But by using transparency and small multiple plots are easy places to start to unpack the information. Here I am going to show these using plots of arrest rates from 1970 through 2014 in New York state counties. The data and code can be downloaded here, and that zip file contains info. on where the original data came from. It is all publicly available – but mashing up the historical census data for the population counts by county is a bit of a pain.

So I will start with grabbing my created dataset, and then making a default plot of all the lines. Y axis is the arrest rate per 1,000 population, and the X axis are years.

*Grab the dataset.
FILE HANDLE data /NAME = "!!Your File Handle Here!!!".
GET FILE = "data\Arrest_WPop.sav".
DATASET NAME ArrestRates.

*Small multiple lines over time - default plot.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Total_Rate County 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Total_Rate=col(source(s), name("Total_Rate"))
  DATA: County=col(source(s), name("County"), unit.category())
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Total Arrest Rate per 1,000"))
  ELEMENT: line(position(Year*Total_Rate), split(County))
END GPL.

That is not too bad, but we can do slightly better by making the lines small and semi-transparent (which is the same advice for dense scatterplots):

*Make them transparent and smaller.
FORMATS Total_Rate (F2.0).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Total_Rate County 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Total_Rate=col(source(s), name("Total_Rate"))
  DATA: County=col(source(s), name("County"), unit.category())
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Total Arrest Rate per 1,000"))
  SCALE: linear(dim(1), min(1970), max(2014))
  ELEMENT: line(position(Year*Total_Rate), split(County), transparency(transparency."0.7"), size(size."0.7"))
END GPL.

This helps disentangle the many lines bunched up. There appear to be two outliers, and basically the rest of the pack.

A quick way to check out each individual line is then to make small multiples. Here I wrap the panels, and make the plot size bigger. I also make the X and Y axis null. This is ok though, as I am just focusing on the shape of the trend, not the absolute level.

*Small multiples.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Total_Rate County 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(1000px,1000px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Total_Rate=col(source(s), name("Total_Rate"))
  DATA: County=col(source(s), name("County"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1), null())
  GUIDE: axis(dim(2), null())
  GUIDE: axis(dim(3), opposite())
  SCALE: linear(dim(1), min(1970), max(2014))
  ELEMENT: line(position(Year*Total_Rate*County))
  PAGE: end()
END GPL.
*Manually edited to make less space between panels.

There are a total of 62 counties in New York, so this is feasible. With panel sets of many more lines, you can either split the small multiple into more graphs, or cluster the lines based on the overall shape of the trend into different panels.

Here you can see that the outliers are New York county (Manhattan) and Bronx county. Bronx is a pretty straight upward trend (which mirrors many other counties), but Manhattan’s trajectory is pretty unique and has a higher variance than most other places in the state. Also you can see Sullivan county has quite a high rate compared to most other upstate counties (upstate is New York talk for everything not in New York City). But it leveled off fairly early in the time series.

This dataset also has arrest rates broken down by different categories; felony (drug, violent, dwi, other), and misdemeanor (drug, dwi, property, other). It is interesting to see that arrest rates have been increasing in most places over this long time period, even though crime rates have been going down since the 1990’s. They all appear to be pretty correlated, but let me know if you use this dataset to do some more digging. (It appears index crime totals can be found going back to 1990 here.)

11 Comments

by Andy Wheeler on August 12, 2016 • Permalink

Posted in Data Visualization, SPSS

Tagged data visualization, panel-data, small-multiples, SPSS

Posted by Andy Wheeler on August 12, 2016

https://andrewpwheeler.com/2016/08/12/plotting-panel-data-with-many-lines-in-spss/

Search for:
Recent Posts
Categories
Categories
Site RSS Feeds
- RSS - Posts
- RSS - Comments
Follow Blog via Email

Enter your email address to follow this blog and receive notifications of new posts by email.

Email Address:

Join 388 other subscribers
aoristic big-data cartography census choropleth citeulike consulting cost-benefit courses crime-mapping crime-trends Crime Analysis Criminal Justice data-manipulation data visualization deep-learning ESRI excel flow-data folium geocoding github google-streetview-api grammar of graphics group-based-trajectory gun-violence healthcare homicide-rates hot spots hypothesis-testing linear programming LLM logistic-regression machine-learning MACRO mapping matplotlib meta network NetworkX officer-involved-shooting open-science paper Papers peer-review Poisson prediction Predictive-Policing preprint presentation Python Python-programability pytorch quasi-experiment r recidivism regression resources scholarly scraping seaborn shootings simulation small-multiples social-media social-networking SPSS stackexchange Stata statistics survey time-series uncertainty wdd web-scraping
Top Posts & Pages
Stack Exchange

All posts tagged data visualization

Compiling the documents

Setting up the online store

Print on Demand using LuLu

Other Random Book Stuff

Personal Goals

References

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange