Jane Jacobs on Neighborhoods

Google image on 5/4/16 – Jane Jacobs 100th birthday.

Continuing on with my discussion of neighborhoods, it seems apt to talk about how Jane Jacobs defines neighborhoods in her book The Death and Life of Great American Cities. She is often given perfunctory citations in criminology articles for the idea that mixed-use neighborhoods (those with both residential and commercial zoning together) increase “eyes on the street” and thus can reduce crime. I will write another post more fully about informal social control and how her ideas fit in, but here I want to focus on her conception of neighborhoods.

She asks the question, what do neighborhoods accomplish – and from this attempts to define what a neighborhood is. She feels a neighborhood can only be defined in terms of having actual political capital to use for its constituents – so a neighborhood is any region that can organize enough to act as an independent polity and campaign against the larger government in which it is nested. This is quite different from the current conception of neighborhoods in most social sciences – in which we assume neighborhoods exist, calculate their effects on behavior, and then tautologically say they exist when we find they do affect behavior.

Based on her polity idea Jacobs defines three levels of possible neighborhoods:

  • your street
  • a larger district of persons – around 100,000
  • the entire city

This again is quite different from most of current social science. Census tracts and block groups are larger swaths than just one block. In very dense cities like NYC, census block groups are often the square defined by streets on either side, so they split apart people across the street from one another. Census tracts are intentionally drawn to contain between roughly 1,200 and 8,000 people. For a counter-example, the criminologist Ralph Taylor does think streets are neighborhood units. A hearsay paraphrase of his I recently heard was “looking out your front door, all that matters is how far you can see down the street in either direction”.

I think Jacobs’ guesstimate of 100,000 may be partly biased from only thinking about giant metropolises. Albany itself has only around 100,000 residents – so per her definition it is either streets or the whole city. Although I can’t argue much about smaller areas having little to no political capital to accomplish specific goals.

In some of the surveys I have participated in, they have asked the question of how you define your neighborhood. Here are the responses for two different samples (from the same city at two different time points) as an example:

I think the current approach – that neighborhoods are defined by their coefficients on the right-hand side of regression models – is untenable in the end, based on ideas that are derivatives of the work in my dissertation. Mainly that discrete neighborhood boundaries are a fiction of social scientists.

Neighborhoods in Albany according to Google

One of the most vexing aspects of spatial analysis in the social sciences is the concept of neighborhoods. There is a large literature on neighborhood effects in criminology, but no one can really define a neighborhood. For analysis they are most often assumed to approximately conform to census areas (like tracts or blocks). Sometimes there are obvious physical features that divide neighborhoods (most often a major roadway), but more often boundaries are fuzzy.

I’ve worked on several surveys (at the Finn Institute) in which we ask people what neighborhood they live in as well as the nearest intersection to their home. Even where there is a clear border, people often say the “wrong” neighborhood, especially near the borders. IIRC, when I calculated the wrongness for one survey we did in Syracuse, respondents stated the “right” neighborhood only around 60% of the time. I put scare quotes around “wrong” because it is obviously arbitrary where people draw the boundaries, so people saying the wrong neighborhood is more indicative of the borders being misaligned than of the respondents being wrong.

For this reason I like the Google maps approach in which they just place a label at the approximate center of noteworthy neighborhoods. I emulated this for a recent background map I made for a paper in Albany. (Maps can be opened in a separate tab to see a larger image.)

As background, I did not grow up in Albany, but I’ve lived and worked in the Capital District since I came up to Albany for grad school in 2008. That, and the fact that I make maps of Albany on a regular basis, is my defense that I have a reasonable background to make such judgements.

When looking at Google’s reverse geocoding API the other day I noticed they return a neighborhood field in the response. So I created a regular sampling grid over Albany to see what they return. First, let’s see my grid and where Google actually decides some neighborhood exists. Large grey circles are null, and small red circles are points where some neighborhood label was returned. I have no idea where Google culls such neighborhood labels from.

See my python code at the end of the post to see how I extracted this info given an input lat-lng. The reverse geo API returns multiple addresses – but I only examine the first returned address and look for a neighborhood. (So I could have missed some neighborhoods this way – it would take more investigation.)
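
Creating the regular sampling grid itself is just a nested loop over lat-lng values. Here is a minimal sketch of what that looks like – the bounding box coordinates and step size are rough placeholders, not the exact values I used, and GoogRevGeo is the reverse geocoding function given at the end of the post.

#Rough sketch of the sampling grid - bounding box values are approximate placeholders
import numpy as np

lat_vals = np.arange(42.60, 42.73, 0.005)   #approximate south-north extent of Albany
lng_vals = np.arange(-73.93, -73.75, 0.005) #approximate west-east extent

res = []
for lat in lat_vals:
  for lng in lng_vals:
    neigh = GoogRevGeo(lat, lng, api="yourkeyhere") #None when no neighborhood label is returned
    res.append((lat, lng, neigh))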

Given the input fishnet I then dissolved the neighborhood labels into areas. Google has quite a few more specific neighborhoods than me.

I’ve never really made much of a distinction between West Hill and Arbor Hill – although the split is clearly at Henry Johnson. Also I tend to view Pine Hill as the triangle between Western and Central before the State campus – but Google and others seem to disagree with me. What I call the Pinebush Google calls the Dunes. Dunes is appropriate, because it actually has sand dunes, but I can’t recall anyone referring to it as that. Trees are pretty hard to come by in Arbor Hill though, so don’t be misled. Also kill is Dutch for creek, so you don’t have to worry that Normanskill is such a bad place (even if your name is Norman).

For a third opinion, see albany.com

You can see more clearly in this map how Pine Hill’s area goes south of Madison. Google maps has a fun feature showing related maps, and one of the related maps is someone’s take on where law students should or should not get an apartment. In that map you can see that south of Madison is affectionately referred to as the student ghetto. That comports with my opinion as well, although I did not think putting “student ghetto” was appropriate for my basemap for a journal article!

People can’t seem to help but shade Arbor Hill in red. This sometimes may be innocent – if red is the first color used in defaults (as Arbor Hill will be the first neighborhood in an alphabetical list). But presumably the law student making the apartment suggestions map should know better.

In short, it would be convenient for me (as a researcher) if everyone could agree with what a neighborhood is and where its borders are, but that is not reality.


Here is the function in Python to grab the neighborhood via the Google reverse geocoding API. If the API returns anything, the function grabs the first address returned and searches the json for a neighborhood. If it does not find a neighborhood it returns None.

#Reverse geocoding and looking up neighborhoods
import urllib, json

def GoogRevGeo(lat,lng,api=""):
  base = r"https://maps.googleapis.com/maps/api/geocode/json?"
  GeoUrl = base + "latlng=" + str(lat) + "," + str(lng) + "&key=" + api
  response = urllib.urlopen(GeoUrl) #Python 2 urllib call
  jsonRaw = response.read()
  jsonData = json.loads(jsonRaw)
  neigh = None
  if jsonData['status'] == 'OK':
    #only check the first returned address for a neighborhood component
    for i in jsonData['results'][0]['address_components']:
      if i['types'][0] == 'neighborhood':
        neigh = i['long_name']
        break
  return neigh

Shape, color, and pattern constants in SPSS charts

I have a version of the SPSS (Statistics) Version 24 GPL reference guide bookmarked here. The reference guide is great to skim through and see what is possible in SPSS charts – especially the set of examples on pages 329 to 411.

On page 413 they also give a set of constant colors, shapes, and texture patterns you can use in charts. Colors you can also specify on an RGB scale, but it is often convenient to just say color.red or color.pink etc. For shapes and patterns, for practical purposes you have to choose among the constants. (Technically in the chart template you can edit the cycle styles, and change a circle to an ellipse for example, or change the points to a dash pattern, but this would be painful for anything besides a few constants.)

Here is a handy reference guide to actually visualize those constants. For many you can guess what they look like, but the colors are more subtle. Who knew there were differences between tomato, salmon, and pink! (The tomato is more like tomato soup color.)

Here are the color constants (you can open the chart in a new tab to see a larger image):

The shape constants:

The elbow and the elbowArrow do not look correct – but it will take some more time to investigate. The others look ok to me though. (The number of sides and star points appear to be something you can also manipulate in the chart template cycles, if for some reason you want a hendecagon.)

And here are the pattern constants. I plot them with a grey filled interior – you can see some specifically only have an outline and always have a transparent fill:

Here is code to replicate the charts, and here is a PDF to download with the constants. The colors and shapes are hard to read because they are squeezed in so small, but you can zoom into the PDF and read the categories. I haven’t used dashed lines ever, so I omit those constants here. (Patterns I use pretty rarely, but I have used them if there are only two categories.)

A useful change for the colors would be sorting them in a logical order; they are currently just in alphabetical order. I am too lazy though to convert the colors to a colorspace and sort them. (Maybe converting the PDF to SVG would do the trick easily enough.)
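
For what it is worth, if you did have the RGB values handy, sorting by hue is only a few lines in Python. This is just a sketch with example triplets – approximate web color values, not necessarily the actual SPSS constants.

#Sketch of sorting colors by hue/saturation/value
#RGB triplets here are approximate web colors, not the actual SPSS constants
import colorsys

colors = [("tomato",255,99,71), ("salmon",250,128,114), ("pink",255,192,203)]

def hsv_key(c):
  name, r, g, b = c
  return colorsys.rgb_to_hsv(r/255., g/255., b/255.)

print(sorted(colors, key=hsv_key))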

Posting my peer reviews on Publons and a few notes about reviewing

Publons is a service that curates your peer review work. In a perfect world this would be done by publishers – they would just forward your reviews with some standard meta-data. This would be useful when one paper is reviewed multiple times, as well as for identifying good vs. poor reviewers. I forget where I saw the suggestion recently (maybe on the orgtheory or scatterplot blog), but someone mentioned it would be nice if, when you submit your paper to a different journal, you could forward your previous reviews at your discretion. I wouldn’t mind that at all, as often the best journals will reject for lesser reasons because they can be more selective. (Also I’ve gotten copy-paste identical reviews from different journals, even though I had updated the draft to address some of the comments. Forwarding would allow me to address those comments directly before the revise-and-resubmit decision.)

I’ve posted all of my reviews so far, but they are only public if the paper is eventually accepted. So here you can see my review for the recent JQC article Shooting on the Street: Measuring the Spatial Influence of Physical Features on Gun Violence in a Bounded Street Network by Jie Xu and Elizabeth Griffiths.

I’ve done my fair share of complaining about reviews before, but I don’t think the whole peer-review process is fatally flawed despite its low reliability. People take peer review a bit too seriously at times – but that is a problem for most academics in general. Even if you think your idea is not getting a fair shake, just publish it yourself on your website (or places like SSRN and ArXiv). This of course does not count towards things like tenure – but valuing quantity over quality is another separate problem currently in academia.


In the spirit of do unto others as you would have them do unto you, here are two main points I try to abide by when I review papers.

  • be as specific as possible in your critique

There is nothing more frustrating than getting a vague critique (e.g., “the paper has multiple misspellings and grammar issues”). A frequent one I have come across (both in reviews of my papers and in comments others have made on papers I’ve reviewed) is about the framing of the paper – a.k.a. the literature review. (Which makes sending the paper to multiple journals so frustrating – you will always get more arbitrary framing debates each time with new reviewers.)

So for a few examples:

  • (bad) The literature review is insufficient
  • (good) The literature review skips some important literature, see specifically (X, 2004; Y, 2006; Z, 2007). The description of (A, 2000) is awkward/wrong.
  • (bad) The paper is too long, it can be written in half the length
  • (better) The paper could be shortened, section A.1 can be eliminated in my opinion, and section A.2 can be reduced to one paragraph on X.

Being specific provides a clear path for the author to correct what you think, or at least respond if they disagree. The “you can cut the paper in half” critique I don’t even know how to respond to effectively, nor the generic complaint about spelling. One I’ve gotten before is “x is an inappropriate measure” with no other detail. This is tricky because I have to guess why you think it is inappropriate, so I have to make your argument for you (mind read) and then respond why I disagree (which obviously I do, or I wouldn’t have included that measure to begin with). So to respond to a critique I first have to critique my own paper – maybe this reviewer is more brilliant than I thought.

Being specific I also think helps cut down on arbitrary complaints that are arguable.

  • provide clear signals to the editor, both about main critiques and the extent to which they can be addressed

Peer review has two potential motivations: one is gate-keeping and one is improving the draft. Often arbitrary advice by reviewers intended for the latter is not clearly delineated in the review, so it is easily confused for evidence pertinent to the gate-keeping function. I’ve gotten reviews of 20 bullet points or 2,000 words that make the paper seem poor due to the sheer length of the comments, but the majority are minor points or arbitrary suggestions. Longer reviews actually suggest the paper is better – if there is something clearly wrong you can say it in a much shorter space.

Gabriel Rossman states these different parts of peer review a bit more succinctly than me:

You need to adopt a mentality of “is it good how the author did it” rather than “how could this paper be made better”

I think this is a good quip to follow. I might add “don’t sweat the small stuff” to that as well. Some editors will read the paper and reviews and make judgement calls – but some editors just follow the reviewers blindly – so I worry that a 20-bullet-point minor review unduly influences a reject decision. I’m happy to respond to the bullets, and happy you took the time, but I’m not happy about you (the reviewer) not giving clear advice to the editor about the extent to which I can address those points.

I still give advice about improving the manuscript, but I try to provide clear signals to the editor about the main critiques, and I also explicitly state whether they can be addressed. The “can be addressed” part is not for the person writing the paper – it is for the editor making the decision about whether to allow a revise-and-resubmit! The main critiques in my experience entail 2-3 main points (or none at all for some papers). I also typically say when things are minor and put them in a separate section, which editors can pretty much ignore.

Being a quantitative guy, the ones that frustrate me the most are complaints about model specifications. Some are legitimately major complaints, but often they are things that are highly unlikely to greatly influence the reported results. Examples are adding/dropping/changing a particular control variable, or changes in the caliper used for propensity score matching. Note I’m not saying you shouldn’t ask to see differences – I’m asking that you clearly articulate why your suggestion is preferable and make an appropriate judgement as to whether it is a big problem or a little problem. A similar complaint is what information to include in tables, or in the main manuscript versus the appendix. The author already has the information, so it is minor editing, not a major problem.


While I am here I will end with three additional complaints that don’t fit anywhere previously in my post. One, multiple rounds of review are a total waste. The life cycle of the paper review should be:

paper -> review -> editor decision -> reject/accept
                                          or
                                      revise-resubmit -> new paper + responses to reviews -> editor decision

The way the current system works, I have to submit another review after the new paper has been submitted. I would rather the editor take the time to see if the authors sufficiently addressed the original complaints, because as a reviewer I am not an unbiased judge of that. So if I say something is important and you retort that it is not, what else do you want me to say in my second review? It is the editor’s job at that point to arbitrate disagreements. This would stop the cycle of multiple rounds of review, which have sharply diminishing returns in improving the draft.

This then leads into my second complaint, generally about keeping a civil tone in reviews. In general I don’t care if a reviewer is a bit gruff in the review – it is not personal. But since reviewers have a second go, when I respond I need to keep an uber-deferential tone. I don’t think that is really necessary, and I’d rather the original authors have similar latitude to be gruff in responses. Reviewers say stupid things all the time (myself included) and you should be allowed to retort that my critique is stupid! (Of course with reasoning as to why it is stupid…)

Finally, I had one reviewer start recently:

This paper is interesting and very well written…I will not focus on the paper’s merits, but instead restrict myself to areas where it can be improved.

The good is necessary to signal to the editor whether a paper should be published. I’ve started to try in my own reviews to include more of the good (which is definitely not the norm) and argue why a paper should be published. You can see in my linked review of the Xu and Griffiths paper that by the third round I simply gave arguments for why the paper should be published, despite a disagreement about the change-point model they reported in the paper.

Using the Google Geocoding API with Python

Previously I posted how to use the geopy python library to call the Google geocode API. But somewhere along the way my version of geopy stopped working (maybe because the API changed). Instead of figuring out that problem, I just wrote my own function to call the Google API directly. No need to worry about installing geopy.

Part of the reason I blog is so I have notes for myself – I’m pretty sure I’ve rewritten this several times for different quick geocoding projects, but I couldn’t find them this morning when I needed to do it again. So here is a blog post for my own future reference.

Here is the function; it takes as input the full string address. Also I was getting back some null responses by rapid-fire calling the API (with only 27 addresses), so I set the function to delay for five seconds and that seemed to fix the problem.

import urllib, json, time
def GoogGeoAPI(address,api="",delay=5):
  base = r"https://maps.googleapis.com/maps/api/geocode/json?"
  addP = "address=" + address.replace(" ","+") #simple url encoding of the address
  GeoUrl = base + addP + "&key=" + api
  response = urllib.urlopen(GeoUrl) #Python 2 urllib call
  jsonRaw = response.read()
  jsonData = json.loads(jsonRaw)
  if jsonData['status'] == 'OK':
    resu = jsonData['results'][0] #only grab the first returned match
    finList = [resu['formatted_address'],resu['geometry']['location']['lat'],resu['geometry']['location']['lng']]
  else:
    finList = [None,None,None]
  time.sleep(delay) #in seconds
  return finList

And here is an example use of the function. It returns the formatted address, the latitude and the longitude.

#Example Use
test = r"1600 Amphitheatre Parkway, Mountain View, CA"
geoR = GoogGeoAPI(address=test)
print geoR

This works for a few addresses without an API key. Even with an API key though, the limit I believe is 2,500 requests per day – so don’t use this to geocode a large list. Also if you have some special characters in your address field this will take more work. For example if you have an & for an intersection I bet this url call will fail. But that should not be too hard to deal with. Also note the terms of service for using the API (which I don’t understand – so don’t ask me!)
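
If you do run into addresses with special characters, one simple fix is to URL-encode the address instead of just swapping spaces for plus signs. Here is a quick sketch for the Python 2 version (the intersection address is just a made-up example; the Python 3 update below already handles this with urllib.parse.quote_plus):

#URL encode the address instead of just replacing spaces (Python 2)
import urllib
addP = "address=" + urllib.quote_plus("Central Ave & Main St, Albany, NY") #hypothetical intersection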

I should eventually wrap up all of this google API python code into an extension for SPSS. Don’t hold your breath though for me getting the time to do that.


Here is an update for Python 3+ (the urllib library changed a bit). Also shows how to extract out the postal code.

#Update For Python 3+
#Also includes example parsing out the postal code
import urllib.request, urllib.parse 
import json, time
key = r'???!!!your key here!!!!????'

def GoogGeoAPI(address,api="",delay=3):
  base = r"https://maps.googleapis.com/maps/api/geocode/json?"
  addP = "address=" + urllib.parse.quote_plus(address)
  GeoUrl = base + addP + "&key=" + api
  response = urllib.request.urlopen(GeoUrl)
  jsonRaw = response.read()
  jsonData = json.loads(jsonRaw)
  if jsonData['status'] == 'OK':
    resu = jsonData['results'][0]
    post_code = -1
    for i in resu['address_components']:
      if i['types'][0] == 'postal_code':
        post_code = i['long_name'] #not sure if everything always has a long name?
    finList = [resu['formatted_address'],resu['geometry']['location']['lat'],resu['geometry']['location']['lng'],post_code]
  else:
    finList = [None,None,None,None]
  time.sleep(delay) #in seconds
  return finList
  
test = r"1600 Amphitheatre Parkway, Mountain View, CA"
geoR = GoogGeoAPI(address=test,api=key,delay=0)
print(geoR)

Presentation at ACJS 2016

I will be presenting at the ACJS (Academy of Criminal Justice Sciences) conference in Denver in a few days. My talk will be on some of the work I have been conducting with the Albany Police Department via the Finn Institute (Rob Worden and Sarah McLean are co-authors on the presentation). The title is Making stops smart: Predicting arrest rates from discretionary police stops at micro places in Albany, NY and here is the abstract:

Police stops are one of the most invasive, but regularly used crime control tactics by police. Similar to how focusing police resources at hot spots of crime can improve police efficiency, here we examine the spatial variation in arrest rates at micro places (street segments and intersections) in Albany, NY. Using data from over 240,000 discretionary police stops, we fit random effects logistic regression models to predict the probability of an arrest at different micro places. We show that like hot spots, there are examples of high arrest rate locations next to low arrest rate locations. Using a simulation, we show that if one displaced stops from low arrest locations to high arrest locations, one could make half as many stops but still have the same number of total arrests.

Here is a funnel chart of the arrest hit rates at micro-places across the city. You can see quite a bit of extra variation in arrest rates to attempt to explain.

I am giving this at 8 am on Thursday (see Event #185 in the program).

There will be two other presentations (Ling Wu is not going to make it at the moment), and they are:

  • Results from a victim generated crime mapping software, Zavin Nazaretian et al. – Indiana University of PA
  • Spatial analysis of aggravated assault and homicide crime scene, arrest and offender residence locations in Houston, TX, Elishewah Weisz – Sam Houston

So if you are interested in crime mapping stuff it should be a good session.

Feel free to bug me if you see me around at ACJS.

Also before I forget, my co-workers are presenting a poster on analysis of Syracuse Truce – a focused deterrence gang intervention. The posters are on Friday, so I won’t be around unfortunately. The title is Gangs, groups, networks, and deterrence: An evaluation of Syracuse Truce. (See poster #45 in the same program I linked to earlier.) Rob and Kelly will at least be manning the poster though – so you can go and bug them about the details!

Here is a picture of the reach of call-ins for one particular gang. The idea is for those who attended call ins to spread the message to other members. So this graph evaluates how well the call-ins would be expected to reach all of the members of the gang.

If you are wondering what I do for my job – yes I pretty much just make maps and graphs all day long 😉

Some GIS data scraping adventures: Banksy graffiti and gang locations in NYC

I’ve recently scraped some geographic data that I may use in my graduate level GIS course. I figured I would share with everyone, and take some time to describe for others how I scraped the data.

So to start, if you read an online article and it has a webmap with some GIS data in it – the data exists somewhere. It won’t always be the case that you can actually download the data, but for the most current and popular interactive mapping tools, the data is often available if you know where to look.

For example, I asked on the GIS stackexchange site awhile ago how you can download the point data in this NYC homicide map from the Times. I had emailed the reporters multiple times and they did not respond. A simple solution the answerers suggested was to use the website developer tools to see what was being loaded when I refreshed the page. It happened that the map was being populated by two simple text files (1,2).

It may be an interesting project to see how this compares (compiling via news stories) versus official data, which NYC recently released going back to 2006. Especially since such crowdsourced news datasets are used for other things, like counting mass shootings.

The two example mapping datasets I provide below involve a slightly different process to get the underlying data – but it is just as easy. Many current webmaps use geojson files as the backend. What I did for the two examples below is I just looked at the html source for the website, and looked for json data formats – links that specify “js” or “json” extensions. If you click through those external json links you can see if they have the data.

The other popular map type comes from ESRI. You can typically find an ESRI server populating the map, and if the website has, say, a parcel data lookup you can often find an ESRI geocoding server (see here for one example of using an ESRI geocoding api). The maps unfortunately do not always have exposed data. Sometimes what looks like vector data is actually just static PNG tiles. Council Districts in this Dallas map are an example. If you dig deep enough, you can find the PNG tiles for the council districts, but that does not do anyone much good. Pretty much all of those layers are available for download from other sources though. A similar thing happens with websites with crime reports, such as RAIDS Online or CrimeReports.com. They intentionally build the web map so you cannot scrape the data.

So that said, before we go further though – it should go without saying that you should not steal/plagiarize people’s articles or simply rip-off their graphics. Conducting new analysis with the publicly available data though seems fair game to me.

Banksy Taggings in NYC

There was a recent stink in the press about Kim Rossmo and company using geographic offender profiling to identify the likely home location of the popular graffiti artist Banksy. Here is the current citation of the journal article for those interested:

Hauge, M. V., Stevenson, M. D., Rossmo, D. K., and Le Comber, S. C. (2016). Tagging banksy: using geographic profiling to investigate a modern art mystery. Journal of Spatial Science, pages 1-6. doi:10.1080/14498596.2016.1138246

The article uses data from Britain, so I looked to see if his taggings in other places were available. I came across this article showing a map of locations in New York City. So I searched for where the data was coming from, and found the json file that contains the point data here. I just built a quick excel spreadsheet to parse the data, and you can download that spreadsheet here.

Gang Locations

This article posts a set of gang territories in NYC. This is pretty unique – I am unfamiliar with any other public data source that identifies gang territories. So I figured it would be a potential fun project for students in my GIS course – for instance in overlaying with the 311 graffiti data.

Again the data at the backend is in json format, and can be found here. Converting this data to a shapefile is a bit challenging, as it has points, lines, and polygons all in the same file. What I did was buffer the lines and points by a small amount to be able to stuff them all in one shapefile. A zip file of that shapefile can be downloaded here.
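
If you want to replicate that buffer-and-combine step yourself, here is a rough sketch using geopandas. This is not my original workflow – the file name, projection, and buffer distance are just placeholder assumptions.

#Sketch: buffer points/lines by a small amount so everything can live in one polygon shapefile
import geopandas as gpd

gdf = gpd.read_file("nyc_gangs.json")  #hypothetical local copy of the json data
gdf = gdf.to_crs(epsg=2263)            #NY State Plane (feet), so the buffer distance is in feet
not_poly = ~gdf.geometry.geom_type.isin(["Polygon","MultiPolygon"])
gdf.loc[not_poly,"geometry"] = gdf.loc[not_poly,"geometry"].buffer(50) #small 50 foot buffer
gdf.to_file("nyc_gangs_buffered.shp")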

Drop me a note if you use this data, I’d be interested in your analyses! That is why I am sharing the data for others to play with 🙂

On overlapping error bars in charts

Andrew Gelman posted an example graph the other day in a blog post which showed trends over time in measures of smiling in high school yearbook photos.

Surprisingly, Andrew did not make a comment on the error bars in the graph. Error bars with cross hairs are often distracting in the plot. In the example graph it is quite bad, in that they perfectly overlap, so the ends are very difficult to disentangle. Here I will suggest some alternatives.

I simulated data that approximately captures the same overall trends, and replicated the initial chart in SPSS.

First, a simple solution with only two groups is to use semi-transparent areas instead of the error bars.

This makes it quite easy to see the overlap and non-overlap of the two groups. It will even print out nicely in black and white. In the end, this chart is over-complicated by separating out genders. Since each follows the same trend, with females just having a constant level shift over the entire study period, there is not much point in showing each in a graph. A simpler solution would be to just pool them together (presumably the error bars would be smaller by pooling as well). The advice here still applies though, and the areas are easier to visualize than the discontinuous error bars.
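
Here is a rough matplotlib analogue of the semi-transparent area idea (the charts in this post were made in SPSS, and the numbers below are simulated just for illustration):

#Semi-transparent error bands instead of discrete error bars (simulated data)
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1905, 2006, 10)
for label, base in [("Female", 0.60), ("Male", 0.45)]:
  mean = base + 0.002*(years - 1905)   #made-up upward trend in smiling scores
  se = np.full(mean.shape, 0.03)
  plt.plot(years, mean, label=label)
  plt.fill_between(years, mean - 2*se, mean + 2*se, alpha=0.3) #the error band
plt.legend()
plt.show()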

For more complicated plots with more groups, I would suggest doing small multiples.

While it is harder now to see the exact overlap between groups, we can at least visually assess the trends within each group quite well. In the original it is quite a bit of work to figure out the differences between groups and keep the within group comparisons straight. Since the trends are so simple it is not impossible, but with more variable charts it would be quite a bit of work.

For instances in which a trend line is not appropriate, you can dodge the individual error bars on the x-axis so that they do not perfectly overlap. This is the same principle as in clustered bar charts, just with points and error bars instead of bars.

Here I like using just the straight lines (a tip taken from Andrew Gelman). I find the serif part of the I-beam-like error bars distracting, and they make it necessary to separate the lines further. Using just the lines you can pack many more into a small space, like caterpillar plots of many random effects.
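
And a sketch of the dodging idea in matplotlib, with plain capless error bars (again, the original charts are SPSS, and these values are made up):

#Dodge groups on the x axis and drop the error bar caps (made-up values)
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(5)
groups = {"A":[2,3,4,4,5], "B":[3,3,5,6,6], "C":[1,2,2,3,4]}
for off, (label, means) in zip([-0.15, 0, 0.15], groups.items()):
  plt.errorbar(x + off, means, yerr=0.8, fmt='o', capsize=0, label=label) #capsize=0 removes the serifs
plt.legend()
plt.show()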

Here is a copy of the SPSS syntax used to generate these graphs.

Using and Making Cumulative Probability Charts

Stephen Few had a recent post critiquing an evaluation of a particular data visualization. Long story short, the experiment asked questions like “What is the probability that X is above 5?”, and showed the accuracy based on mean-plus-error-bar charts, histogram-like visualizations, and animated visualizations showing random draws.

It is always the case in data viz that some charts make particular questions easier to answer. This question – what is the probability a value is above X – is one that traditional histograms or error bar charts are not well suited for. But there is an alternative I don’t see used very often, the cumulative probability chart, that is well suited to answer that question.

It is a totally reasonable question to ask as well. For one example, when I was a crime analyst I used this chart to show the time in-between shootings. Many shootings are retaliatory, so I was interested in saying, if a shooting happened on Sunday, how long should the PD be on guard after that initial shooting. Do most retaliatory shootings happen within hours, days, or weeks of a prior shooting? This is a hard question to answer with histograms, but is easier to answer with cumulative probability plots.

Here is that example chart for time-in-between shootings:

Although this chart is not regularly used, it is really easy to explain how to interpret it. For example, at time equals 7 days (on the X axis), the probability that a shooting would have occurred is under 60%. In my opinion, it is easier to explain this chart than a histogram to a lay audience.

The chart is often not a canned option in software, but it takes a very simple set of steps to produce the right ingredients – and then you can use a typical line chart. Those steps, generically, are:

  • sort the data
  • rank the data (1 for the lowest value, 2 for the second lowest value, etc.)
  • calculate rank/(total sample size) – call this Prop
  • plot the data on the X axis, and Prop on the Y axis

This can be easily done in any software, but you can download the excel spreadsheet used to make the above chart here.
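
For those who prefer code to spreadsheets, here is a minimal pandas sketch of those same steps (the times-between-shootings values are made up for illustration):

#Cumulative probability chart: sort, rank, rank/N, then a line chart
import pandas as pd

days = pd.Series([0.5,1,1,2,3,3,5,7,10,14,21,30]) #hypothetical days in-between shootings
days = days.sort_values().reset_index(drop=True)
n = len(days)
prop = [(i + 1)/float(n) for i in range(n)]       #rank divided by total sample size

ax = pd.DataFrame({"Days":days, "Prop":prop}).plot(x="Days", y="Prop", legend=False)
ax.set_ylabel("Cumulative probability")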

A variant of this chart often used in crime analysis puts the proportion of places on the X axis and the cumulative proportion of crime on the Y axis. E.g. Pareto’s 80/20 rule – or 50/1 rule – or whatever. The chart makes it easy to pick whatever cut-offs you want. If you have your spatial units of analysis in one column, and the total number of crimes in a second column, the procedure to produce this chart is (see the sketch after the list):

  • sort the data descending by number of crimes
  • rank the data
  • calculate rank/(total sample size) – this equals the proportion of all spatial units – call this PropUnits
  • calculate the cumulative number of crimes – call this Cum_Crime
  • calculate Cum_Crime/(Total Crime) – this equals the proportion of all crimes – call this PerCumCrime
  • plot PerCumCrime on the Y axis and PropUnits on the X axis.
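
A pandas version of those steps might look like this (the units and crime counts are made up):

#Cumulative crime concentration: sort descending, rank, and cumulative share
import pandas as pd

df = pd.DataFrame({"Unit":["A","B","C","D","E"], "Crimes":[50,20,10,5,0]}) #hypothetical spatial units
df = df.sort_values("Crimes", ascending=False).reset_index(drop=True)
df["PropUnits"] = [(i + 1)/float(len(df)) for i in range(len(df))]  #proportion of spatial units
df["PerCumCrime"] = df["Crimes"].cumsum()/float(df["Crimes"].sum()) #cumulative proportion of crime

ax = df.plot(x="PropUnits", y="PerCumCrime", legend=False)
ax.set_xlabel("Proportion of units")
ax.set_ylabel("Cumulative proportion of crime")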

See the third sheet of the excel file for a hypothetical example. This pattern basically happens in all aspects of criminal justice. That is, the majority of the bad stuff is happening among a small number of people/places. See this example from William Spelman showing places, victims, and offenders.

We can see there that 10% of the victims account for 40% of all victimizations etc.

Adding a command button to a toolbar in ArcGIS

I’m currently teaching a graduate level class in Crime Mapping using ArcGIS. I make my own tutorials from week to week, and basically sneak in generic pro-tips for using the software while students are doing other regular types of analyses. I can only subject my students to so much though – but here is one I have found useful, adding a regularly used button to a toolbar.

I use CrimeStat to generate kernel densities from point data, so as of V10 whenever I want to make a classified raster map I get this error:

In V9 it used to just do this for you automatically :(

I typically make classified raster maps simply because I think they look nicer than continuous ones. My continuous ones always look fuzzy, whereas with discrete cuts you can focus attention on particular hot spot areas. It is arbitrary for sure – but that is something we need to learn to live with when making maps.

So in class I had students open ArcToolBox, navigate down the tree, and find the Calculate Statistics tool for rasters. In my personal set up though I do this enough that I added the button to my toolbar. So first, go to the file menu and in customize -> toolbars make sure you have the spatial analyst toolbar selected. (Here is a kernel density grd file to follow along with if you want.)
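
As an aside, if you would rather script it than click through ArcToolBox, the same tool can be run from Python via arcpy (the raster path here is just a placeholder):

#Run the Calculate Statistics tool from Python instead of the toolbar
import arcpy
arcpy.CalculateStatistics_management(r"C:\Path\To\KernelDensity.grd") #placeholder raster path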

Now in the right hand most edge of the new spatial analyst toolbar, left click on the little downward pointing arrow and select Customize. (Sorry, my toolbar is a bit crowded!)

In the customize window that pops up, select the Commands tab. Now in this window you can select any particular command and then drag it onto any toolbar. Here I go to Data Management in the left hand categories area, and then scroll down till I find the Calculate Statistics button.

Then I left click on the Calculate Statistics row, hold down the mouse button, and drag it to my toolbar.

Now you are done, and ArcGIS saves this button on the toolbar when making future maps. You can change the icon if you want, but there are tooltips when hovering over the icon (so even if you have multiple hammers on your toolbars it only takes a second to browse between them).