Some GIS data scraping adventures: Banksy graffiti and gang locations in NYC

I’ve recently scraped some geographic data that I may use in my graduate level GIS course. I figured I would share with everyone, and take some time to describe for others how I scraped the data.

So to start, if you read an online article and it has a webmap with some GIS data in it – the data exists somewhere. It won’t always be the case that you can actually download the data, but for the most current and popular interactive mapping tools, the data is often available if you know where to look.

For example, a while ago I asked on the GIS StackExchange site how to download the point data in this NYC homicide map from the Times. I had emailed the reporters multiple times and they did not respond. A simple solution the answerers suggested was to use the browser's developer tools to see what was being loaded when I refreshed the page. It turned out the map was being populated by two simple text files (1,2).

It may be an interesting project to see how data compiled via news stories compares to official data, which NYC recently released going back to 2006. Especially since such crowdsourced news datasets are used for other things, like counting mass shootings.

Getting the underlying data for the two example maps below is a slightly different process – but just as easy. Many current webmaps use geojson files as the backend. For the two examples below, I just looked at the HTML source for the website and searched for json data formats – links that specify "js" or "json" extensions. If you click through those external json links you can see if they have the data.
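
Once you spot such a link, a minimal Python sketch to pull it down and poke around might look like below. The URL and the geojson structure are hypothetical stand-ins – adjust to whatever the site actually serves.

import json
import urllib.request

#hypothetical endpoint spotted in the page source or the network tab
url = "https://example.com/mapdata/points.json"
data = json.loads(urllib.request.urlopen(url).read())

#print the top-level keys (or first few records) to see what the file contains
print(data.keys() if isinstance(data, dict) else data[:3])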

The other popular map type comes from ESRI. You can typically find an ESRI server populating the map, and if the website has, say, a parcel data lookup, you can often find an ESRI geocoding server (see here for one example of using an ESRI geocoding API). Unfortunately, those maps do not always have exposed data. Sometimes what looks like vector data is actually just static PNG tiles. Council Districts in this Dallas map are an example. If you dig deep enough you can find the PNG tiles for the council districts, but that does not do anyone much good. Pretty much all of those layers are available for download from other sources though. A similar thing happens with websites with crime reports, such as RAIDS Online or CrimeReports.com. They intentionally build the web map so you cannot scrape the data.

That said, before we go further – it should go without saying that you should not steal/plagiarize people's articles or simply rip off their graphics. Conducting new analysis with the publicly available data, though, seems fair game to me.

Banksy Taggings in NYC

There was a recent stink in the press about Kim Rossmo and company using geographic offender profiling to identify the likely home location of the popular graffiti artist Banksy. Here is the current citation of the journal article for those interested:

Hauge, M. V., Stevenson, M. D., Rossmo, D. K., and Le Comber, S. C. (2016). Tagging banksy: using geographic profiling to investigate a modern art mystery. Journal of Spatial Science, pages 1-6. doi:10.1080/14498596.2016.1138246

The article uses data from Britain, so I looked to see if his taggings in other places were available. I came across this article showing a map of locations in New York City. So I searched for where the data was coming from, and found the json file that contains the point data here. I built a quick Excel spreadsheet to parse the data, and you can download that spreadsheet here.
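
If you would rather script that parsing step than do it in Excel, a small sketch that flattens geojson point features to a CSV might look like this. The URL and field names are hypothetical placeholders – check the actual json for its structure.

import csv
import json
import urllib.request

url = "https://example.com/banksy_points.json" #hypothetical stand-in for the real link
data = json.loads(urllib.request.urlopen(url).read())

with open("banksy_points.csv", "w", newline="") as f:
    wr = csv.writer(f)
    wr.writerow(["lon", "lat", "title"])
    for feat in data.get("features", []):
        lon, lat = feat["geometry"]["coordinates"]
        wr.writerow([lon, lat, feat.get("properties", {}).get("title", "")])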

Gang Locations

This article posts a set of gang territories in NYC. This is pretty unique – I am unfamiliar with any other public data source that identifies gang territories. So I figured it would be a potential fun project for students in my GIS course – for instance in overlaying with the 311 graffiti data.

Again the data at the backend is in json format, and can be found here. Converting this data to a shapefile is a bit challenging, as it has points, lines, and polygons all in the same file. What I did was buffer the lines and points by a small amount so I could stuff them all in one shapefile. A zip file of that shapefile can be downloaded here.
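
If you want to replicate that buffer trick in code, here is a minimal sketch using geopandas. The filename and the buffer distance are hypothetical placeholders (the distance is in the layer's units, so reproject first if you want a buffer in meters).

import geopandas as gpd

gdf = gpd.read_file("gang_territories.geojson") #hypothetical filename

def to_poly(geom, buf=0.0001):
    #buffer points and lines into small polygons, leave polygons alone
    if geom.geom_type in ("Point", "MultiPoint", "LineString", "MultiLineString"):
        return geom.buffer(buf)
    return geom

gdf["geometry"] = gdf.geometry.apply(to_poly)
gdf.to_file("gang_territories.shp") #now all polygons, so one shapefile works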

Drop me a note if you use this data – I'd be interested in your analyses! That is why I am sharing it for others to play with 🙂

On overlapping error bars in charts

Andrew Gelman posted an example graph the other day in a blog post which showed trends over time in measures of smiling in high school yearbook photos.

Surprisingly, Andrew did not comment on the error bars in the graph. Error bars with cross hairs are often distracting in a plot. In the example graph it is quite bad: they perfectly overlap, so the ends are very difficult to disentangle. Here I will suggest some alternatives.

I simulated data that approximately captures the same overall trends, and replicated the initial chart in SPSS.

First, a simple solution with only two groups is to use semi-transparent areas instead of the error bars.

This makes it quite easy to see the overlap and non-overlap of the two groups. It will even print out nicely in black and white. In the end, this chart is over-complicated by separating out genders. Since each follows the same trend, with females just having a constant level shift over the entire study period, there is not much point in showing each in the graph. A simpler solution would be to just pool them together (presumably the error bars would be smaller by pooling as well). The advice here still applies though, and the areas are easier to visually parse than the discontinuous error bars.

For more complicated plots with more groups, I would suggest doing small multiples.

While it is harder now to see the exact overlap between groups, we can at least visually assess the trends within each group quite well. In the original it is quite a bit of work to figure out the differences between groups and keep the within group comparisons straight. Since the trends are so simple it is not impossible, but with more variable charts it would be quite a bit of work.

For instances in which a trend line is not appropriate, you can dodge the individual error bars on the x-axis so that they do not perfectly overlap. This is the same principle as in clustered bar charts, just with points and error bars instead of bars.

Here I like using just straight lines (a tip taken from Andrew Gelman). I find the serifs on I-beam-like error bars distracting, and they make it necessary to separate the lines further. Using just the lines you can pack many more into a small space, as in caterpillar plots of many random effects.

Here is a copy of the SPSS syntax used to generate these graphs.

Using and Making Cumulative Probability Charts

Stephen Few had a recent post critiquing an evaluation of a particular data visualization. Long story short, the experiment asked questions like "What is the probability that X is above 5?", and compared accuracy across mean + error bar charts, histogram-like visualizations, and animated visualizations showing random draws.

It is always the case in data viz. that some charts make particular questions easier to answer. This is one question – what is the probability a value is above X – for which traditional histograms and error bar charts are not well suited. But there is an alternative I don't see used very often, the cumulative probability chart, that is well suited to answer that question.

It is a totally reasonable question to ask as well. For one example, when I was a crime analyst I used this chart to show the time in-between shootings. Many shootings are retaliatory, so I was interested in saying, if a shooting happened on Sunday, how long should the PD be on guard after that initial shooting? Do most retaliatory shootings happen within hours, days, or weeks of a prior shooting? This is a hard question to answer with histograms, but it is easier with cumulative probability plots.

Here is that example chart for time-in-between shootings:

Although this chart is not regularly used, it is really easy to explain how to interpret it. For example, at time equals 7 days (on the X axis), the probability that a shooting would have occurred is under 60%. In my opinion, it is easier to explain this chart than a histogram to a lay audience.

Producing the chart is often not a canned option in software, but it takes a very simple set of steps to produce the right ingredients – and then you can use a typical line chart. Those steps, generically, are:

  • sort the data
  • rank the data (1 for the lowest value, 2 for the second lowest value, etc.)
  • calculate rank/(total sample size) – call this Prop
  • plot the data on the X axis, and Prop on the Y axis

This can be done easily in any software, and you can download the excel spreadsheet used to make the above chart here.
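
For example, here is a minimal sketch of those four steps in Python, using a hypothetical set of times between shootings (assuming numpy and matplotlib are available):

import numpy as np
import matplotlib.pyplot as plt

days_between = np.array([0.5, 1, 1, 2, 3, 5, 6, 8, 13, 21]) #hypothetical data

x = np.sort(days_between)                #step 1: sort the data
prop = np.arange(1, len(x) + 1) / len(x) #steps 2 & 3: rank/(total sample size)

plt.step(x, prop, where="post")          #step 4: data on X, Prop on Y
plt.xlabel("Days between shootings")
plt.ylabel("Cumulative probability")
plt.show()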

A variant of this chart often used in crime analysis has the proportion of places on the X axis and the cumulative proportion of crime on the Y axis. E.g. Pareto's 80/20 rule – or 50/1 rule – or whatever. The chart makes it easy to pick whatever cut-offs you want. If you have your spatial units of analysis in one column, and the total number of crimes in a second column, the procedure to produce this chart is (a Python sketch follows the list):

  • sort the data descending by number of crimes
  • rank the data
  • calculate rank/(total sample size) – this equals the proportion of all spatial units – call this PropUnits
  • calculate the cumulative number of crimes – call this Cum_Crime
  • calculate Cum_Crime/(Total Crime) – this equals the proportion of all crimes – call this PerCumCrime
  • plot PerCumCrime on the Y axis and PropUnits on the X axis.
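
And the same steps in Python, again with made-up numbers for the crime counts per spatial unit:

import numpy as np
import matplotlib.pyplot as plt

crimes = np.array([50, 20, 10, 5, 5, 3, 2, 2, 2, 1]) #hypothetical crimes per unit

crimes = np.sort(crimes)[::-1]                            #sort descending
prop_units = np.arange(1, len(crimes) + 1) / len(crimes)  #proportion of spatial units
per_cum_crime = np.cumsum(crimes) / crimes.sum()          #cumulative proportion of crime

plt.plot(prop_units, per_cum_crime)
plt.xlabel("Proportion of places")
plt.ylabel("Cumulative proportion of crime")
plt.show()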

See the third sheet of the excel file for a hypothetical example. This pattern basically happens in all aspects of criminal justice. That is, the majority of the bad stuff is happening among a small number of people/places. See this example from William Spelman showing places, victims, and offenders.

We can see there that 10% of the victims account for 40% of all victimizations etc.

Adding a command button to a toolbar in ArcGIS

I’m currently teaching a graduate level class in Crime Mapping using ArcGIS. I make my own tutorials from week to week, and basically sneak in generic pro-tips for using the software while students are doing other regular types of analyses. I can only subject my students to so much though – but here is one tip I have found useful: adding a regularly used button to a toolbar.

I use CrimeStat to generate kernel densities from point data, and as of V10 whenever I want to make a classified raster map I get this error:

In V9 it used to just do this for you automatically :(.

I typically make classified raster maps simply because I think they look nicer than continuous ones. My continuous ones always look fuzzy, whereas with discrete cuts you can focus attention on particular hot spot areas. It is arbitrary for sure – but that is something we need to learn to live with when making maps.

So in class I had students open ArcToolBox, navigate down the tree, and find the Calculate Statistics tool for rasters. In my personal set up though I do this enough that I added the button to my toolbar. So first, go to the file menu and in customize -> toolbars make sure you have the spatial analyst toolbar selected. (Here is a kernel density grd file to follow along with if you want.)

Now in the right hand most edge of the new spatial analyst toolbar, left click on the little downward pointing arrow and select Customize. (Sorry, my toolbar is a bit crowded!)

In the customize window that pops up, select the Commands tab. Now in this window you can select any particular command and then drag it onto any toolbar. Here I go to Data Management in the left hand categories area, and then scroll down till I find the Calculate Statistics button.

Then I left click on the Calculate Statistics row, hold down the mouse button, and drag it to my toolbar.

Now you are done, and ArcGIS saves this button on the toolbar when making future maps. You can change the icon if you want, but there are tooltips when hovering over the icon (so even if you have multiple hammers on your toolbars it only takes a second to browse between them).
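
As an aside, if you ever script your map production, I believe the same tool can be called from Python via arcpy – a minimal sketch, where the raster path is a hypothetical placeholder:

import arcpy

#calculate statistics for the kernel density raster so it can be classified
arcpy.CalculateStatistics_management(r"C:\Data\KernelDensity.grd")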

Making and Exploring Crime Networks (Access and Excel)

I’ve been doing quite a bit of stuff with gang networks lately at work. Networks are a total PIA though to create and do data manipulation on in traditional spreadsheets and statistics tools, so I figured I would blog about some of my attempts to ease the pain for fellow crime analysts.

First I will show how to create an edge list in Access from the way a traditional police RMS database is set up. Second I will show a trick about exploring people and gangs by creating a dynamic lookup in Excel. You can download the Access Database I used and the Excel spreadsheet here to follow along.

Making an Edge List in Access

I’ve previously shown how to make an edgelist in SPSS. I’ll cast the net wider and show how to do this in Access though.

In a nutshell, an edge list is a table of the form:

Person A, Person B
Person B, Person C
Person C, Person D

Where being in the same row shows some type of connection between the two persons, e.g. Person A is connected to Person B. In police databases the connections most often of interest are co-offending (e.g. two people were arrested for the same incident) or being stopped together (e.g. in the same car or during the same field interrogation).

Typically police databases will have a table that lists a common incident identifier, along with persons associated with that incident and their involvement. Here is a screen shot of the simple example I made in an Access Database to mimic this which I named IncidentPersons:

So here we can see that for incident 1, Andy Pandy, Sandy Randy, and Candy Dandy are all persons involved. Candy is the victim, and the other two were arrested. This table is always called something different for every PD’s RMS system, but some examples I have come across are crossref and person_exploded. All RMS’s I have seen though have some sort of table like this.

To make an edge list from this table takes some knowledge of SQL, but it can be done in one query. Basically we will be joining a table to itself, and selecting out distinct rows. Here is the most basic SQL query in Access to accomplish this.

SELECT DISTINCT F.PersonID, F.PersonName, S.PersonID, S.PersonName
FROM IncidentPersons AS F INNER JOIN IncidentPersons AS S ON F.IncidentID = S.IncidentID
WHERE F.PersonID < S.PersonID;

To walk through this, we make two aliases of the same original IncidentPersons table, F and S. Then we do an INNER JOIN based on the original incident ID. If we stopped there without the final WHERE clause, we would get pairs of people with themselves, plus duplicate ties of the form A -> B and B -> A. Selecting only instances in which F.PersonID < S.PersonID eliminates those self edges and duplicates. The last part is SELECT DISTINCT instead of SELECT. This makes it so any particular edge is only returned once. (If you deleted DISTINCT in this database, Andy Pandy -> Sandy Randy would be returned twice.)
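
For analysts without Access handy, the same self-join logic is easy to mimic in pandas. Here is a sketch with a small hypothetical stand-in for the IncidentPersons table:

import pandas as pd

#hypothetical stand-in for the IncidentPersons table
ip = pd.DataFrame({
    "IncidentID": [1, 1, 1, 2, 2],
    "PersonID":   [1, 2, 3, 1, 4],
    "PersonName": ["Andy Pandy", "Sandy Randy", "Candy Dandy", "Andy Pandy", "Mandy Handy"],
})

edges = ip.merge(ip, on="IncidentID", suffixes=("_F", "_S"))  #join table to itself
edges = edges[edges["PersonID_F"] < edges["PersonID_S"]]      #drop self edges and A->B/B->A dupes
edges = edges.drop_duplicates(["PersonID_F", "PersonID_S"])   #same role as SELECT DISTINCT
print(edges)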

Running this query we then have:

In practice it will be more complicated because you will want to filter certain connections and add more info. on people into the final edge list. Here I ignore the involvement type, but you may want to only restrict matches to certain co-involvements (since offender-victim is of a different nature than co-offending). You also may want to not just know those connected, but count up the number of times those people are connected. For my work, I have always just limited to co-offending and being stopped together (and haven’t ever worried about the number of ties).

Also, depending on how the database is normalized, people's names will often change or have spelling errors, yet still be linked to the same PersonID. Those different spellings would cause the DISTINCT selection to not work as expected. A workaround is to select based only on the unique PersonIDs without importing the other data, then merge in the person data in an additional query. For gang network analysis you will likely want to merge in gang affiliation (which will probably be in a separate table, not in the RMS). If you are still following along though you can figure that stuff out on your own.

Making an Edge Lookup Table in Excel

So now that I have shown how to make the edge table, what to do with it now? (No excuses – since I gave examples in both SPSS and SQL!) Here I will show a simple trick to explore the network using filtering in Excel.

The edge list itself is often the format needed to import into other network-based software. So you can make a nice network graph using Gephi or whatever. The graph is good for seeing the overall form of the network when it is limited to only a few nodes, but these graphs are typically really complicated, and tools like Gephi aren't very good for drilling down into specific people. Here I will show my simple drilldown solution using Excel.

The network I use for this example is entirely made up; it was simulated using NetworkX (Python), names are random draws from some internet lists of popular baby names and last names (I have already forgotten the source), and dates of birth are random between 1975 and 1997. I also made up a list of 7 gangs (but each person has a 9/16 chance of being assigned to no gang).

So starting with an edgelist, here is a screenshot of my made up edge list excel table.

The problem with this format is that if I filter the Id.1 column for 19 (BONNIE BARKER), they could potentially be in the Id.2 column as well, so I would miss edges. A simple solution is to duplicate the data, but with the order of the edges switched. Then when I filter by Id = 19, I will get all possible Bonnie Barker edges.

For a simple example of how to do this on a small table, if you start with:

17,19
18,19
19,20
19,21

If you filter the first column by 19, you will eliminate the 19’s in the second column. So just make a new table that has the ID’s reversed:

19,17
19,18
20,19
21,19

And then stack the two tables on top of one another

17,19 |
18,19 |  Table 1
19,20 |
19,21 |
19,17 +
19,18 +  Table 2
20,19 +
21,19 +

So now if you filter the first column by 19 you get all four of 19's connections. This is just three copy-pastes in Excel to go from the original edge list to this table.
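
The same duplicate-and-stack trick is a couple of lines in pandas, if you prefer to prep the table outside of Excel (using the small example above):

import pandas as pd

edges = pd.DataFrame({"Id1": [17, 18, 19, 19], "Id2": [19, 19, 20, 21]})
rev = edges.rename(columns={"Id1": "Id2", "Id2": "Id1"})       #swap the two id columns
stacked = pd.concat([edges, rev], ignore_index=True)[["Id1", "Id2"]]
print(stacked[stacked["Id1"] == 19])                           #all four of 19's connections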

Now we can make a filter that dynamically changes based on user input. Here I make a selection cell in the top row: in N1 you put in a person's ID. Then in A2, the formula is =IF(B2=$N$1,1,0). You can then paste this formula down the column, and it always references cell N1 because of the absolute $ modifiers.

Here is a screenshot of my example LookupTable in excel filtering for person 431.

If you update the person ID in N1, then hit the reapply button on the toolbar (or hit Ctrl+Alt+L) to update the filter. Here I updated it to person 382.

The context of why I created this example was to identify people that were connected to gang members but were not themselves in the gang. Basically, have a list to take to officers and say: are you sure this person is not an actual member of the gang? The spreadsheet is then a tool for meetings – if someone asks, who is Raelyn Hatfield connected to?, I can easily update the ID and filter.

You can do this drilldown in the original edge table if you have the IF condition look in both the first and second id columns, but I do it this way because it is easier to see who a person is connected to. You only have to look in one column – you don't have to scan back and forth between two columns to see the connections.

You can also do other aggregations on this table. For instance, if you aggregate using a pivot table and count the number of instances, that is the degree centrality of a person (i.e. the number of different people a person is connected to).
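
Using the stacked table from the pandas sketch above, that count is a one-liner as well:

#each person's degree, i.e. the number of different people they are connected to
degree = stacked.groupby("Id1")["Id2"].nunique()
print(degree)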

If you want to do a drilldown of specific gangs you could use the same logic and build another filter column, but this will duplicate people when they are connected to another person in the same gang. That would be an instance where it might be easier to use just the original edge table.

Maps in inline GPL statements (SPSS)

Here I will go through an example of using inline GPL statements to import map backgrounds in SPSS charts. Here you can download the data and code to follow along with this post. This is different than using maps via VIZTEMPLATE, as I will show.

Note you can also use the graphboard template chooser to make some default maps, but I've never really learned how to make those templates on my own. For example, say I want a map that sets both the color and the transparency of areas based on different attributes. This is not possible with the current selection of map templates that comes with SPSS (V22).

But I figured out some undocumented ways to import maps into inline GPL code, and you can get pretty far with just the possibilities available within the grammar of graphics.

The data I will be using is a regular grid of values across DC. What I calculated was the hour of the day with the most robberies over a long time period (2011 through 2015 data), using a weighted average approach akin to geographically weighted regression. Don't take this too seriously though, as there appear to be some errors in the time fields for the historical DC crime data.

So below I first define a handle to where my data is stored, recode the hour field into a smaller set of bins, and then make a scatterplot.

FILE HANDLE data /NAME = "C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Inline_Maps_GGRAPH".

GET FILE = "data\MaxRobHour.sav".
DATASET NAME MaxRob.

*Basic Scatterplot.
FREQ HourEv.
RECODE HourEv (0 THRU 3 = 1)(11 THRU 19 = 2)(ELSE = COPY) INTO HourBin.
VALUE LABELS HourBin
 1 '0 to 3'
 2 '11 to 19'.

DATASET ACTIVATE MaxRob.
* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=XMetFish[LEVEL=SCALE] YMetFish[LEVEL=SCALE] HourBin
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: XMetFish=col(source(s), name("XMetFish"))
  DATA: YMetFish=col(source(s), name("YMetFish"))
  DATA: HourBin=col(source(s), name("HourBin"), unit.category())
  GUIDE: axis(dim(1), label("XMetFish"))
  GUIDE: axis(dim(2), label("YMetFish"))
  GUIDE: legend(aesthetic(aesthetic.color.exterior), label("HourBin"))
  ELEMENT: point(position(XMetFish*YMetFish), color.exterior(HourBin))
END GPL.

We can do quite a bit to make this map look nicer. Here I change:

  • make the aspect ratio 1 to 1, and set the map limits
  • get rid of the X and Y axis (the particular projected coordinates make no difference)
  • make a nice set of colors based on a ColorBrewer palette and map the color to the interior of the point

And below that is the map it produces.

*Making chart nice, same aspect ratio, colors, drop x & y.
FORMATS HourBin (F2.0).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=XMetFish[LEVEL=SCALE] YMetFish[LEVEL=SCALE] HourBin
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: XMetFish=col(source(s), name("XMetFish"))
  DATA: YMetFish=col(source(s), name("YMetFish"))
  DATA: HourBin=col(source(s), name("HourBin"), unit.category())
  COORD: rect(dim(1,2), sameRatio())
  GUIDE: axis(dim(1), null())
  GUIDE: axis(dim(2), null())
  GUIDE: legend(aesthetic(aesthetic.color.exterior), label("HourBin"))
  SCALE: linear(dim(1), min(389800), max(408000))
  SCALE: linear(dim(2), min(125000), max(147800))
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color."810f7c"),("2",color."edf8fb"),("20",color."bfd3e6"),("21",color."9ebcda"),
         ("22",color."8c96c6"),("23",color."8856a7")))
  ELEMENT: point(position(XMetFish*YMetFish), color.interior(HourBin))
END GPL.

So that is not too shabby a map for just plain SPSS. It is a bit hard to visualize the patterns though, because the circles introduce needless discontinuities in the surface. We can use squares as the shape and experiment to figure out the size needed to fill up each grid cell. Also, a pro-tip when making choropleth maps with many areas: light outlines often look nicer than black ones.

*Alittle nicer, squares, no outline.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=XMetFish[LEVEL=SCALE] YMetFish[LEVEL=SCALE] HourBin
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: XMetFish=col(source(s), name("XMetFish"))
  DATA: YMetFish=col(source(s), name("YMetFish"))
  DATA: HourBin=col(source(s), name("HourBin"), unit.category())
  COORD: rect(dim(1,2), sameRatio())
  GUIDE: axis(dim(1), null())
  GUIDE: axis(dim(2), null())
  GUIDE: legend(aesthetic(aesthetic.color.exterior), label("HourBin"))
  SCALE: linear(dim(1), min(389800), max(408000))
  SCALE: linear(dim(2), min(125000), max(147800))
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color."810f7c"),("2",color."edf8fb"),("20",color."bfd3e6"),("21",color."9ebcda"),
         ("22",color."8c96c6"),("23",color."8856a7")))
  ELEMENT: point(position(XMetFish*YMetFish), color.interior(HourBin), shape(shape.square), size(size."9.5"), 
           transparency.exterior(transparency."1"))
END GPL.

Again, looking pretty good for just a map in plain SPSS. With the larger squares it is easier to clump together areas with similar patterns for the peak robbery time. The city never sleeps in Georgetown, it appears. A few of the polygons are very hard to see on the edge of DC though, so we will add in the outline. See the SOURCE: mapsrc, DATA: lon*lat, and ELEMENT: polygon lines for how this is done. The "DCOutline.smz" is a map template file created by SPSS.

*Now include the outline.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=XMetFish[LEVEL=SCALE] YMetFish[LEVEL=SCALE] HourBin
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: XMetFish=col(source(s), name("XMetFish"))
  DATA: YMetFish=col(source(s), name("YMetFish"))
  DATA: HourBin=col(source(s), name("HourBin"), unit.category())
  SOURCE: mapsrc = mapSource(file("C:\\Users\\andrew.wheeler\\Dropbox\\Documents\\BLOG\\Inline_Maps_GGRAPH\\DCOutline.smz"))
  DATA: lon*lat = mapVariables(source(mapsrc))
  COORD: rect(dim(1,2), sameRatio())
  GUIDE: axis(dim(1), null())
  GUIDE: axis(dim(2), null())
  GUIDE: legend(aesthetic(aesthetic.color.exterior), label("HourBin"))
  SCALE: linear(dim(1), min(389800), max(408000))
  SCALE: linear(dim(2), min(125000), max(147800))
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color."810f7c"),("2",color."edf8fb"),("20",color."bfd3e6"),("21",color."9ebcda"),
         ("22",color."8c96c6"),("23",color."8856a7")))
  ELEMENT: point(position(XMetFish*YMetFish), color.interior(HourBin), shape(shape.square), size(size."9.5"), 
           transparency.exterior(transparency."1"))
  ELEMENT: polygon(position(lon*lat))
END GPL.

Now we have a bit more of a reference. The really late at night area appears to be north of Georgetown. The reason I figured this was even possible is that although mapSource is not documented in the GPL reference guide, there is an example using it with the project function (see page 194).

Now, if I were only making one map this isn't really much of a help – I would just export the data values, make the map in ArcGIS, and be done with it. But one of the things that is hard to do in a GIS is make small multiple maps. That is something we can do fairly easily in stat. software though. For an example, here I make a random map to compare with the observed patterns. The grammar automatically recognizes lon*lat*Type and replicates the background outline across each panel. I also change the size of the overall plot using PAGE statements. I typically just experiment until it looks nice.

*Can use the outline to do small multiples.
COMPUTE HourRand = TRUNC(RV.UNIFORM(0,24)).
RECODE HourRand (0 THRU 3 = 1)(4 THRU 19 = 2)(ELSE = COPY).
VARSTOCASES 
  /MAKE Hour FROM HourBin HourRand
  /INDEX Type.
VALUE LABELS Type 1 'Observed' 2 'Random'.

*Small multiple.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=XMetFish YMetFish Hour Type
    MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(1000px,500px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: XMetFish=col(source(s), name("XMetFish"))
  DATA: YMetFish=col(source(s), name("YMetFish"))
  DATA: Hour=col(source(s), name("Hour"), unit.category())
  DATA: Type=col(source(s), name("Type"), unit.category())
  SOURCE: mapsrc = mapSource(file("C:\\Users\\andrew.wheeler\\Dropbox\\Documents\\BLOG\\Inline_Maps_GGRAPH\\DCOutline.smz"))
  DATA: lon*lat = mapVariables(source(mapsrc))
  COORD: rect(dim(1,2), sameRatio(), wrap())
  GUIDE: axis(dim(1), null())
  GUIDE: axis(dim(2), null())
  GUIDE: axis(dim(3), opposite())
  GUIDE: legend(aesthetic(aesthetic.color.exterior), label("HourBin"))
  SCALE: linear(dim(1), min(389800), max(408000))
  SCALE: linear(dim(2), min(125000), max(147800))
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color."810f7c"),("2",color."edf8fb"),("20",color."bfd3e6"),("21",color."9ebcda"),
         ("22",color."8c96c6"),("23",color."8856a7")))
  ELEMENT: point(position(XMetFish*YMetFish*Type), color.interior(Hour), shape(shape.square), size(size."8"), 
           transparency.exterior(transparency."1"))
  ELEMENT: polygon(position(lon*lat*Type))
  PAGE: end()
END GPL.

We can see that this extreme amount of clustering is clearly not random.

This example works out quite nicely because the micro-level areas are a regular grid, so I can simulate the look of a choropleth map by just using square point markers. Unfortunately, I was not able to figure out how to merge a map file with an ID to map areas like you can in VIZTEMPLATE. You can see some of my attempts in the attached code. You can however have multiple mapSource statements, so you could import say a street network, rivers, and parks, and make a nice background map right in SPSS. Hopefully IBM updates the documentation so I can figure out how to make a choropleth map in inline GPL statements.

Dropbox links blocked in China – just email me

A bit of material I share on the blog is hosted via Dropbox. A current student of mine mentioned that Dropbox links are blocked in China. So when he was home on break he did not have access to materials I linked to on my site. He mentioned Dropbox links have been blocked for around a year (although – his words – youtube and facebook have been blocked for years).

If I post something on the blog, but you do not have access to it, always feel free to send me an email. I will send the materials directly. My email is available via my About page, or on my CV.

It is unfortunate Dropbox links are blocked in China, but I see no reason to change services because of this – since any service I could switch to would potentially share the same fate as Dropbox.


Turning SPSS data into Python data

Previously I blogged about how to take Python data and turn it back into SPSS data. Here we are going to do the opposite — turn SPSS data into Python objects. First to start out we will make a simple dataset of three variables.

DATA LIST Free /X Y (2F1.0) Z (A1). 
BEGIN DATA
1 2 A
4 5 B
7 8 C
END DATA.
DATASET NAME Test.
EXECUTE.

To import this data into Python, we need to import the spss module; you can then read cases from the active dataset using the Cursor class. Here is an example grabbing all of the cases.

*Importing all of the data.
BEGIN PROGRAM Python.
import spss
dataCursor = spss.Cursor()
AllData = dataCursor.fetchall()
dataCursor.close()
print AllData
END PROGRAM.

What this prints out is ((1.0, 2.0, 'A'), (4.0, 5.0, 'B'), (7.0, 8.0, 'C')) – a set of nested tuples. You can also grab just one case by replacing dataCursor.fetchall() with dataCursor.fetchone(), in which case it will return a single tuple.

To only grab particular variables, you can pass a list of indices to the spss.Cursor object. Remember, Python indices start at zero, so if you want the first and second variables in the dataset, you need to grab the 0 and 1 indices.

*Only grabbing certain variables.
BEGIN PROGRAM Python.
dataNum = spss.Cursor([0,1])
spNumbers = dataNum.fetchall()
dataNum.close()
print spNumbers
END PROGRAM.

This subsequently prints out ((1.0, 2.0), (4.0, 5.0), (7.0, 8.0)). When grabbing one variable, you may want a flat list of the objects instead of nested tuples. Here I use a list comprehension to turn the resulting tuples for the Z variable into a plain list.

*Converting to a nice list.
BEGIN PROGRAM Python.
dataAlp = spss.Cursor([2])
spAlp = dataAlp.fetchall()
dataAlp.close()
spAlp_list = [i[0] for i in spAlp] #convert to nice list
print spAlp
print spAlp_list
END PROGRAM.

The first print object is (('A',), ('B',), ('C',)), but the second is ['A', 'B', 'C'].

The above code works fine if you know the position of the variable in the file, but if the position can change this won’t work. Here is a one liner to get the variable names of the active dataset and plop them in a list.

*Way to get SPSS variable names.
BEGIN PROGRAM Python.
varList = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
print varList
END PROGRAM.

Now if you have your list of variable names you want, you can figure out the index value. There are two ways to do it, iterate over the list of variable names in the dataset, or iterate over the list of your specified variables. I do the latter here (note this will result in an error if you supply a variable name not in the dataset).

*Find the indices of specific variables.
BEGIN PROGRAM Python.
LookVars = ["X","Z"]
VarInd = [varList.index(i) for i in LookVars]
print VarInd
END PROGRAM.

Now you can just supply VarInd above as the argument to spss.Cursor to grab those variables. Here I wrapped it all up in a function.

*Easy function to use.
BEGIN PROGRAM Python.
import spss
def AllSPSSdat(vars):
  if vars is None:
    varNums = range(spss.GetVariableCount())
  else:
    allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
    varNums = [allvars.index(i) for i in vars]
  data = spss.Cursor(varNums)
  pydata = data.fetchall()
  data.close()
  return pydata
END PROGRAM.

You can either supply a list of variables or None, in the latter case all of the variables are returned.

BEGIN PROGRAM Python.
MyDat = AllSPSSdat(vars=["Y","Z"])
print MyDat
END PROGRAM.

This set of nested tuples is then pretty easy to convert to other Python objects. pandas data frames, numpy arrays, and NetworkX objects are all one-liners. Here is turning the entire dataset into a pandas data frame.

*Turn into pandas data frame.
BEGIN PROGRAM Python.
import pandas as pd
MyDat = AllSPSSdat(vars=None)
allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
PanDat = pd.DataFrame(list(MyDat),columns=allvars)
print PanDat
END PROGRAM.

Which prints out:

   X  Y  Z 
0  1  2  A 
1  4  5  B 
2  7  8  C
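
A numpy array is just as simple for the numeric variables – a sketch assuming numpy is installed for SPSS's Python (including the string variable Z would force a string dtype):

*Turn numeric variables into a numpy array.
BEGIN PROGRAM Python.
import numpy as np
NumDat = np.array(AllSPSSdat(vars=["X","Y"])) #nested tuples to a 2d array
print NumDat
END PROGRAM.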

Blogging in review 2015

For some meta commentary on the blog itself: the blog has continued to grow, surpassing 100,000 cumulative site views since its inception in December 2011 (according to the stats that WordPress keeps). I added a total of 50 posts & pages in 2015, for a total of 170. The monthly growth is shown below, and it now appears linear. (The data fell a few days short of the 1st, so December ended up cracking 4,000 site views in total.)

Pretty much all of my top posts are SPSS related, and were posted in years prior to 2015. Currently I average around 140 views per day, but they are spread out over many pages. Below is a table of the average views per day for my most popular pages. Note that the current daily views of my home page are much higher than the listed average, since its denominator dates back to when I created the blog. (Figures as of 12/28/15, so a few days short of the new year.)

Top Pages
Rank | Title | Date Posted | Views | Days Since Posted | Avg Views Per Day | URL
1 | Home page / Archives | 12/15/11 | 20048 | 1474 | 13.60 | https://andrewpwheeler.wordpress.com/
2 | Odds Ratios NEED To Be Graphed On Log Scales | 10/26/13 | 6460 | 793 | 8.15 | https://andrewpwheeler.wordpress.com/2013/10/26/odds-ratios-need-to-be-graphed-on-log-scales/
3 | Comparing continuous distributions of unequal size groups in SPSS | 04/29/12 | 6722 | 1338 | 5.02 | https://andrewpwheeler.wordpress.com/2012/04/29/comparing-continuous-distributions-of-unequal-size-groups-in-spss/
4 | Using the Google Places API in Python | 05/15/14 | 2339 | 592 | 3.95 | https://andrewpwheeler.wordpress.com/2014/5/15/using-the-google-places-api-in-python/
5 | Hacking the default SPSS chart template | 01/03/12 | 5705 | 1455 | 3.92 | https://andrewpwheeler.wordpress.com/2012/01/03/hacking-the-default-spss-chart-template/
6 | Why I feel SPSS (or any statistical package) is better than Excel for this particular job | 03/30/13 | 3786 | 1003 | 3.77 | https://andrewpwheeler.wordpress.com/2013/3/30/why-i-feel-spss-(or-any-statistical-package)-is-better-than-excel-for-this-particular-job/
7 | Avoid Dynamite Plots! Visualizing dot plots with super-imposed confidence intervals in SPSS and R | 02/20/12 | 4868 | 1407 | 3.46 | https://andrewpwheeler.wordpress.com/2012/02/20/avoid-dynamite-plots-visualizing-dot-plots-with-super-imposed-confidence-intervals-in-spss-and-r/
8 | Using sequential case processing for data management in SPSS | 02/18/13 | 3539 | 1043 | 3.39 | https://andrewpwheeler.wordpress.com/2013/2/18/using-sequential-case-processing-for-data-management-in-spss/

So most of the posts go relatively unnoticed, and even when a post does get shared it is at best a few hundred views in the day or two after it is posted. Being on my home page probably gets every post a bit of exposure for a week or two. But looking at the long haul, many of my tutorial SPSS, Python, and R posts get a fair amount of traffic. Or at least enough to continue to motivate me to write more posts!

As always, I don't have a fixed schedule for writing posts, nor any real roadmap of what I plan to blog about. I will say though, doing my tour on the job market it has been interesting to get some recognition for what I post on the blog. It is mainly students who have asked me about it, but I've had some folks mention it at conferences as well.

Using Python to grab Google Street View imagery

Update: I have been asked several times over the years for help on this. In response I have made a simple GUI tool that, given a list of addresses and a download folder location, will download all of the images. Applications include websites listing properties and marketers doing mailings.

The tool costs $300, Check out the CRIME De-Coder Store to purchase.


I am at it again with using Google data. For a few projects I was interested in downloading street view imagery data. It has been used in criminal justice applications as a free source for second hand systematic social observation by having people code aspects of disorder from the imagery (instead of going in person) (Quinn et al., 2014), as estimates of the ambient walking around population (Yin et al., 2015), and examining criminogenic aspects of the built environment (Vandeviver, 2014).

I think it is just a cool source of data though to be honest. See for example Phil Cohen’s Family Inequality post in which he shows examples of auctioned houses in Detroit over time.

Using the Google Street View Image API you can submit either a set of coordinates or an address and have the latest street view image returned locally. This ends up being a bit simpler than my prior examples (such as the street distance API or the places API) because it just returns the image blob – there is no need to parse JSON.

Below is a simple example in Python, using a set of addresses in Detroit that are part of a land bank. This function takes an address and a download location, then saves the resulting jpeg to your folder of choice. I defaulted the image to be 1200×800 pixels.

import urllib, os

myloc = r"C:\Users\andrew.wheeler\Dropbox\Public\ExampleStreetView" #replace with your own location
key = "&key=" + "" #got banned after ~100 requests with no key

def GetStreet(Add,SaveLoc):
  base = "https://maps.googleapis.com/maps/api/streetview?size=1200x800&location="
  MyUrl = base + urllib.quote_plus(Add) + key #added url encoding
  fi = Add + ".jpg"
  urllib.urlretrieve(MyUrl, os.path.join(SaveLoc,fi))

Tests = ["457 West Robinwood Street, Detroit, Michigan 48203",
         "1520 West Philadelphia, Detroit, Michigan 48206",
         "2292 Grand, Detroit, Michigan 48238",
         "15414 Wabash Street, Detroit, Michigan 48238",
         "15867 Log Cabin, Detroit, Michigan 48238",
         "3317 Cody Street, Detroit, Michigan 48212",
         "14214 Arlington Street, Detroit, Michigan 48212"]

for i in Tests:
  GetStreet(Add=i,SaveLoc=myloc)

Dropbox has a nice mosaic view for a folder of pictures – you can view all seven photos here. Here is the 457 West Robinwood Street picture:

In my tests my IP got banned after around 100 images, but you can get a verified Google account which allows 25,000 image downloads per day. Unfortunately the automatic API only returns the most recent image – there is no way to return older imagery nor to know the date-stamp of the current image. (You technically could download the historical data if you know the pano id for the image, but I don't see any way to discover the available pano ids.) Update – as of 2018 there is now a date associated with the image, specifically a year-month, but nothing more specific than that. Not being able to figure out historical pano ids is still a problem as far as I can tell.
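
As far as I can tell, the separate Street View metadata endpoint is the way to grab that year-month date (and the pano id of the image that would be returned) before downloading – a sketch in the same style as the code above:

import json
import urllib.parse
import urllib.request

key = "&key=" + "????????????" #you need a key for this as well
add = "457 West Robinwood Street, Detroit, Michigan 48203"
base = "https://maps.googleapis.com/maps/api/streetview/metadata?location="
meta_url = base + urllib.parse.quote_plus(add) + key
meta = json.loads(urllib.request.urlopen(meta_url).read())
print(meta.get("date"))    #year-month string, e.g. "2019-07"
print(meta.get("pano_id")) #pano id of the image that would be returned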

But this is definitely easier for social scientists wishing to code images, as opposed to working through the online maps. Hopefully the API gets extended to include dates and a second API to return info. on what image dates are available. I'm not sure if Mike Bader's software app is actually in the works, but for computer scientists there is a potential overlap with social scientists to do feature extraction of various social characteristics, in addition to manual coding of the images.


Update: here is a version that works for Python 3+. You now need to have a key – no more getting a few free images before being cut off.

# For Python Versions 3+
# Tested V3.10 on 12/15/2021
import os
import urllib.parse
import urllib.request

myloc = r"??????????????" #replace with your own location
key = "&key=" + "????????????" #you need an actual key now!!

def GetStreet(Add,SaveLoc):
  base = "https://maps.googleapis.com/maps/api/streetview?size=1200x800&location="
  MyUrl = base + urllib.parse.quote_plus(Add) + key #added url encoding
  fi = Add + ".jpg"
  urllib.request.urlretrieve(MyUrl, os.path.join(SaveLoc,fi))

Tests = ["457 West Robinwood Street, Detroit, Michigan 48203",
         "1520 West Philadelphia, Detroit, Michigan 48206",
         "2292 Grand, Detroit, Michigan 48238",
         "15414 Wabash Street, Detroit, Michigan 48238",
         "15867 Log Cabin, Detroit, Michigan 48238",
         "3317 Cody Street, Detroit, Michigan 48212",
         "14214 Arlington Street, Detroit, Michigan 48212"]

for i in Tests:
  GetStreet(Add=i,SaveLoc=myloc)