Geocoding with census data and the Census API

For my online GIS class I have a tutorial on creating an address locator using street centerline data in ArcGIS. Eventually I would like to put all of my class online, but for now I am just sharing that one, as I’ve forwarded it alot recently.

That tutorial used local street centerline data in Dallas that you can download from Dallas’s open data site. It also gives directions on how to use an online ESRI geocoding service — which Dallas has. But what if those are not an option? A student recently wanted to geocode data from San Antonio, and the only street data file they publicly provide lacks the beginning and ending street number.

That data is insufficient to create an address locator. It is also the case that the road data you can download from the census’s web interface lacks this data. But you can download street centerline data with beginning and end addresses from the census from the FTP site. For example here is the url that contains the streets with the address features. To use that you just have to figure out what state and county you are interested in downloaded. The census even has ESRI address locators already made for you using 2012 data at the state level. Again you just need to figure out your states number and download it.

Once you download the data with the begin and ending street numbers you can follow along with that tutorial the same as the public data.

Previously I’ve written about using the Google geocoding API. If you just have crime data from one jurisdiction, it is simple to make a geocoder for just that locality. But if you have data for many cities (say if you were geocoding home addresses) this can be more difficult. An alternative online API to google that does not have daily limits is the Census Geocoding API.

Here is a simple example in R of calling the census API and geocoding a list of addresses.


get_CensusAdd <- function(street,city,state,zip,benchmark=4){
    base <- ""
    soup <- GET(url=base,query=list(street=street,city=city,state=state,zip=zip,format='json',benchmark=benchmark))
    dat <- fromJSON(content(soup,as='text'), simplifyVector=TRUE)
    D_dat <- dat$result$addressMatches
    if (length(D_dat) > 1){
    return(c(D_dat['matchedAddress'],D_dat['coordinates'][[1]])) #error will just return null, x[1] is lon, x[2] is lat
    else {return(c('',NA,NA))}

#now create function to loop over data frame and return set of addresses
geo_CensusTIGER <- function(street,city,state,zip,sleep=1,benchmark=4){
  #make empy matrix
  l <- length(street)
  MyDat <- data.frame(matrix(nrow=l,ncol=3))
  names(MyDat) <- c("MatchedAdd","Lon","Lat")
  for (i in 1:l){
    x <- suppressMessages(get_CensusAdd(street=street[i],city=city[i],state=state[i],zip=zip[i],benchmark=benchmark))
    if (length(x) > 0){
        MyDat[i,1] <- x[1]
        MyDat[i,2] <- x[2]
        MyDat[i,3] <- x[3]
  MyDat$street <- street
  MyDat$city <- city
  MyDat$zip <- zip
  MyDat$state <- state

## Arbitrary dataframe for an exercise
AddList <- data.frame(
  IdNum = c(1,2,3,4,5),
  Address = c("450 W Harwood Rd", "2878 Fake St", "2775 N Collin St", "2775 N Collins St", "Lakewood Blvd and W Shore Dr"),
  City = c("Hurst", "Richardson", "Arlington", "Arlington", "Dallas"),
  State = c("TX", "TX", "TX", "TX", "TX")

test <- geo_CensusTIGER(street=AddList$Address,city=AddList$City,state=AddList$State,zip=rep('',5))

If you check out the results, you will see that this API does not appear to do fuzzy matching. 2775 N Collin St failed, whereas 2775 N Collins St was able to return a match. You can also see though it will return an intersection, but in my tests "/" did not work (so in R you can simply use gsub to replace different intersection types with and). I haven’t experimented with it too much, so let me know if you have any other insight into this API.

I will follow up in another post a python function to use the Census geocoding API, as well as using the Nominatim online geocoding API, which you can use for addresses outside of the United States.

Using the Google Geocoding API with Python

Previously I posted how to use the geopy python library to call the Google geocode API. But somewhere along the way my version of geopy was not working (maybe because the API changed). Instead of figuring out that problem, I just wrote my own function to call the Google API directly. No need to worry about installing geopy.

Part of the reason I blog is so I have notes for myself – I’m pretty sure I’ve rewritten this several times for different quick geocoding projects, but I couldn’t find them this morning when I needed to do it again. So here is a blog post for my own future reference.

Here is the function, it takes as input the full string address. Also I was getting back some null responses by rapid fire calling the API (with only 27 addresses), so I set the function to delay for five seconds and that seemed to fix that problem.

import urllib, json, time
def GoogGeoAPI(address,api="",delay=5):
  base = r""
  addP = "address=" + address.replace(" ","+")
  GeoUrl = base + addP + "&key=" + api
  response = urllib.urlopen(GeoUrl)
  jsonRaw =
  jsonData = json.loads(jsonRaw)
  if jsonData['status'] == 'OK':
    resu = jsonData['results'][0]
    finList = [resu['formatted_address'],resu['geometry']['location']['lat'],resu['geometry']['location']['lng']]
    finList = [None,None,None]
  time.sleep(delay) #in seconds
  return finList

And here is an example use of the function. It returns the formatted address, the latitude and the longitude.

#Example Use
test = r"1600 Amphitheatre Parkway, Mountain View, CA"
geoR = GoogGeoAPI(address=test)
print geoR

This works for a few addresses without an API key. Even with an API key though the limit I believe is 2,500 – so don’t use this to geocode a large list. Also if you have some special characters in your address field this will take more work. For example if you have an & for an intersection I bet this url call will fail. But that should not be too hard to deal with. Also note the terms of service for using the API (which I don’t understand – so don’t ask me!)

I should eventually wrap up all of this google API python code into an extension for SPSS. Don’t hold your breath though for me getting the time to do that.

Here is an update for Python 3+ (the urllib library changed a bit). Also shows how to extract out the postal code.

#Update For Python 3+
#Also includes example parsing out the postal code
import urllib.request, urllib.parse 
import json, time
key = r'???!!!your key here!!!!????'

def GoogGeoAPI(address,api="",delay=3):
  base = r""
  addP = "address=" + urllib.parse.quote_plus(address)
  GeoUrl = base + addP + "&key=" + api
  response = urllib.request.urlopen(GeoUrl)
  jsonRaw =
  jsonData = json.loads(jsonRaw)
  if jsonData['status'] == 'OK':
    resu = jsonData['results'][0]
    post_code = -1
    for i in resu['address_components']:
      if i['types'][0] == 'postal_code':
        post_code = i['long_name'] #not sure if everything always has a long name?
    finList = [resu['formatted_address'],resu['geometry']['location']['lat'],resu['geometry']['location']['lng'],post_code]
    finList = [None,None,None,None]
  time.sleep(delay) #in seconds
  return finList
test = r"1600 Amphitheatre Parkway, Mountain View, CA"
geoR = GoogGeoAPI(address=test,api=key,delay=0)

Online geocoding in R using the NYS GIS server

Previously I wrote a post on using the NYS ESRI geocoding server in python. I recently wrote a function in R to do the same. The base url server has changed since I wrote the Python post, but it is easy to update that (the JSON returned doesn’t change.) This should also be simple to update for other ESRI servers, just change the base variable in the first function. This uses the httr package to get the url and the jsonlite package to parse the response.

#Functions for geocoding using online NYS GIS Esri API,

#getting a single address, WKID 4326 is WGS 1984, so returns lat/lon
get_NYSAdd <- function(address,WKID='4326'){
  base <- ""
  soup <- GET(url=base,query=list(SingleLine=address,maxLocations='1',outSR=WKID,f='pjson'))
  dat <- fromJSON(content(soup,as='text'),simplifyVector=TRUE)$candidates
#looping over a vector of addresses, parsing, and returning a data frame
geo_NYSAdd <- function(addresses,...){
  #make empy matrix
  l <- length(addresses)
  MyDat <- data.frame(matrix(nrow=l,ncol=3))
  names(MyDat) <- c("Address","Lon","Lat")
  for (i in 1:l){
    x <- get_NYSAdd(address=addresses[i],...)
    if (length(x) > 0){
      MyDat[i,1] <- x[,1]
      MyDat[i,2] <- x[,2][1]
      MyDat[i,3] <- x[,2][2]
  MyDat$OrigAdd <- addresses

The first function takes a single address, gets and parses the returning JSON. The second function loops over a list of addresses and returns a data frame with the original addresses, the matched address, and the lat/lon coordinates. I use a loop instead of an apply type function because with the web server you really shouldn’t submit large jobs that it would take along time anyway. The NYS server is free and has no 2,500 limit, but I wouldn’t submit jobs much bigger than that though.

AddList <- c("100 Washington Ave, Albany, NY","100 Washington Ave Ext, Albany, NY",
             "421 New Karner Rd., Albany, NY","Washington Ave. and Lark St., Albany, NY","poop")
GeoAddresses <- geo_NYSAdd(addresses=AddList)

We can compare these to what the google geocoding api returns (using the ggmap package):

googleAddresses <- geocode(AddList,source="google")
GeoAddresses$G_lon <- googleAddresses$lon
GeoAddresses$G_lat <- googleAddresses$lat

And we can see that the nonsense "poop" address was actually geocoded! See some similar related funny results from the google maps geocoding via StackMaps.

We can also see some confusion between Washington Ave. Ext as well. The NYS online server should theoretically have more up to date data than Google, but as above shows it is not always better. To do geocoding well takes some serious time to examine the initial addresses and the resulting coordinates in my experience.

To calculate the great circle distance between the coordinates we can use the spDists function in the sp library.

spDists(x = as.matrix(GeoAddresses[1:4,c("Lon","Lat")]),
        y = as.matrix(GeoAddresses[1:4,c("G_lon","G_lat")]),
        longlat=TRUE,diagonal=TRUE) #distance in kilometers

But really, we should just project the data and calculate the Euclidean distance (see the proj4 library). Note that using the law of cosines is typically not recommended for very small distances, so the last distance is suspect. (For reference I point to some resources and analysis showing how to calculate great circle distances in SPSS on Nabble recently.)

Using the New York State Online Geocoding API with Python

I’ve been very lucky doing geographic analysis in New York state, as the majority of base map layers I need, and in particular streets centerline files for geocoding, are available statewide at the NYS GIS Clearing house. I’ve written in the past how to use various Google API’s for geo data, and here I will show how one can use the NYS SAM Address database and their ESRI online geocoding service. I explored this because Google’s terms of service are restrictive, and the NYS composite locator should be more comprehensive/up to date in matches (in theory).

So first, this is basically the same as with most online API’s (at least in my limited experience), submit a particular url and get JSON in return. You just then need to parse the JSON for whatever info you need. This is meant to be used within SPSS, but the function works with just a single field address string and returns the single top hit in a list of length 3, with the unicode string address, and then the x and y coordinates. (The function is of course a valid python function, so you could use this in any environment you want.) The coordinates are specified using ESRI’s WKID (see the list for projected and geographic coordinate systems). In the code I have it fixed as WKID 4326, which is WGS 1984, and so returns the longitude and latitude for the address. When the search returns no hits, it just returns a list of [None,None,None].

*Function to use NYS geocoding API.
import urllib, json

def ParsNYGeo(jBlob):
  if not jBlob['candidates']:
    data = [None,None,None]
    add = jBlob['candidates'][0]['address']
    y = jBlob['candidates'][0]['location']['y']
    x = jBlob['candidates'][0]['location']['x']
    data = [add,x,y]
  return data

def NYSGeo(Add, WKID=4326):
  base = ""
  wkid = "&maxLocations=1&outSR=4326"
  end = "&f=pjson"
  mid = Add.replace(' ','+')
  MyUrl = base + mid + wkid + end
  soup = urllib.urlopen(MyUrl)
  jsonRaw =
  jsonData = json.loads(jsonRaw)
  MyDat = ParsNYGeo(jsonData)
  return MyDat

t1 = "100 Washington Ave, Albany, NY"
t2 = "100 Washington Ave, Poop"

Out = NYSGeo(t1)
print Out

Empt = NYSGeo(t2)
print Empt

So you can see in the code sample that you need both the street address and the city in one field. And here is a quick example with some data in SPSS. Just the zip code doesn’t return any results. There is some funny results here though in this test run, and yes that Washington Ave. extension has caused me geocoding headaches in the past.

*Example using with SPSS data.
DATA LIST FREE / MyAdd (A100).
"100 Washington Ave, Albany"
"100 Washinton Ave, Albany"
"100 Washington Ave, Albany, NY 12203"
"100 Washington Ave, Albany, NY, 12203"
"100 Washington Ave, Albany, NY 12206"
"100 Washington Ave, Poop"

SPSSINC TRANS RESULT=GeoAdd lon lat TYPE=100 0 0 
  /FORMULA NYSGeo(Add=MyAdd).


Using Python to geocode data in SPSS

This is the first time since I’ve been using SPSS that I have regular access to Python and R programmability in all of the different places I use SPSS (home and multiple work computers). So I’ve been exploring more solutions to use these tools in regular data analysis and work-flows – of course to accomplish things that can not be done directly in native SPSS code.

The example I am going to show today is using geopy, a Python library that places several geocoding API’s all in a convenient set of scripts. So first once geopy is installed you can call Python code within SPSS by placing it within a BEGIN PROGRAM and END PROGRAM blocks. Here is an example modified from geopy’s tutorial.

from geopy import geocoders
g = geocoders.GoogleV3()
place, (lat, lng) = g.geocode("135 Western Ave. Albany, NY")  
a = [place, lat, lng]
print a

Now what we want to do is to geocode some address data that is currently stored in SPSS case data. So here is an example dataset with some addresses in Albany.

DATA LIST LIST ("|") / Address (A100).
135 Western Ave. Albany, NY
Western Ave. and Quail St Albany, NY
325 Western Ave. Albany, NY

Here I will use the handy SPSSINC TRANS function (provided when installing Python programmability – and as of SPSS 22 installed by default with SPSS) to return the geocoded coordinates using the Google API. The geocode function from geopy does not return the data in an array exactly how I want it, so what I do is create my own function, named g, and it coerces the individual objects (place, lat and lng) into an array and returns that.

from geopy import geocoders
def g(a):
  g = geocoders.GoogleV3()
  place, (lat, lng) = g.geocode(a)
  return [place, lat, lng]
print g("135 Western Ave. Albany, NY")

Now I can use the SPSSINC TRANS function to return the associated place string, as well as the latitude and longitude coordinates from Google.

  /FORMULA g(Address).

Pretty easy. Note that (I believe) the Google geocoding API has a limit of 2,500 cases – so don’t go submitting a million cases to be geocoded (use an offline solution for that). Also a mandatory mention should be made of the variable reliability of online geocoding services.