Getting census data over time

A former student recently asked about getting census data over time, in particular for smaller geographies like block groups. My GIS course I teach students the manual way of downloading data year-by-year from the FTP site. That is partially for pedagogical reasons though, I want students to realize the number of variables (there are so many) and how the data is stored by the census for the American Community Survey.

But Census now has a web api, where you can query the data. So if you are familiar with R or python programming, you can get the data in a bit easier fashion. You just need to know the years + census geographies + variables. I have notes on variables I often use for crim research, but going to the FTP site you can find the big documents or the excel templates.

I have honestly avoided these APIs in my workflows for several years, as my experience with the Census geocoding API was quite flaky, but I have not had the same problems with the APIs for querying the data. Here are examples in R (tidycensus library) and python (census library) of downloading several variables over the 2014-2019 span.

#############################
# R code
library(tidycensus)

# sign up for census key#
# https://api.census.gov/data/key_signup.html
census_api_key(key='????yourkeyhere????')

# place to store results and combine them
years <- 2014:2019
res <- vector("list",length(years))
names(res) <- years

# variables that you want
#        Tot Pop     White non-Hisp  FemHeadHouse  FamPoverty
vars <- c('B03001_001','B03002_003','B11003_016','B17010_002')

# loop over years, save data
# could also apply county filter, see help(get_acs)
# using smaller Deleware just for example
for (y in years){
    # download data
    ld <- as.data.frame(get_acs(year = y,
                                geography='cbg',
                                survey='acs5',
                                variables = vars,
                                state="DE"))
    # reshape long to wide
    ld2 <- reshape(ld,
                   idvar="GEOID",
                   timevar="variable",
                   direction="wide",
                   drop=c("NAME","moe"))
    # insert into list and add in year
    res[[y]] <- ld2
    res[[y]]$year <- y
}

# Combining the data frames together for final analysis
combo <- do.call("rbind",res)
head(combo) # can see B03001_001 is missing for block groups
summary(combo)
#############################

So in R you can ask for a variable, but if it is not available you will just get missing. So you need to make sure the variables you ask for are available over the time span.

The python census library will just straight up give you an error if the variable is not available. Also you need to specify E/M estimates, not just the base variable.

#############################
# Python code

from census import Census
import pandas as pd

key = '????yourkeyhere????'
c = Census(key)
# will get error with unknown variable
# need to specify E/M for estimate or margin of error
vars = ['B03002_003E','B11003_016E','B17010_002E']
res = []

for y in range(2014,2019+1):
    # '10' is Delaware, first '*' is county, second '*' is specific
    # geoid for a block group
    lk = c.acs5.state_county_blockgroup(vars, '10', "*", "*",year=y)
    ld = pd.DataFrame(lk)
    ld['year'] = y
    res.append(ld)

combo = pd.concat(res,axis=0)
combo.head()
#############################

(Initial post had an error not passing in year into the download function, now the two results are the same.)

For making reproducible scripts, instead of putting your API key into the code, a common way is to create a config file with the API key (don’t upload the config file to github), and then read in the config file into your script. (Another way is to use environment variables as secrets, I think the config is easier for people to grok though.)

Another friend recently referred me to requests-cache library. It is a good idea to only download the data locally once, then use that local data. No need to requery the data every time you update your code. Easiest approach is to just have a special script to download the data and save it (in a database or csv files would work here), and then later scripts work with that local data.

Downloading geo files from Census FTP using python

I was working with some health data that only has MSA identifiers the other day. Not many people seem to know about the US Census’s FTP data site. Over the years they have had various terrible GUI’s to download data, but I almost always just go to the FTP site directly.

For geo data, check out https://www2.census.gov/geo/tiger/TIGER2019/ for example. Python for pandas/geopandas also has the nicety that you can point to a url (even a url of a zip file), and load in the data in memory. So to get the MSA areas was very simple:

# Example download MSA
import geopandas as gpd
from matplotlib import pyplot as plt

url_msa = r'https://www2.census.gov/geo/tiger/TIGER2019/CBSA/tl_2019_us_cbsa.zip'
msa = gpd.read_file(url_msa)
msa.plot()
plt.show()

Sometimes the census has files spread across multiple states. So here is an example of doing some simple scraping to get all of the census tracts in the US. You can combine the geopandas dataframes the same as pandas dataframes using pd.concat:

# Example scraping all of the zip urls on a page
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests

def get_zip(url):
    front_page = requests.get(url,verify=False)
    soup = BeautifulSoup(front_page.content,'html.parser')
    zf = soup.find_all("a",href=re.compile(r"zip"))
    # Maybe should use href 
    zl = [os.path.join(url,i['href']) for i in zf]
    return zl

base_url = r'https://www2.census.gov/geo/tiger/TIGER2019/TRACT/'
res = get_zip(base_url)

geo_tract = []
for surl in res:
    geo_tract.append(gpd.read_file(surl))

geo_full = pd.concat(geo_tract)

# See State FIPS codes
# https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696

geo_full[geo_full['STATEFP'] == '01'].plot()
plt.show()

Unfortunately for the census data tables, such as https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/5_year_seq_by_state/Alabama/Tracts_Block_Groups_Only/, those zip files contain two files (an estimate file and a margin of error file), so you cannot just do pd.read_csv(url) for those tables. But for the shapefile zip files this appears to work just fine and dandy.

I am currently working on a project at work (but Gainwell has given me the thumbs up to open source it) to build tables to create the CDC’s Social Vulnerability Index, which I can build for multiple geographies in combo with the census data. So hopefully in the next few weeks will be able to share that work.

Some ACS download helpers and Research Software Papers

The blog has been a bit sparse recently, as moving has been kicking my butt (hanging up curtains and recycling 100 boxes today!). So just a few quick notes.

Downloading ACS Data

First, I have posted some helper functions to work with American Community Survey data (ACS) in python. For a quick overview, if you import/define those functions, here is a quick example of downloading the 2019 Texas micro level files (for census tracts and block groups) from the census FTP site. Can pipe in another year (if available) and and whatever state into the function.

# Python code to download American Community Survey data
base = r'??????' #put your path here where you want to download data
temp = os.path.join(base,'2019_5yr_Summary_FileTemplates')
data = os.path.join(base,'tables')

get_acs5yr(2019,'Texas',base)

Some locations have census tract data to download, I think the FTP site is the only place to download block group data though. And then based on those files you downloaded, you can then grab the variables you want, and here I show selecting out the block groups from those fields:

interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
            'B11003_016','B11003_013','B14006_002','B01001_003','B23025_005',
            'B22010_002','B16002_004','GEOID','NAME']
labs, comp_tabs = merge_tabs(interest,temp,data)
bg = comp_tabs['NAME'].str.find('Block Group') == 0

Then based on that data, I have an additional helper function to calculate proportions given two lists of the numerators and denominators that you want:

top = ['B17010_002',['B11003_016','B11003_013'],'B08141_002']
bot = ['B17010_001',        'B11002_001'       ,'B08141_001']
nam = ['PovertyFamily','SingleHeadwithKids','NoCarWorkers']
prep_sdh = prop_prep(bg, top, bot, nam)

So here to do Single Headed Households with kids, you need to add in two fields for the numerator ['B11003_016','B11003_013']. I actually initially did this example with census tract data, so not sure if all of these fields are available at the block group level.

I have been doing some work on demographics looking at the social determinants of health (see SVI data download, definitions), hence the work with census data. I have posted my prior example fields I use from the census, but criminologists may just use the social-vulnerability-index from the CDC – it is essentially the same as how people typically define social disorganization.

Peer Review for Criminology Software

Second, jumping the gun a bit on this, but in the works is an overlay journal for CrimRxiv. Part of the contributions we will accept are software contributions, e.g. if you write an R package to do some type of analysis function common in criminology.

It is still in the works, but we have some details up currently and a template for submission (I need to work on a markdown template, currently just a word doc). High level I wanted something like the Journal of Statistical Software or the Journal of Open Source Software (I do not think the level of detail of JSS is necessary, but wanted an example use case, which JoSS does not have).

Just get in touch if you have questions whether your work is on topic. Aim is to be more open to contributions at first. Really excited about this, as publicly sharing code is currently a thankless prospect. Having a peer reviewed venue for such code contributions for criminologists fills a very important role that traditional journals do not.

Future Posts?

Hopefully can steal some time to continue writing posts here and there, but will definitely be busy getting the house in order in the next month. Hoping to do some work on mapping grids and KDE in python/geopandas, and writing about the relationship between healthcare data and police incident report data are two topics I hope to get some time to work on in the near future for the blog.

If folks have requests for particular topics on the blog though feel free to let me know in the comments or via email!

Buffers and hospital deserts with geopandas

Just a quick blog post today. As a bit of a side project at work I have been looking into medical service provider deserts. Most people simply use a geographic cutoff of say 1 mile (see Wissah et al., 2020 for example for Pharmacy deserts). Also for CJ folks, John Hipp has done some related work for parolees being nearby service providers (Hipp et al., 2009; 2011), measuring nearby as 2 miles.

So I wrote some code to calculate nice sequential buffer areas and dissolve them in geopandas. Files and code to showcase are here on GitHub. First, as an example dataset, I geocode (using the census geocoding API) CMS certified Home Healthcare facilities, so these are hospice facilities. To see a map of those facilities across the US, and you can click on the button to get info, go to here, CMS HOME FACILITY MAP. Below is a screenshot:

Next I then generate sequential buffers in kilometers of 2, 4, 8, 16, and then the leftover (just for Texas). So you can then zoom in and darker areas are at a higher risk of not having a hospice facility nearby. HOSPICE DESERT MAP

Plotting some of these in Folium were giving me fits, so I will need to familiarize myself with that more in the future. The buffers for the full US as well were giving me trouble (these just for Texas result in fairly large files, surprised Github doesn’t yell at me for them being too big).

Going forward, what I want to do is instead of relying on a fixed function of distance, is to fit a model to identify individuals probability of going to the doctor based on distance. So instead of just saying 1+ mile and you are at high risk, fit a function that defines that distance based on behavioral data (maybe using insurance claims). Also I think the distances matter quite a bit for urban/rural and car/no-car. So rural folks traveling a mile is not a big deal, since you need a car to really do anything in rural areas. But for folks in the city relying on public transportation going a mile or two is a bigger deal.

The model then would be similar to the work I did with Gio on gunshot death risk (Circo & Wheeler, 2020), although I imagine the model would spatially vary (so maybe geographically weighted regression may work out well).

References