Downloading Social Vulnerability Index data

So Gainwell has let me open source one of the projects I have been working on at work – a python package to download SVI data. The SVI is an index created by the CDC to identify areas of high health risk in four domains based on census data (from the American Community Survey).

For my criminologist friends, these are very similar to the variables we typically use to measure social disorganization (see Wheeler et al., 2018 for one example criminology use case). It is a simple python install, pip install svi-data, and then you can get to work. Here is a simple example downloading zip code data for the entire US.

import numpy as np
import pandas as pd
import svi_data

# Need to sign up for your own key
key = svi_data.get_key('census_api.txt')

# Download the data from census API
svi_zips = svi_data.get_svi(key,'zip',2019)
svi_zips['zipcode'] = svi_zips['GEO_ID'].str[-5:]

Note I deviate from the CDC definition in a few ways. One is that when I create the themes, instead of using percentile rankings, I z-score the variables. It will likely result in very similar correlations, but it is somewhat more generalizable across different samples. (I also change the denominator for single parent heads of households to the number of families instead of the number of households; I think that is likely just an error on CDC’s part.)
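Just to illustrate the idea, here is a sketch of the z-score-and-sum logic (this is not the package internals; the EP_ variable names are the poverty theme items used below):

# Illustrative sketch only, not the package internals
theme1_vars = ['EP_POV','EP_UNEMP','EP_PCI','EP_NOHSDP']
zs = (svi_zips[theme1_vars] - svi_zips[theme1_vars].mean())/svi_zips[theme1_vars].std()
zs['EP_PCI'] *= -1  # higher per capita income means lower vulnerability
theme1_z = zs.sum(axis=1)  # analog to the CDC percentile-rank sum for theme 1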

Summed Index vs PCA

So here, let’s quickly check out my z-score approach versus a factor analytic approach via PCA. Here I just focus on the poverty theme:

pov_vars = ['EP_POV','EP_UNEMP','EP_PCI','EP_NOHSDP','RPL_THEME1']
svi_pov = svi_zips[['zipcode'] + pov_vars ].copy()

from sklearn import decomposition
from sklearn.preprocessing import scale

svi_pov.corr()

Note the per capita income has a negative correlation, but you can see the index works as expected – lower correlations for each individual item, but fairly high correlation with the summed index.

Let’s see what the index would look like if we used PCA instead:

pca = decomposition.PCA()
sd = scale(svi_pov[pov_vars[:-1]])
pc = pca.fit_transform(sd)
svi_pov['PC1'] = pc[:,0]
svi_pov.corr() #almost perfect correlation

You can see that besides the negative value, we have an almost perfect correlation between the first principal component vs the simpler sum score.

One benefit of PCA though is a bit more of a structured approach to understanding the resulting indices. So we can see via the eigenvalues that the first PC only explains about 50% of the variance.

print(pca.explained_variance_ratio_)

And if we look at the loadings, we can see a more complicated pattern of residual loadings for each successive factor.

comps = pca.components_.T
cols = ['PC' + str(i+1) for i in range(comps.shape[1])]
load_dat = pd.DataFrame(comps,columns=cols,index=pov_vars[:-1])
print(load_dat)

So PC3, for example, picks up areas with high rates of no high school diploma as well as high per capita income. So higher level components can potentially identify weirder scenarios, which healthcare providers probably don’t care about so much, but it is a useful thing to know for exploratory data analysis.

Mapping

Since these are via census geographies, we can of course map them. (Here I grab zipcodes, but the code can download counties or census tracts as well.)

We can download the census geo data directly into a geopandas dataframe. Here I download the zip code tabulation areas, grab the outline of Raleigh, and then only plot zips that intersect with Raleigh.

import geopandas as gpd
import matplotlib.pyplot as plt

# Getting the spatial zipcode tabulation areas
zip_url = r'https://www2.census.gov/geo/tiger/TIGER2019/ZCTA5/tl_2019_us_zcta510.zip'
zip_geo = gpd.read_file(zip_url)
zip_geo.rename(columns={'GEOID10':'zipcode'},inplace=True)

# Merging in the SVI data
zg = zip_geo.merge(svi_pov,on='zipcode')

# Getting outline for Raleigh
ncp_url = r'https://www2.census.gov/geo/tiger/TIGER2019/PLACE/tl_2019_37_place.zip'
ncp_geo = gpd.read_file(ncp_url)
ral = ncp_geo[ncp_geo['NAME'] == 'Raleigh'].copy()
ral_proj = 'EPSG:2278'
ral_bord = ral.to_crs(ral_proj)

ral_zips = gpd.sjoin(zg,ral,how='left')
ral_zips = ral_zips[~ral_zips['index_right'].isna()].copy()
ral_zipprof = ral_zips.to_crs(ral_proj)

# Making a nice geopandas static map, zoomed into Raleigh

fig, ax = plt.subplots(figsize=(6,6), dpi=100)

# Raleigh's boundary is crazy
#ral_bord.boundary.plot(color='k', linewidth=1, edgecolor='k', ax=ax, label='Raleigh')
ral_zipprof.plot(column='RPL_THEME1', cmap='PRGn',
                 legend=True,
                 edgecolor='grey',
                 ax=ax)

# via https://stackoverflow.com/a/42214156/604456
ral_zipprof.apply(lambda x: ax.annotate(text=x['zipcode'], xy=x.geometry.centroid.coords[0], ha='center'), axis=1)

ax.get_xaxis().set_ticks([])
ax.get_yaxis().set_ticks([])

plt.show()

I prefer to use smaller geographies when possible, so I think zip codes are about the largest areas that are reasonable to use this for (although I do have the ability to download this for counties). Zip codes, since they don’t nicely overlap city boundaries, can cause particular issues in data analysis as well (Grubesic, 2008).

Other Stuff

I have a notebook in the github repo showing how to grab census tracts, as well as how to modify the exact variables you can download.

It does allow you to specify a year as well (in the notebook I show you can do the 2018 SVI for the 16/17/18/19 data at least). Offhand for these small geographies I would only expect small changes over time (see Miles et al., 2016 for an example looking at SES).
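For example, the same zip code call as earlier, just passing an earlier year (the notebook in the repo covers the exact arguments for the other geographies):

# Same call as before, just a different year
svi_zips_2018 = svi_data.get_svi(key,'zip',2018)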

One of the things I think has more value added (and hopefully I can get some time to do more on this at Gainwell) is to peg these metrics to actual health outcomes – so instead of making an index for SES, look at micro level demographics for health outcomes, and then post-stratify based on census data to get estimates across the US. But that being said, the SVI often does have reasonable correlations to actual geospatial health outcomes; see Learnihan et al. (2022) for one example showing that, for medication adherence, the SVI is a better predictor than distance to the pharmacy.

Use circles instead of choropleth for MSAs

We are homeschooling the kiddo at the moment (the plunge was prompted by reading Bryan Caplan’s approach, and by seeing with online schooling just how poor middle school education was). My wife is going through AP biology at the moment, and we looked up various job info on biomedical careers. We subsequently came across this gem of a map of MSA estimates from the Bureau of Labor Statistics (BLS) Occupational Employment and Wage Statistics (OES) series.

I was actually mapping some metropolitan statistical areas (MSAs) at work the other day, and these are just terrifically bad geo areas to show via a choropleth map. All choropleth maps have the issue of varying size areas, but I never realized that having somewhat regular borders (more straight lines) makes the state and county maps not so bad – these MSA areas though are tough to look at. (My wife says it scintillates for her if she looks too closely.)

There are various incredibly tiny MSAs next to giant ones that you will just never see in these maps (no matter what color scheme you use). Nevada confused me for quite a bit, until I zoomed in to see that there are 4 areas; Reno is just a tiny squib.

Another example is Boulder above Denver. (Look closely at the BLS map I linked, you can just make out Boulder if you squint, but I cannot tell what color it corresponds to in the legend.) The outline-heavy OES maps, which are mostly missing data, are just hopeless to display effectively like this. Reno could be the hottest market for whatever job, and it will always be lost in this map if you show employment via the choropleth approach. So of course I spent the weekend hacking together some maps in python and folium.

The BLS has a public API, but I was not able to find the OES stats in that. But if you go through the motions of querying the data and muck around in the source code for those queries, you can see they have an undocumented API call to generate json to fill the tables. Then using this tool to convert the json calls to python (thank you Hacker News), I was able to get those tables into python.

I have these functions saved on github, so check out that source for the nitty gritty. But just quickly, here is a replicated choropleth map showing the total employees for bio jobs (you can go here to look up the codes, or run my function bls_maps.ocodes() to get a pandas dataframe of those fields).

# Creating example bls maps
from bls_geo import *
import matplotlib.pyplot as plt  # needed for plt.show() below

# can check out https://www.bls.gov/oes/current/oes_stru.htm
bio = '172031'
bio_stats = oes_geo(bio)
areas = get_areas() # this takes a few minutes
state = state_albers()
geo_bio = merge_occgeo(bio_stats,areas)

ax = geo_bio.plot(column='Employment',cmap='inferno',legend=True,zorder=2)
state.boundary.plot(ax=ax,color='grey',linewidth=0.5,zorder=1)
ax.set_ylim(0.1*1e6,3.3*1e6)
ax.set_xlim(-0.3*1e7,0.3*1e7)   # lower 48 focus (for Albers proj)
ax.set_axis_off()
plt.show()

And that is not much better than the BLS’s version. For this data, if you are just interested in looking up or seeing the top metro areas, just doing a table, e.g. above geo_bio.to_excel('biojobs.xlsx'), works just as well as a map.
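For example, a quick peek at the top metro areas (this assumes the merged geodataframe keeps the same column names used in the folium call further below):

# Quick table of the top metro areas instead of a map
top_bio = geo_bio.sort_values('Employment', ascending=False)
print( top_bio[['areaName','Employment','Annual mean wage']].head(10) )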

So I was surprised to see Minneapolis pop up at the top of that list (and also surprised Raleigh doesn’t make the list at all, but Durham has a few jobs). But if you insist on seeing spatial trends, I prefer the approach of mapping proportional or graduated circles, placing the points at the centroid of the MSA:

att = ['areaName','Employment','Location Quotient','Employment per 1,000 jobs','Annual mean wage']
form = ['',',.0f','.2f','.2f',',.0f']

map_bio = fol_map(geo_bio,'Employment',['lat', 'lon'],att,form)
#map_bio.save('biomap.html')
map_bio #if in jupyter can render like this

I am too lazy to make a legend, but you can check out nbviewer to see an interactive Folium map, which has tooltips (similar to the hover for the BLS maps).

Forgive my CSS/HTML skills, not sure how to make nicer popups. So you lose the exact areas these MSA cover in this approach, but I really only expect a general sense from these maps anyway.

These functions are general enough for whatever wage series you want (although these functions will likely break when the 2021 data comes out). So here is the OES table for data science jobs:

I feel going for the 90th percentile (mapping that to the 10 times programmer) is a bit over the top, but I can see myself reasonably justifying the 75th percentile. (Unfortunately these aggregate tables don’t have a way to adjust for years of experience; if you know of a BLS micro product I could do that with, let me know!) So you can see here the somewhat inflated salaries for the San Francisco Bay Area, but not as inflated as many might have you think (and to be clear, these are for 2020 survey estimates).

If we look at a map of data science jobs, varying the circles by that 75th annual wage percentile, it looks quite uniform. What happens is we have some real low outliers (wages under 70k), resulting in tiny circles (such as Athens, GA). Most of the other metro regions though are well over 100k.

In more somber news, those interactive maps are built using Leaflet as the backend, which was created by a Ukrainian citizen, Vladimir Agafonkin. We can do amazing things with open source code, but we should always remember it is on the backs of someone’s labor that we are able to do those things.

Downloading geo files from Census FTP using python

I was working with some health data that only has MSA identifiers the other day. Not many people seem to know about the US Census’s FTP data site. Over the years they have had various terrible GUIs to download data, but I almost always just go to the FTP site directly.

For geo data, check out https://www2.census.gov/geo/tiger/TIGER2019/ for example. Pandas/geopandas also have the nicety that you can point to a url (even a url of a zip file) and load the data in memory. So getting the MSA areas is very simple:

# Example download MSA
import geopandas as gpd
from matplotlib import pyplot as plt

url_msa = r'https://www2.census.gov/geo/tiger/TIGER2019/CBSA/tl_2019_us_cbsa.zip'
msa = gpd.read_file(url_msa)
msa.plot()
plt.show()

Sometimes the census has files spread across multiple states. So here is an example of doing some simple scraping to get all of the census tracts in the US. You can combine the geopandas dataframes the same as pandas dataframes using pd.concat:

# Example scraping all of the zip urls on a page
from bs4 import BeautifulSoup
import os
import pandas as pd
import re
import requests

def get_zip(url):
    front_page = requests.get(url,verify=False)
    soup = BeautifulSoup(front_page.content,'html.parser')
    zf = soup.find_all("a",href=re.compile(r"zip"))
    # Maybe should use href 
    zl = [os.path.join(url,i['href']) for i in zf]
    return zl

base_url = r'https://www2.census.gov/geo/tiger/TIGER2019/TRACT/'
res = get_zip(base_url)

geo_tract = []
for surl in res:
    geo_tract.append(gpd.read_file(surl))

geo_full = pd.concat(geo_tract)

# See State FIPS codes
# https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696

geo_full[geo_full['STATEFP'] == '01'].plot()
plt.show()

Unfortunately for the census data tables, such as https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/5_year_seq_by_state/Alabama/Tracts_Block_Groups_Only/, those zip files contain two files (an estimate file and a margin of error file), so you cannot just do pd.read_csv(url) for those tables. But for the shapefile zip files this appears to work just fine and dandy.
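If you did need one of those tables, one workaround is to pull the zip archive into memory and read out the member file you want. Here is a hedged sketch – the exact zip file name is a guess (browse the directory listing for the real ones), and treating the 'e' prefix as the estimate file is an assumption:

# Sketch: reading one member out of a two-file zip archive
import io
import zipfile
import requests
import pandas as pd

seq_url = r'https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/5_year_seq_by_state/Alabama/Tracts_Block_Groups_Only/20195al0001000.zip'
res = requests.get(seq_url)
zf = zipfile.ZipFile(io.BytesIO(res.content))
print(zf.namelist())  # the estimate file and the margin of error file
est_file = [f for f in zf.namelist() if f.startswith('e')][0]  # 'e' prefix is an assumption
est_dat = pd.read_csv(zf.open(est_file), header=None)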

I am currently working on a project at work (but Gainwell has given me the thumbs up to open source it) to build tables to create the CDC’s Social Vulnerability Index, which I can build for multiple geographies in combo with the census data. So hopefully in the next few weeks will be able to share that work.

Optimal and Fair Spatial Siting

A bit of a belated MLK day post. Much of the popular news on predictive or machine learning algorithms has a negative connotation, often that they are racially biased. I tend to think about algorithms though in almost the exact opposite way – we can adjust them to suit our objectives. We just need to articulate what exactly we mean by fair. This goes for predictive policing (Circo & Wheeler, 2021; Liberatore et al., 2021; Mohler et al., 2018; Wheeler, 2020) as much as it does for any application.

I have been reading a bit about spatial fairness in siting health resources recently; one example is the Urban Institute’s Equity Data tool. For this tool, you put in where your resources are currently located, and it tells you whether those locations are in areas that have demographic breakdowns like the overall city. So this uses the container approach (not distance to the resources), whereas distance traveled to resources is probably a more typical way to evaluate fair spatial access to resources (Hassler & Ceccato, 2021; Koschinsky et al., 2021).

Here what I am going to show is that instead of ex-ante saying whether the siting of resources is fair, I construct an integer linear program to site resources in a way we define to be fair. So imagine that we are siting 3 different locations to do rapid Covid testing around a city. Well, we do the typical optimization and minimize the distance traveled for everyone in the city on average to those 3 locations – on average 2 miles. But then we see that white people on average travel 1.9 miles, and minorities travel 2.2 miles. So that does not seem so fair, does it?

I created an integer linear program to take this difference into account, so instead of minimizing average distance, it minimizes:

White_distance + Minority_distance + |White_distance - Minority_distance|

So in our example above, if we had a solution where whites travel 2.1 and minorities travel 2.1, this would have a lower objective value (4.2) than the original minimize-overall-travel solution (1.9 + 2.2 + 0.3 = 4.4). So this gives each group equal weight, as well as penalizes solutions where one group (either whites or minorities) travels much farther than the other.
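To make the objective concrete, here is a minimal PuLP sketch with made-up toy distances. This is not the code in my repo, just an illustration of how the siting choice and the absolute value term can be written as a linear program:

# Toy sketch of the fair siting objective (pip install pulp); all numbers are made up
import pulp

dist = {('w1','a'):1.0, ('w1','b'):2.0, ('w1','c'):3.0,
        ('w2','a'):2.5, ('w2','b'):1.5, ('w2','c'):2.0,
        ('m1','a'):3.0, ('m1','b'):1.0, ('m1','c'):2.5,
        ('m2','a'):2.0, ('m2','b'):2.5, ('m2','c'):1.0}
white = ['w1','w2']
minority = ['m1','m2']
demand = white + minority
sites = ['a','b','c']
k = 1  # number of facilities to site

prob = pulp.LpProblem('fair_siting', pulp.LpMinimize)
open_s = pulp.LpVariable.dicts('open', sites, cat='Binary')
assign = pulp.LpVariable.dicts('assign', [(i,j) for i in demand for j in sites], cat='Binary')
gap = pulp.LpVariable('gap', lowBound=0)  # |white avg - minority avg|

white_avg = pulp.lpSum(dist[i,j]*assign[(i,j)] for i in white for j in sites)*(1.0/len(white))
minor_avg = pulp.lpSum(dist[i,j]*assign[(i,j)] for i in minority for j in sites)*(1.0/len(minority))

prob += white_avg + minor_avg + gap   # the objective described above
prob += gap >= white_avg - minor_avg  # linearizing the absolute value
prob += gap >= minor_avg - white_avg
prob += pulp.lpSum(open_s[j] for j in sites) == k
for i in demand:
    prob += pulp.lpSum(assign[(i,j)] for j in sites) == 1  # everyone assigned to one site
    for j in sites:
        prob += assign[(i,j)] <= open_s[j]                 # only to opened sites

prob.solve()
print([j for j in sites if open_s[j].value() > 0.5])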

I am not going to go into all the details. I have python code that has the functions (it is very similar to my P-median model, Wheeler, 2018). The code shows an example of siting 5 locations in Dallas (and uses census block group centroids for the demographic data). Here is a map of the results (it has points outside of the city, since block groups don’t perfectly line up with the city boundaries).

In this example, if we choose 5 locations in the city to minimize the overall distance, the average travel is just shy of 3.5 miles. The average travel for white people (not including Hispanics) is 3.25 miles, and for minorities is 3.6 miles. When I use my fair algorithm, the white average distance is 3.5 miles, and the minority average distance is 3.6 miles (minorities travel under 200 feet more than whites on average).

So this is ultimately a trade off – it ends up pushing up the average distance a white person will travel, and only slightly pushes down the minority travel, to balance the overall distances between the two groups. It is often the case though that one can be somewhat more fair with only slight trade-offs in the overall objective function (Rodolfa et al., 2021). So that trade off is probably worth it here.

ptools feature engineering vignette update

For another update to my ptools R package in progress, I have added a vignette to go over the spatial feature engineering functions I have organized. These include creating vector spatial features (grid cells, hexagons, or Voronoi polygons), as well as RTM style features on the right hand side (e.g. distance to nearest, kernel density estimates at those polygon centroids, different weighted functions ala egohoods, etc.)

If you install the package with vignettes turned on, you can see it:

# via the devtools (or remotes) package
devtools::install_github("apwheele/ptools", build_vignettes = TRUE)
vignette("spat-feateng")

Here is an example of hexgrids over NYC (I have datasets for NYC Shootings, NYC boroughs, NYC Outdoor Cafes, and NYC liquor licenses to illustrate the functions).

The individual functions I think are reasonably documented, but it is somewhat annoying to get an overview of them all. If you go to something like “?Documents/R/win-library/4.1/ptools/html/00Index.html” (or wherever your package installation folder is) you can see all of the functions currently in the package in one place (is there a nice way to pull this up using help()?). But between this vignette and the Readme on the front github page you get a pretty good overview of the current package functionality.

I am still flip flopping on whether to bother to submit to CRAN. Installing from github is so easy that I am not sure it is worth the hassle while I continually add new things to the package. And I foresee tinkering with it for an extended period of time.

Always feel free to contribute; I want to not only add more functions, but also continue to add unit tests and more vignettes.

Geocoding the CMS NPI Registry (python)

So previously I wrote about creating service deserts. I have since found a nicer data source to use for this, the CMS NPI registry. This data file has over 6 million service providers across the US.

Here I will provide an example of using that data to geocode all the pharmacies in Texas, again using the census geocoding API and python.

Chunking up the NPI database

So first, you can again download the entire NPI database from here. I have already downloaded and unzipped that file, which contains a CSV for the January version, named npidata_pfile_20050523-20210110.csv. As some upfront setup, here are the libraries I will be using, and I also set the directory to where my data is located.

###############################
import pandas as pd
import numpy as np
import censusgeocode as cg
import time
from datetime import datetime
import os
os.chdir(r'D:\HospitalData\NPPES_Data_Dissemination_January_2021')
###############################

The file is just a bit too big for me to fit in memory on my machine. On Windows, you can use Get-Content npidata_pfile_20050523-20210110.csv | Measure-Object -Line in powershell to get the line counts, or on Unix use wc -l *.csv for example. So I know the file is not quite 6.7 million rows.

So what I do here is create a function to read in the csv file in chunks, only select the columns and rows that I want, and then return that data frame. In the end, you need to search across all of the Taxonomy codes to pull out the type of service provider you want. So for community pharmacies, the code is 3336C0003X, but it is not always in the first Taxonomy slot (some CVS’s have it in the second slot for example). You can see the big list of taxonomy codes here; my criminology friends may be interested in mental health or substance abuse service providers for other examples.

In addition to the taxonomy code, I always select organizations, not individuals (Entity Type = 2). And then I only select out pharmacies in Texas (although I bet you could fit all of the US pharmacies in memory pretty easily, maybe 60k in total?). Caveat emptor, I am not 100% sure how to use the deactivation codes properly in this database, as that data is always NaN for Texas pharmacies.

######################################################################
# Prepping the input data in chunks

keep_col = ['NPI','Entity Type Code','Provider Organization Name (Legal Business Name)',
            'NPI Deactivation Reason Code','NPI Deactivation Date','NPI Reactivation Date',
            'Provider First Line Business Practice Location Address',
            'Provider Business Practice Location Address City Name',
            'Provider Business Practice Location Address State Name',
            'Provider Business Practice Location Address Postal Code']
            
taxon_codes = ['Healthcare Provider Taxonomy Code_' + str(i+1) for i in range(15)]
keep_col += taxon_codes
community_pharm = '3336C0003X'
npi_csv = 'npidata_pfile_20050523-20210110.csv' #Newer files will prob change the name

# This defines the rows I want
def sub_rows(data):
    ec = data['Entity Type Code'] == "2"
    st = data['Provider Business Practice Location Address State Name'] == 'TX'
    ta = (data[taxon_codes] == community_pharm).any(axis=1)
    #ac = data['NPI Deactivation Reason Code'].isna()
    all_together = ec & st & ta #& ac
    sub = data[all_together]
    return sub

def csv_chunks(file,chunk_size,keep_cols,row_sub):
    # First let's get the header and figure out the column indices
    header_fields = list(pd.read_csv(file, nrows=1))
    header_locs = [header_fields.index(i) for i in keep_cols]
    # Now reading in a chunk of data
    skip = 1
    it_n = 0
    sub_n = 0
    ret_chunk = chunk_size
    fin_li_dat = []
    while ret_chunk == chunk_size:
        file_chunk = pd.read_csv(file, usecols=header_locs, skiprows=skip, 
                     nrows=chunk_size, names=header_fields, dtype='str')
        sub_dat = row_sub(file_chunk)
        fin_li_dat.append( sub_dat.copy() )
        skip += chunk_size
        it_n += 1
        sub_n += sub_dat.shape[0]
        print(f'Grabbed iter {it_n} total sub n so far {sub_n}')
        ret_chunk = file_chunk.shape[0]
    fin_dat = pd.concat(fin_li_dat, axis=0)
    return fin_dat


# Takes about 3 minutes
print( datetime.now() )
pharm_tx = csv_chunks(npi_csv, chunk_size=1000000, keep_cols=keep_col, row_sub=sub_rows)
print( datetime.now() )

# No deactivated codes in all of Texas
print( pharm_tx['NPI Deactivation Reason Code'].value_counts() )
######################################################################

So this ends up returning not quite 6800 pharmacies in all of Texas.

Geocoding using the census API

So first, the address data is pretty well formatted. But for those new to geocoding, if you have end parts of address strings like Apt 21 or Suite C, those endings will typically throw geocoders off the mark. So in just a few minutes, I noted the different strings that marked the parts of the addresses I should chop off, and wrote a function to clean those up. Besides that I just limit the zip code to 5 digits, as that field is a mix of 5 and 9 digit zipcodes.

######################################################################
# Now prepping the data for geocoding

ph_tx = pharm_tx.drop(columns=taxon_codes).reset_index(drop=True)

#['Provider First Line Business Practice Location Address', 'Provider Business Practice Location Address City Name', 'Provider Business Practice Location Address State Name', 'Provider Business Practice Location Address Postal Code']

# I just looked through the files and saw that after these strings are not needed
end_str = [' STE', ' SUITE', ' BLDG', ' TOWER', ', #', ' UNIT',
           ' APT', ' BUILDING',',', '#']

 
def clean_add(address):
    add_new = address.upper()
    for su in end_str:
        sf = add_new.find(su)  # search the uppercased string
        if sf > -1:
            add_new = add_new[0:sf]
    add_new = add_new.replace('.','')
    add_new = add_new.strip()
    return add_new

# Some examples
clean_add('5700 S GESSNER DR STE G')
clean_add('10701-B WEST BELFORT SUITE 170')
clean_add('100 EAST UNIVERSITY BLVD.')
clean_add('5800 BELLAIRE BLVD BLDG 1')
clean_add('2434 N I-35 # S')

ph_tx['Zip5'] = ph_tx['Provider Business Practice Location Address Postal Code'].str[0:5]
ph_tx['Address'] = ph_tx['Provider First Line Business Practice Location Address'].apply(clean_add)
ph_tx.rename(columns={'Provider Business Practice Location Address City Name':'City',
                      'Provider Business Practice Location Address State Name':'State2'},
             inplace=True)
######################################################################

Next is my function to use the batch geocoding in the census API. Note the census API is a bit finicky – technically it says you can do batches of up to 5k rows, but I tend to get kicked off for higher values. So here I have a function that chunks the data up into tinier batch portions and submits them to the API. (A better function would cache intermediate results and wrap all that jazz in a try function.)

######################################################################
#This function breaks up the input data frame into chunks
#For the census geocoding api
def split_geo(df, add, city, state, zipcode, chunk_size=500):
    df_new = df.copy()
    df_new.reset_index(inplace=True)
    splits = np.ceil( df.shape[0]/chunk_size)
    chunk_li = np.array_split(df_new['index'], splits)
    res_li = []
    pick_fi = []
    for i,c in enumerate(chunk_li):
        # Grab data, export to csv
        sub_data = df_new.loc[c, ['index',add,city,state,zipcode]]
        sub_data.to_csv('temp_geo.csv',header=False,index=False)
        # Geo the results and turn back into df
        print(f'Geocoding round {int(i)+1} of {int(splits)}, {datetime.now()}')
        result = cg.addressbatch('temp_geo.csv') #should try/except?
        # May want to dump the intermediate results
        #pi_str = f'pickres_{int(i)}.p'
        #pickle.dump( favorite_color, open( pi_str, "wb" ) )
        #pick_fi.append(pi_str.copy())
        names = list(result[0].keys())
        res_zl = []
        for r in result:
            res_zl.append( list(r.values()) )
        res_df = pd.DataFrame(res_zl, columns=names)
        res_li.append( res_df.copy() )
        time.sleep(10) #sleep 10 seconds to not get cutoff from request
    final_df = pd.concat(res_li)
    final_df.rename(columns={'id':'row'}, inplace=True)
    final_df.reset_index(inplace=True, drop=True)
    # Clean up csv file
    os.remove('temp_geo.csv')
    return final_df
######################################################################
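Just to flesh out the caching/try-except comment above, a hypothetical version of the inner API call could look something like this (not what I actually ran):

######################################################################
# Hypothetical sketch of caching + try/except around the batch call
import pickle

def geo_chunk_cached(i, csv_file='temp_geo.csv'):
    pick_fi = f'pickres_{int(i)}.p'
    if os.path.exists(pick_fi):
        # this chunk was already geocoded on a prior run
        return pickle.load(open(pick_fi, 'rb'))
    try:
        result = cg.addressbatch(csv_file)
    except Exception as e:
        print(f'Chunk {i} failed, {e}')
        return None
    pickle.dump(result, open(pick_fi, 'wb'))
    return result
######################################################################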

And now we are onto the final stage, actually running the geocoding function and piping the end results to a csv file. (You can see the final version here.)

######################################################################
# Geocoding the data in chunks

# Takes around 35 minutes
geo_pharm = split_geo(ph_tx, add='Address', city='City', state='State2', zipcode='Zip5', chunk_size=100)

#What is the geocoding hit rate?
print( geo_pharm['match'].value_counts() )
# Only around 65%

# Now merging back with the original data if you want
# Not quite sorted how I need them
geo_pharm['rowN'] = geo_pharm['row'].astype(int)
gp2 = geo_pharm.sort_values(by='rowN').reset_index(drop=True)

# Fields I want
kg = ['address','match','lat','lon']
kd = ['NPI',
      'Provider Organization Name (Legal Business Name)',
      'Provider First Line Business Practice Location Address',
      'Address','City','State2','Zip5']

final_pharm = pd.concat( [ph_tx[kd], gp2[kg]], axis=1 )

final_pharm.to_csv('Pharmacies_Texas.csv',index=False)
######################################################################

Unfortunately the geocoding hit rate is pretty disappointing, only around 65% in this sample. So if I were using this for a project, I would likely do a round of geocoding using the Google API (which is a bit more forgiving of varied addresses), or perhaps build my own OpenStreetMap geocoder for the US. (Or in general if you don’t have too many to review, doing it interactively in ArcGIS is very nice as well, if you have access to Arc.)

Buffers and hospital deserts with geopandas

Just a quick blog post today. As a bit of a side project at work I have been looking into medical service provider deserts. Most people simply use a geographic cutoff of, say, 1 mile (see Wissah et al., 2020 for an example on pharmacy deserts). Also for CJ folks, John Hipp has done some related work on parolees being near service providers (Hipp et al., 2009; 2011), measuring nearby as 2 miles.

So I wrote some code to calculate nice sequential buffer areas and dissolve them in geopandas. Files and code to showcase are here on GitHub. First, as an example dataset, I geocode (using the census geocoding API) CMS certified home healthcare facilities, so these are hospice facilities. To see a map of those facilities across the US (you can click on a facility to get info), go here: CMS HOME FACILITY MAP. Below is a screenshot:

Next I generate sequential buffers of 2, 4, 8, and 16 kilometers, and then the leftover area (just for Texas). So you can zoom in, and darker areas are at higher risk of not having a hospice facility nearby. HOSPICE DESERT MAP
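The actual code is in the GitHub repo linked above, but a minimal sketch of the sequential buffer/dissolve logic in geopandas looks something like this (the facility file name is hypothetical, and the choice of CONUS Albers for a meters-based CRS is my assumption):

# Minimal sketch of sequential dissolved buffers (not the exact repo code)
import geopandas as gpd

fac = gpd.read_file('facilities.shp').to_crs('EPSG:5070')  # hypothetical points, reprojected to meters

rings = []
prior = None
for km in [2, 4, 8, 16]:
    buff = fac.buffer(km*1000).unary_union  # dissolve all facility buffers into one polygon
    ring = buff if prior is None else buff.difference(prior)  # keep the band beyond the prior distance
    rings.append({'dist_km': km, 'geometry': ring})
    prior = buff

ring_gdf = gpd.GeoDataFrame(rings, geometry='geometry', crs=fac.crs)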

Plotting some of these in Folium was giving me fits, so I will need to familiarize myself with that more in the future. The buffers for the full US were giving me trouble as well (even just for Texas they result in fairly large files; I am surprised GitHub doesn’t yell at me for them being too big).

Going forward, what I want to do, instead of relying on a fixed distance cutoff, is to fit a model of individuals’ probability of going to the doctor based on distance. So instead of just saying 1+ mile and you are at high risk, fit a function that defines that distance based on behavioral data (maybe using insurance claims). Also I think the distances matter quite a bit for urban/rural and car/no-car. So rural folks traveling a mile is not a big deal, since you need a car to really do anything in rural areas. But for folks in the city relying on public transportation, going a mile or two is a bigger deal.

The model then would be similar to the work I did with Gio on gunshot death risk (Circo & Wheeler, 2020), although I imagine the model would spatially vary (so maybe geographically weighted regression may work out well).

New book: Micro geographic analysis of Chicago homicides, 1965-2017

In joint work with Chris Herrmann and Dick Block, we now have a book out – Understanding Micro-Place Homicide Patterns in Chicago (1965 – 2017). It is a Springer Brief, so for anyone who has a journal article that is too long, this is a potential venue for the work. (Really this is like the length of three journal articles.)

A few things occurred to prompt me to look into this. First, Chicago saw a big spike of homicides in 2016 and 2017. Here is a graph breaking them down between domestic related homicides and all other homicides. You can see all of the volatility is related to non-domestic homicides.

So this (at least to me) begs the question of whether those spiked homicides show similar characteristics to historical homicides. Here we focus on long term spatial patterns in micro place grid cells across the city, 150 by 150 meters. Dick & Carolyn Block had collated data, including the address where the body was discovered, using detective case notes starting in 1965 (ending in 2000). The data from 2000 through 2017 is the public incident report data released by Chicago PD online. Although Dick and Carolyn’s public dataset is likely well known at this point, Dick has more detailed data than is released publicly on ICPSR and a few more years (through 2000). Here is a map showing those homicide patterns aggregated over the entire long time period.

So we really have two different broad exploratory analyses we employed in the work. One was to examine homicide clustering, and the other was to examine temporal patterns in homicides. For clustering, we go through a ton of different metrics common in the field, and I introduce even one more, Theil’s decomposition for within/between neighborhood clustering. This shows Theil’s clustering metric within neighborhoods in Chicago (based on the entire time period).

So areas around the Loop showed more clustering in homicides, but here it appears it is somewhat confounded with neighborhood size – smaller neighborhoods appear to have more clustering. This is sort of par for the course for these clustering metrics (we go through several different Gini variants as well), in that they are pretty fickle. If you do a different temporal slice of the data or treat empty grid cells differently, the clustering metrics can change quite a bit.

So I personally prefer to focus on long term temporal patterns. Here I estimated group based trajectory models using zero-inflated Poisson models. And here are the predicted outputs for those grid cells over the city. You can see that, unlike prior work by David Weisburd (Seattle), myself (Albany), or Martin Andresen (Vancouver), the trajectories are much more wavy. This may be due to looking over a much longer time horizon than any of those prior works though.

The big wave, Group 9, ends up being clearly tied to former large public housing projects, whose demolitions correspond to the downturn.

I have an interactive map to explore the other trajectory groups here. Unfortunately the others don’t show patterns as clear as Group 9, so it is difficult to answer any hard questions about the uptick in 2016/2017; you could find evidence of homicides dispersing vs. homicides being in the same places at a higher intensity depending on how you slice the data.

Unfortunately the analysis is never ending. Chicago homicides have again spiked this year, so maybe we will need to redo some analysis to see if the more current trends still hold. I think I will migrate away from the clustering metrics though (Gini and Theil); they appear to be too volatile to say much of anything over short term patterns. I think there may be other point pattern analyses that are more diagnostic for really understanding emerging/changing spatial patterns.

The coffee next to the cover image is Chris Herrmann’s beans, so go get yourself some as well at Fellowship Coffee!

Mapping attitudes paper published

My paper (joint work with Jasmine Silver, Rob Worden, and Sarah McLean), Mapping attitudes towards the police at micro places, has been published in the most recent issue of the Journal of Quantitative Criminology. Here is the abstract:

Objectives: We examine satisfaction with the police at micro places using data from citizen surveys conducted in 2001, 2009 and 2014 in one city. We illustrate the utility of this approach by comparing micro- and meso-level aggregations of policing attitudes, as well as by predicting views about the police from crime data at micro places.

Methods: In each survey, respondents provided the nearest intersection to their address. Using that geocoded survey data, we use inverse distance weighting to map a smooth surface of satisfaction with police over the entire city and compare the micro-level pattern of policing attitudes to survey data aggregated to the census tract. We also use spatial and multi-level regression models to estimate the effect of local violent crimes on attitudes towards police, controlling for other individual and neighborhood level characteristics.

Results: We demonstrate that there are no systematic biases for respondents refusing to answer the nearest intersection question. We show that hot spots of dissatisfaction with police do not conform to census tract boundaries, but rather align closely with hot spots of crime. Models predicting satisfaction with police show that local counts of violent crime are a strong predictor of attitudes towards police, even above individual level predictors of race and age.

Conclusions: Asking survey respondents to provide the nearest intersection to where they live is a simple approach to mapping attitudes towards police at micro places. This approach provides advantages beyond those of using traditional neighborhood boundaries. Specifically, it provides more precise locations police may target interventions, as well as illuminates an important predictor (i.e., nearby violent crimes) of policing attitudes.

And this was one of my favorite papers for making maps. We show how to take surveys and create analogs of hot spot maps of negative sentiment towards police. We do this by asking individuals to list their closest intersection (to still give some anonymity), and then creating inverse distance weighted maps of negative attitudes towards police.
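For those curious what inverse distance weighting boils down to, here is a tiny numpy sketch with fake data (the paper itself used different software and parameters):

# Tiny IDW sketch with fake data, just to show the technique
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(0, 10, size=(50, 2))  # hypothetical intersection locations
vals = rng.uniform(0, 1, size=50)       # hypothetical (dis)satisfaction scores

def idw(grid_xy, pts, vals, power=2):
    # distance from every grid cell to every survey point
    d = np.linalg.norm(grid_xy[:, None, :] - pts[None, :, :], axis=2)
    w = 1.0/np.maximum(d, 1e-6)**power
    return (w*vals).sum(axis=1)/w.sum(axis=1)

gx, gy = np.meshgrid(np.linspace(0, 10, 100), np.linspace(0, 10, 100))
grid = np.column_stack([gx.ravel(), gy.ravel()])
surface = idw(grid, pts, vals).reshape(gx.shape)  # the smooth surface you could then map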

We also find in this work that nearby crimes are the biggest factor in predicting negative sentiment towards police. This hints that past results aggregating attitudes to neighborhoods are inappropriate, and that police reducing crime is likely to have the best margin in terms of making people happier with the police in general.

As always, feel free to reach out for a copy of the paper if you cannot access JQC. (Or you could go and view the pre-print.)

Notes on making Leaflet maps in R

The other day I wrote a blog post for crimrxiv about posting interactive graphics on their pre-print sharing service. I figured it would be good to share my notes on making interactive maps, and to date I’ve mostly created these using the R leaflet library.

The reason I like these interactive maps is they allow you to zoom in and look at hot spots of crime. With the slippy base maps you can then see, oh OK this hot spot is by a train station, or an apartment complex, etc. It also allows you to check out specific data labels via pop-ups as I will show.

I’m using data from my paper on creating cost of crime weighted hot spots in Dallas (that will be forthcoming in Police Quarterly soonish). But I have posted a more direct set of replicating code for the blog post here.

R Code

So first for the R libraries I am using, I also change the working directory to where I have my data located on my Windows machine.

##########################################################
#This code creates a nice leaflet map of my DBSCAN areas

library(rgdal)       #read in shapefiles
library(sp)          #spatial objects
library(leaflet)     #for creating interactive maps
library(htmlwidgets) #for exporting interactive maps

#will need to change baseLoc if replicating on your machine
baseLoc <- "D:\\Dropbox\\Dropbox\\Documents\\BLOG\\leaflet_R_examples\\Analysis"
setwd(baseLoc)
##########################################################

Second, I read in my shapefiles using the rgdal library. This is important, as it includes the projection information. To plot the spatial objects on a slippy map they need to be in WGS84 latitude/longitude (leaflet then handles displaying them in the Web Mercator projection). As another trick I like with these basemaps, for the outlined area (the Dallas boundary here), it is easier to plot it as a line spatial object, as opposed to an empty filled polygon. You don’t need to worry about the order of the layers as much that way.

##########################################################
#Get the boundary data and DBSCAN data
boundary <- readOGR(dsn="Dallas_MainArea_Proj.shp",layer="Dallas_MainArea_Proj")
dbscan_areas <- readOGR(dsn="db_scan.shp",layer="db_scan")

#Now convert to WGS
DalLatLon <- spTransform(boundary,CRS("+init=epsg:4326"))
DallLine <- as(DalLatLon, 'SpatialLines') #Leaflet useful for boundaries to be lines instead of areas
dbscan_LatLon <- spTransform(dbscan_areas,CRS("+init=epsg:4326") )

#Quick and Dirty plot to check projections are OK
plot(DallLine)
plot(dbscan_LatLon,add=TRUE,col='blue')
##########################################################

Next part, I have a custom function I have made to make pop-up labels for these leaflet maps. First I need to read in a table with the data info for the hot spot areas and merge that into the spatial object. Then the way my custom function works is I pass it the dataset, then I have arguments for the variables I want, and the way I want them labeled. The function does the work of making the labels bolded and putting in line breaks into the HTML. (No doubt others have created nice libraries to do HTML tables/graphs inside the pop-ups that I am unaware of.) If you check out the final print statement, it shows the HTML it built for one of the labels, <strong>ID: </strong>1<br><strong>$ (Thousands): </strong>116.9<br><strong>PAI: </strong>10.3<br><strong>Street Length (Miles): </strong>0.4

##########################################################
#Function for labels

#read in data
crime_stats <- read.csv('ClusterStats_wlen.csv', stringsAsFactors=FALSE)
dbscan_stats <- crime_stats[crime_stats$type == 'DBSCAN',]
dbscan_stats$clus_id <- as.numeric(dbscan_stats$AreaStr) #because factors=False!

#merge into the dbscan areas
dbscan_LL <- merge(dbscan_LatLon,dbscan_stats)

LabFunct <- function(data,vars,labs){
  n <- length(labs)
  add_lab <- paste0("<strong>",labs[1],"</strong>",data[,vars[1]])
  for (i in 2:n){
    add_lab <- paste0(add_lab,"<br><strong>",labs[i],"</strong>",data[,vars[i]])
  }
  return(add_lab)
}

#create labels
vs <- c('AreaStr', 'val_th', 'PAI_valth_len', 'LenMile')
#Lazy, so just going to round these values
for (v in vs[-1]){
  dbscan_LL@data[,v] <- round(dbscan_LL@data[,v],1)
}  
lb <- c('ID: ','$ (Thousands): ','PAI: ','Street Length (Miles): ')
diss_lab <- LabFunct(dbscan_LL@data, vs, lb)

print(diss_lab[1]) #showing off just one
##########################################################

Now finally onto the hotspot map. This is a bit to chew over, so I will go through bit-by-bit.

##########################################################
HotSpotMap <- leaflet() %>%
  addProviderTiles(providers$OpenStreetMap, group = "Open Street Map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "CartoDB Lite") %>%
  addPolylines(data=DallLine, color='black', weight=4, group="Dallas Boundary") %>%
  addPolygons(data=dbscan_LL,color = "blue", weight = 2, opacity = 1.0, 
              fillOpacity = 0.5, group="DBSCAN Areas",popup=diss_lab, 
              highlight = highlightOptions(weight = 5,bringToFront = TRUE)) %>%
  addLayersControl(baseGroups = c("Open Street Map","CartoDB Lite"),
                   overlayGroups = c("Dallas Boundary","DBSCAN Areas"),
                   options = layersControlOptions(collapsed = FALSE))  %>%
  addScaleBar(position = "bottomleft", options = scaleBarOptions(maxWidth = 100, 
              imperial = TRUE, updateWhenIdle = TRUE))
                      
HotSpotMap #this lets you view interactively

#or save to a HTML file to embed in webpage
saveWidget(HotSpotMap,"HotSpotMap.html", selfcontained = TRUE)
##########################################################

First I create the empty leaflet() object. Because I am superimposing multiple spatial layers, I don’t worry about setting the default spatial layer. Second, I add in two basemap providers, OpenStreetMap and the grey scale CartoDB positron. Positron is better IMO for visualizing global data patterns, but the open street map is better for when you zoom in and want to see exactly what is around a hot spot area. Note when adding in a layer, I give it a group name. This allows you to later toggle which provider you want via a basegroup in the layers control.

Next I add in the two spatial layers, the Dallas Boundary lines and then the hot spots. For the DBSCAN hot spots, I include a pop-up diss_lab for the dbscan hot spot layer. This allows you to click on the polygon, and you get the info I stuffed into that label vector earlier. The HTML is to make it print nicely.

Finally then I add in a layers control, so you can toggle layers on/off. Basegroups mean that only one of the options can be selected; it doesn’t make sense to have multiple basemaps selected. Overlay layers you can toggle on/off as needed. Here the overlay doesn’t matter much due to the nature of the map, but if you have many layers (e.g. a hot spot map and a choropleth map of demographics), being able to toggle the layers on/off helps a bit more.

Then as a final touch I add in a scale bar (that automatically updates depending on the zoom level). These aren’t my favorite with slippy maps, as I’m not even 100% sure what location the scale bar refers to offhand (the center of the map? Or literally where the scale bar is located?) But when zoomed into smaller areas like a city I guess it is not misleading.

Here is a screenshot of this created map zoomed out to the whole city using the Positron grey scale base map. So it is tough to visualize the distribution of hot spots from this. If I wanted to do that in a static map I would likely just plot the hot spot centroids, and then make the circles bigger for areas that capture more crime.

But since we can zoom in, here is another screenshot zoomed in using the OpenStreetMap basemap, and also illustrating what my pop-up labels look like.

I’m too lazy to post this exact map, but it is very similar to one I posted for my actual hot spots paper if you want to check it out directly. I host it on GitHub for free.

Here I did not show how to make a choropleth map, but Jacob Kaplan in his R book has a nice example of that. And in the future I will have to update this to show how to do the same thing in python using the Folium library. I used Folium in this blog post if you want to dig into an example though for now.
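As a teaser for that future post, a rough folium equivalent of the map above might look like this (the shapefile and label field here are hypothetical stand-ins, not the exact ones from my repo):

# Rough folium sketch of the same type of hot spot map
import folium
import geopandas as gpd

hot = gpd.read_file('db_scan.shp').to_crs('EPSG:4326')
hot['lab'] = ['ID: ' + str(i) for i in hot.index]  # stand-in for the fancier popup labels

m = folium.Map(location=[32.78, -96.80], zoom_start=11, tiles='CartoDB positron')
folium.GeoJson(hot, name='DBSCAN Areas',
               tooltip=folium.GeoJsonTooltip(fields=['lab'])).add_to(m)
folium.LayerControl(collapsed=False).add_to(m)
m.save('HotSpotMap.html')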

Some more examples

For some other examples of what is possible in Leaflet maps in R, here are some examples I made for my undergrad Communities and Crime class. I had students submit prediction assignments (e.g. predict the neighborhood with the most crime in Dallas, predict the street segment in Oak Cliff with the most violent crime, predict the bar with the most crimes nearby, etc.) I would then show the class the results, as well as where other students predicted. So here are some screen shots of those maps.

Choropleth

Graduated Points

Street Segment Viz