So previously I wrote about creating service deserts. I have since found a nicer data source to use for this, the NPI CMS registry, a data file with over 6 million service providers across the US. Here I will provide an example of using that data to geocode all the pharmacies in Texas, again using the census geocoding API and python.
Chunking up the NPI database
So first, you can again download the entire NPI database from here. I have already downloaded and unzipped that file, which contains a CSV for the January version, named npidata_pfile_20050523-20210110.csv. As some upfront work, here are the libraries I will be using, and I also set the directory to where my data is located.
###############################
import pandas as pd
import numpy as np
import censusgeocode as cg
import time
from datetime import datetime
import os
os.chdir(r'D:\HospitalData\NPPES_Data_Dissemination_January_2021')
###############################
The file is just a bit too big for me to fit in memory on my machine. On Windows, you can use Get-Content npidata_pfile_20050523-20210110.csv | Measure-Object -Line in PowerShell to get the line count, or on Unix use wc -l *.csv for example. So I know the file is not quite 6.7 million rows.
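If you would rather stay in python, a rough line count without loading the whole file into memory is easy enough as well (just a sketch that does the same thing as the shell commands above):
######################################################################
# A rough line count in python without reading the file into memory
# (does the same thing as the shell commands above)
def count_lines(path):
    n = 0
    with open(path, errors='ignore') as f:
        for _ in f:
            n += 1
    return n

#count_lines('npidata_pfile_20050523-20210110.csv') # just under 6.7 million
######################################################################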
So what I do here is create a function to read in the csv file in chunks, only select the columns and rows that I want, and then return that data frame. In the end, you need to search across all of the taxonomy code columns to pull out the type of service provider you want. For community pharmacies the code is 3336C0003X, but it is not always in the first taxonomy slot (some CVS's have it in the second slot for example). You can see the big list of taxonomy codes here, so my criminology friends may be interested in mental health or substance abuse service providers for other examples.
In addition to the taxonomy code, I always select organizations, not individuals (Entity Type Code = 2). And then I only select out pharmacies in Texas (although I bet you could fit all of the US pharmacies in memory pretty easily, maybe 60k in total?). Caveat emptor, I am not 100% sure how to use the deactivation codes properly in this database, as that data is always NaN for Texas pharmacies.
######################################################################
# Prepping the input data in chunks
keep_col = ['NPI','Entity Type Code','Provider Organization Name (Legal Business Name)',
            'NPI Deactivation Reason Code','NPI Deactivation Date','NPI Reactivation Date',
            'Provider First Line Business Practice Location Address',
            'Provider Business Practice Location Address City Name',
            'Provider Business Practice Location Address State Name',
            'Provider Business Practice Location Address Postal Code']
taxon_codes = ['Healthcare Provider Taxonomy Code_' + str(i+1) for i in range(15)]
keep_col += taxon_codes
community_pharm = '3336C0003X'
npi_csv = 'npidata_pfile_20050523-20210110.csv' #Newer files will prob change the name

# This defines the rows I want
def sub_rows(data):
    ec = data['Entity Type Code'] == "2"
    st = data['Provider Business Practice Location Address State Name'] == 'TX'
    ta = (data[taxon_codes] == community_pharm).any(axis=1)
    #ac = data['NPI Deactivation Reason Code'].isna()
    all_together = ec & st & ta #& ac
    sub = data[all_together]
    return sub

def csv_chunks(file, chunk_size, keep_cols, row_sub):
    # First lets get the header and figure out the column indices
    header_fields = list(pd.read_csv(file, nrows=1))
    header_locs = [header_fields.index(i) for i in keep_cols]
    # Now reading in a chunk of data
    skip = 1
    it_n = 0
    sub_n = 0
    ret_chunk = chunk_size
    fin_li_dat = []
    while ret_chunk == chunk_size:
        file_chunk = pd.read_csv(file, usecols=header_locs, skiprows=skip,
                                 nrows=chunk_size, names=header_fields, dtype='str')
        sub_dat = row_sub(file_chunk)
        fin_li_dat.append( sub_dat.copy() )
        skip += chunk_size
        it_n += 1
        sub_n += sub_dat.shape[0]
        print(f'Grabbed iter {it_n} total sub n so far {sub_n}')
        ret_chunk = file_chunk.shape[0]
    fin_dat = pd.concat(fin_li_dat, axis=0)
    return fin_dat
# Takes about 3 minutes
print( datetime.now() )
pharm_tx = csv_chunks(npi_csv, chunk_size=1000000, keep_cols=keep_col, row_sub=sub_rows)
print( datetime.now() )
# No deactivated codes in all of Texas
print( pharm_tx['NPI Deactivation Reason Code'].value_counts() )
######################################################################
So this ends up returning not quite 6800 pharmacies in all of Texas.
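If you want a quick sanity check on what came back, printing the row count and the most common business names works fine:
######################################################################
# Quick sanity checks on the subsetted data
print( pharm_tx.shape )  # a bit under 6,800 rows
print( pharm_tx['Provider Organization Name (Legal Business Name)'].value_counts().head(10) )
######################################################################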
Geocoding using the census API
So first, the address data is pretty well formatted. But for those new to geocoding, if you have end parts of address strings like Apt 21 or Suite C, those endings will typically throw geocoders off the mark. So in just a few minutes, I noted the different strings that marked the parts of the addresses I should chop off, and wrote a function to clean those up. Besides that I just limit the zip code to 5 digits, as that field is a mix of 5 and 9 digit zipcodes.
######################################################################
# Now prepping the data for geocoding
ph_tx = pharm_tx.drop(columns=taxon_codes).reset_index(drop=True)
#['Provider First Line Business Practice Location Address', 'Provider Business Practice Location Address City Name', 'Provider Business Practice Location Address State Name', 'Provider Business Practice Location Address Postal Code']
# I just looked through the files and saw that after these strings are not needed
end_str = [' STE', ' SUITE', ' BLDG', ' TOWER', ', #', ' UNIT',
           ' APT', ' BUILDING', ',', '#']

def clean_add(address):
    add_new = address.upper()
    for su in end_str:
        sf = add_new.find(su)
        if sf > -1:
            add_new = add_new[0:sf]
    add_new = add_new.replace('.','')
    add_new = add_new.strip()
    return add_new
# Some examples
clean_add('5700 S GESSNER DR STE G')
clean_add('10701-B WEST BELFORT SUITE 170')
clean_add('100 EAST UNIVERSITY BLVD.')
clean_add('5800 BELLAIRE BLVD BLDG 1')
clean_add('2434 N I-35 # S')
ph_tx['Zip5'] = ph_tx['Provider Business Practice Location Address Postal Code'].str[0:5]
ph_tx['Address'] = ph_tx['Provider First Line Business Practice Location Address'].apply(clean_add)
ph_tx.rename(columns={'Provider Business Practice Location Address City Name':'City',
                      'Provider Business Practice Location Address State Name':'State2'},
             inplace=True)
######################################################################
Next is my function to use the batch geocoding in the census API. Note the census API is a bit finicky – technically it says you can do batches of up to 5k rows, but I tend to get kicked off for higher values. So here I have a function that chunks the data up into tinier batch portions and submits them to the API. (A better function would cache intermediate results and wrap all that jazz in a try/except block – see the sketch after the function below.)
######################################################################
#This function breaks up the input data frame into chunks
#For the census geocoding api
def split_geo(df, add, city, state, zipcode, chunk_size=500):
    df_new = df.copy()
    df_new.reset_index(inplace=True)
    splits = np.ceil( df.shape[0]/chunk_size )
    chunk_li = np.array_split(df_new['index'], splits)
    res_li = []
    pick_fi = []
    for i,c in enumerate(chunk_li):
        # Grab data, export to csv
        sub_data = df_new.loc[c, ['index',add,city,state,zipcode]]
        sub_data.to_csv('temp_geo.csv', header=False, index=False)
        # Geo the results and turn back into df
        print(f'Geocoding round {int(i)+1} of {int(splits)}, {datetime.now()}')
        result = cg.addressbatch('temp_geo.csv') #should try/except?
        # May want to dump the intermediate results
        #pi_str = f'pickres_{int(i)}.p'
        #pickle.dump( result, open( pi_str, "wb" ) )
        #pick_fi.append(pi_str)
        names = list(result[0].keys())
        res_zl = []
        for r in result:
            res_zl.append( list(r.values()) )
        res_df = pd.DataFrame(res_zl, columns=names)
        res_li.append( res_df.copy() )
        time.sleep(10) #sleep 10 seconds to not get cutoff from request
    final_df = pd.concat(res_li)
    final_df.rename(columns={'id':'row'}, inplace=True)
    final_df.reset_index(inplace=True, drop=True)
    # Clean up csv file
    os.remove('temp_geo.csv')
    return final_df
######################################################################
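For example, a more defensive version of that API call could look something like the sketch below – it wraps cg.addressbatch in try/except, retries a few times, and pickles each chunk as it finishes, so a dropped connection does not lose prior work. (The geo_chunk_safe name and the pickle file names are just made up for illustration.)
######################################################################
# A sketch of the more defensive approach mentioned above, wrapping the
# census API call in try/except and caching each chunk to disk.
# The function name and pickle file names here are hypothetical.
import pickle

def geo_chunk_safe(chunk_id, csv_path='temp_geo.csv', retries=3):
    for attempt in range(retries):
        try:
            result = cg.addressbatch(csv_path)
            # cache the raw result so a later crash does not lose this chunk
            pickle.dump( result, open(f'pickres_{chunk_id}.p', 'wb') )
            return result
        except Exception as e:
            print(f'Chunk {chunk_id} attempt {attempt+1} failed: {e}')
            time.sleep(30) # back off before retrying
    return [] # give up on this chunk after the retries
######################################################################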
And now we are onto the final stage, actually running the geocoding function, and piping the end results to a csv file. (You can see the final version here.)
######################################################################
# Geocoding the data in chunks
# Takes around 35 minutes
geo_pharm = split_geo(ph_tx, add='Address', city='City', state='State2', zipcode='Zip5', chunk_size=100)
#What is the geocoding hit rate?
print( geo_pharm['match'].value_counts() )
# Only around 65%
# Now merging back with the original data if you want
# Not quite sorted how I need them
geo_pharm['rowN'] = geo_pharm['row'].astype(int)
gp2 = geo_pharm.sort_values(by='rowN').reset_index(drop=True)
# Fields I want
kg = ['address','match','lat','lon']
kd = ['NPI',
'Provider Organization Name (Legal Business Name)',
'Provider First Line Business Practice Location Address',
'Address','City','State2','Zip5']
final_pharm = pd.concat( [ph_tx[kd], gp2[kg]], axis=1 )
final_pharm.to_csv('Pharmacies_Texas.csv',index=False)
######################################################################
Unfortunately the geocoding hit rate is pretty disappointing, only around 65% in this sample. So if I were using this for a project, I would likely do a round of geocoding using the Google API (which is a bit more forgiving with varied addresses), or perhaps build my own OpenStreetMap geocoder for the US. (Or in general, if you don't have too many to review, doing it interactively in ArcGIS is very nice as well if you have access to Arc.)
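If you do go the second-geocoder route, a simple first step is to pull out the non-matched rows and export them for another pass (just a sketch, assuming the match field comes back as a boolean like it does from censusgeocode; the output file name is arbitrary):
######################################################################
# Pull out the addresses that did not geocode for a second pass through
# another geocoder (assumes the match column is boolean)
no_match = final_pharm[~final_pharm['match']].copy()
no_match.to_csv('Pharmacies_Texas_NoMatch.csv', index=False)
print( no_match.shape[0] )
######################################################################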