Geocoding the CMS NPI Registry (python)

So previously I wrote out creating service deserts. I have since found a nicer data source to use for this, the NPI CMS registry. This data file has over 6 million service providers across the US.

Here I will provide an example of using that data to geocode all the pharmacy’s in Texas, again using the census geocoding API and python.

Chunking up the NPI database

So first, you can again download the entire NPI database from here. So I have already downloaded and unzipped that file, which contains a CSV for the January version, named npidata_pfile_20050523-20210110.csv. So as some upfront, here are the libraries I will be using, and I also set the directory to where my data is located.

###############################
import pandas as pd
import numpy as np
import censusgeocode as cg
import time
from datetime import datetime
import os
os.chdir(r'D:\HospitalData\NPPES_Data_Dissemination_January_2021')
###############################

The file is just a bit too big for me to fit in memory on my machine. On Windows, you can use Get-Content npidata_pfile_20050523-20210110.csv | Measure-Object -Line in powershell to get the line counts, or on Unix use wc -l *.csv for example. So I know the file is not quite 6.7 million rows.

So what I do here is create a function to read in the csv file in chunks, only select the columns and rows that I want, and then return that data frame. In the end, you need to search across all of the Taxonomy codes to pull out the type of service provider you want. So for community pharmacies, the code is 3336C0003X, but it is not always in the first Taxonomy slot (some CVS’s have it in the second slot for example). You can see the big list of taxonomy codes here, so my criminology friends may say be interested in mental health or substance abuse service providers for other examples.

In addition to the taxonomy code, I always select organizations, not individuals (Entity Type = 2). And then I only select out pharmacies in Texas (although I bet you could fit all of the US pharmacies in memory pretty easily, maybe 60k in total?) Caveat emptor, I am not 100% sure how to use the deactivation codes properly in this database, as that data is always NaN for Texas pharmacies.

######################################################################
# Prepping the input data in chunks

keep_col = ['NPI','Entity Type Code','Provider Organization Name (Legal Business Name)',
            'NPI Deactivation Reason Code','NPI Deactivation Date','NPI Reactivation Date',
            'Provider First Line Business Practice Location Address',
            'Provider Business Practice Location Address City Name',
            'Provider Business Practice Location Address State Name',
            'Provider Business Practice Location Address Postal Code']
            
taxon_codes = ['Healthcare Provider Taxonomy Code_' + str(i+1) for i in range(15)]
keep_col += taxon_codes
community_pharm = '3336C0003X'
npi_csv = 'npidata_pfile_20050523-20210110.csv' #Newer files will prob change the name

# This defines the rows I want
def sub_rows(data):
    ec = data['Entity Type Code'] == "2"
    st = data['Provider Business Practice Location Address State Name'] == 'TX'
    ta = (data[taxon_codes] == community_pharm).any(axis=1)
    #ac = data['NPI Deactivation Reason Code'].isna()
    all_together = ec & st & ta #& ac
    sub = data[all_together]
    return sub

def csv_chunks(file,chunk_size,keep_cols,row_sub):
    # First lets get the header and figure out the column indices
    header_fields = list(pd.read_csv(npi_csv, nrows=1))
    header_locs = [header_fields.index(i) for i in keep_cols]
    # Now reading in a chunk of data
    skip = 1
    it_n = 0
    sub_n = 0
    ret_chunk = chunk_size
    fin_li_dat = []
    while ret_chunk == chunk_size:
        file_chunk = pd.read_csv(file, usecols=header_locs, skiprows=skip, 
                     nrows=chunk_size, names=header_fields, dtype='str')
        sub_dat = row_sub(file_chunk)
        fin_li_dat.append( sub_dat.copy() )
        skip += chunk_size
        it_n += 1
        sub_n += sub_dat.shape[0]
        print(f'Grabbed iter {it_n} total sub n so far {sub_n}')
        ret_chunk = file_chunk.shape[0]
    fin_dat = pd.concat(fin_li_dat, axis=0)
    return fin_dat


# Takes about 3 minutes
print( datetime.now() )
pharm_tx = csv_chunks(npi_csv, chunk_size=1000000, keep_cols=keep_col, row_sub=sub_rows)
print( datetime.now() )

# No deactivated codes in all of Texas
print( pharm_tx['NPI Deactivation Reason Code'].value_counts() )
######################################################################

So this ends up returning not quite 6800 pharmacies in all of Texas.

Geocoding using the census API

So first, the address data is pretty well formatted. But for those new to geocoding, if you have end parts of address strings like Apt 21 or Suite C, those endings will typically throw geocoders off the mark. So in just a few minutes, I noted the different strings that marked the parts of the addresses I should chop off, and wrote a function to clean those up. Besides that I just limit the zip code to 5 digits, as that field is a mix of 5 and 9 digit zipcodes.

######################################################################
# Now prepping the data for geocoding

ph_tx = pharm_tx.drop(columns=taxon_codes).reset_index(drop=True)

#['Provider First Line Business Practice Location Address', 'Provider Business Practice Location Address City Name', 'Provider Business Practice Location Address State Name', 'Provider Business Practice Location Address Postal Code']

# I just looked through the files and saw that after these strings are not needed
end_str = [' STE', ' SUITE', ' BLDG', ' TOWER', ', #', ' UNIT',
           ' APT', ' BUILDING',',', '#']

 
def clean_add(address):
    add_new = address.upper()
    for su in end_str:
        sf = address.find(su)
        if sf > -1:
            add_new = add_new[0:sf]
    add_new = add_new.replace('.','')
    add_new = add_new.strip()
    return add_new

# Some examples
clean_add('5700 S GESSNER DR STE G')
clean_add('10701-B WEST BELFORT SUITE 170')
clean_add('100 EAST UNIVERSITY BLVD.')
clean_add('5800 BELLAIRE BLVD BLDG 1')
clean_add('2434 N I-35 # S')

ph_tx['Zip5'] = ph_tx['Provider Business Practice Location Address Postal Code'].str[0:5]
ph_tx['Address'] = ph_tx['Provider First Line Business Practice Location Address'].apply(clean_add)
ph_tx.rename(columns={'Provider Business Practice Location Address City Name':'City',
                      'Provider Business Practice Location Address State Name':'State2'},
             inplace=True)
######################################################################

Next is my function to use the batch geocoding in the census api. Note the census api is a bit finicky – technically the census api says you can do batches of up to 5k rows, but I tend to get kicked off for higher values. So here I have a function that chunks it up into tinier batch portions and submits to the API. (A better function would cache intermediate results and wrap all that jazz in a try function.)

 ######################################################################
 #This function breaks up the input data frame into chunks
 #For the census geocoding api
 def split_geo(df, add, city, state, zipcode, chunk_size=500):
     df_new = df.copy()
     df_new.reset_index(inplace=True)
     splits = np.ceil( df.shape[0]/chunk_size)
     chunk_li = np.array_split(df_new['index'], splits)
     res_li = []
     pick_fi = []
     for i,c in enumerate(chunk_li):
         # Grab data, export to csv
         sub_data = df_new.loc[c, ['index',add,city,state,zipcode]]
         sub_data.to_csv('temp_geo.csv',header=False,index=False)
         # Geo the results and turn back into df
         print(f'Geocoding round {int(i)+1} of {int(splits)}, {datetime.now()}')
         result = cg.addressbatch('temp_geo.csv') #should try/except?
         # May want to dump the intermediate results
         #pi_str = f'pickres_{int(i)}.p'
         #pickle.dump( favorite_color, open( pi_str, "wb" ) )
         #pick_fi.append(pi_str.copy())
         names = list(result[0].keys())
         res_zl = []
         for r in result:
             res_zl.append( list(r.values()) )
         res_df = pd.DataFrame(res_zl, columns=names)
         res_li.append( res_df.copy() )
         time.sleep(10) #sleep 10 seconds to not get cutoff from request
     final_df = pd.concat(res_li)
     final_df.rename(columns={'id':'row'}, inplace=True)
     final_df.reset_index(inplace=True, drop=True)
     # Clean up csv file
     os.remove('temp_geo.csv')
     return final_df
 ######################################################################

And now we are onto the final stage, actually running the geocoding function, and piping the end results to a csv file. (Which you can see the final version here.)

######################################################################
# Geocoding the data in chunks

# Takes around 35 minutes
geo_pharm = split_geo(ph_tx, add='Address', city='City', state='State2', zipcode='Zip5', chunk_size=100)

#What is the geocoding hit rate?
print( geo_pharm['match'].value_counts() )
# Only around 65%

# Now merging back with the original data if you want
# Not quite sorted how I need them
geo_pharm['rowN'] = geo_pharm['row'].astype(int)
gp2 = geo_pharm.sort_values(by='rowN').reset_index(drop=True)

# Fields I want
kg = ['address','match','lat','lon']
kd = ['NPI',
      'Provider Organization Name (Legal Business Name)',
      'Provider First Line Business Practice Location Address',
      'Address','City','State2','Zip5']

final_pharm = pd.concat( [ph_tx[kd], gp2[kg]], axis=1 )

final_pharm.to_csv('Pharmacies_Texas.csv',index=False)
######################################################################

Unfortunately the geocoding hit rate is pretty disappointing, only around 65% in this sample. So if I were using this for a project, I would likely do a round of geocoding using the Google API (which is a bit more unforgiving for varied addresses), or perhaps build my own openstreet map geocoder for the US. (Or in general if you don’t have too many to review, doing it interactively in ArcGIS is very nice as well if you have access to Arc.)

Buffers and hospital deserts with geopandas

Just a quick blog post today. As a bit of a side project at work I have been looking into medical service provider deserts. Most people simply use a geographic cutoff of say 1 mile (see Wissah et al., 2020 for example for Pharmacy deserts). Also for CJ folks, John Hipp has done some related work for parolees being nearby service providers (Hipp et al., 2009; 2011), measuring nearby as 2 miles.

So I wrote some code to calculate nice sequential buffer areas and dissolve them in geopandas. Files and code to showcase are here on GitHub. First, as an example dataset, I geocode (using the census geocoding API) CMS certified Home Healthcare facilities, so these are hospice facilities. To see a map of those facilities across the US, and you can click on the button to get info, go to here, CMS HOME FACILITY MAP. Below is a screenshot:

Next I then generate sequential buffers in kilometers of 2, 4, 8, 16, and then the leftover (just for Texas). So you can then zoom in and darker areas are at a higher risk of not having a hospice facility nearby. HOSPICE DESERT MAP

Plotting some of these in Folium were giving me fits, so I will need to familiarize myself with that more in the future. The buffers for the full US as well were giving me trouble (these just for Texas result in fairly large files, surprised Github doesn’t yell at me for them being too big).

Going forward, what I want to do is instead of relying on a fixed function of distance, is to fit a model to identify individuals probability of going to the doctor based on distance. So instead of just saying 1+ mile and you are at high risk, fit a function that defines that distance based on behavioral data (maybe using insurance claims). Also I think the distances matter quite a bit for urban/rural and car/no-car. So rural folks traveling a mile is not a big deal, since you need a car to really do anything in rural areas. But for folks in the city relying on public transportation going a mile or two is a bigger deal.

The model then would be similar to the work I did with Gio on gunshot death risk (Circo & Wheeler, 2020), although I imagine the model would spatially vary (so maybe geographically weighted regression may work out well).

References