I’ve written prior blog posts about downloading Five Year American Community Survey data estimates (ACS for short) for small area geographies, but one of the main hiccups is figuring out what variables you want to use. The census has so many variables that are just small iterations of one another (e.g. Males under 5, males 5 to 9, males 10 to 14, etc.) that it is quite a chore to specify the ones you want. Often you want combinations of variables or to calculate percentages as well, so you need to take two or more variables and turn them into your constructed variable.
I have posted some notes on the variables I have used for past projects in an excel spreadsheet. This includes the original variables, as well as some notes for creating percentage variables. Some are tricky — such as figuring out the proportion of black residents for block groups you need to add non-Hispanic black and Hispanic black estimates (and then divide by the total population). For spatially oriented criminologists these are basically indicators commonly used for social disorganization. It also includes notes on what is available at the smaller block group level, as not all of the variables are. So you are more limited in your choices if you want that small of area.
Let me know if you have been using other variables for your work. I’m not an expert on these variables by any stretch, so don’t take my list as authoritative in any way. For example I have no idea whether it is valid to use the imputed data for moving in the prior year at the block group level. (In general I have not incorporated the estimates of uncertainty for any of the variables into my analyses, not sure of the additional implications for the imputed data tables.) Also I have not incorporated variables that could be used for income-inequality or for ethnic heterogeneity (besides using white/black/Hispanic to calculate the index). I’m sure there are other social disorganization relevant variables at the block group level folks may be interested in as well. So let me know in the comments or shoot me an email if you have suggestions to update my list.
I would prefer if as a field we could create a set of standardized indices so we are not all using different variables (see for example this Jeremy Miles paper). It is a bit hodge-podge though what variables folks use from study-to-study, and most folks don’t report the original variables so it is hard to replicate their work exactly. British folks have their index of deprivation, and it would be nice to have a similarly standardized measure to use in social science research for the states.
The ACS data has consistent variable names over the years, such as B03001_001
is the total population, B03002_003
is the Non-Hispanic white population, etc. Unfortunately those variables are not necessarily in the same tables from year to year, so concatenating ACS results over multiple years is a bit of a pain. Below I post a python script that given a directory of the excel template files will produce a nice set of dictionaries to help find what table particular variables are in.
#This python code grabs ACS meta-data templates
#To easier search for tables that have particular variables
import xlrd, os
mydir = r'!!!Insert your path to the excel files here!!!!!'
def acs_vars(directory):
#get the excel files in the directory
excel_files = []
for file in os.listdir(directory):
if file.endswith(".xls"):
excel_files.append( os.path.join(directory, file) )
#getting the variables in a nice dictionaries
lab_dict = {}
loc_dict = {}
for file in excel_files:
book = xlrd.open_workbook(file) #first open the xls workbook
sh = book.sheet_by_index(0)
vars = [i.value for i in sh.row(0)] #names on the first row
labs = [i.value for i in sh.row(1)] #labels on the second
#now add to the overall dictionary
for v,l in zip(vars,labs):
lab_dict[v] = l
loc_dict[v] = file
#returning the two dictionaries
return lab_dict,loc_dict
labels,tables = acs_vars(mydir)
#now if you have a list of variables you want, you can figure out the table
interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
'B11003_016','B14006_002','B01001_003','B23025_005','B22010_002',
'B16002_004']
for i in interest:
head, tail = os.path.split(tables[i])
print (i,labels[i],tail)
YD
/ August 24, 2018I am always following your post even though I don’t always understand or connect myself to all of them. Sometimes I think your posting is a detailed advice on the questions that I asked you in person. I appreciate that no matter if you meant it or not.
apwheele
/ August 24, 2018Yes you were the most recent student to ask, but I have given the same advice to others. It is helpful for myself even to make a post (instead of mucking through various files to figure out the variables I have used in the past).
Basically it is easier to search my blog than all of my project files. I see I need to add one more in my list, I used % renters for a recent paper at the block group level instead of % moved in the prior year.