I had a prior blog post on working with American Community Survey data in SPSS. The meta-data format has changed from that example though, and the Census gives out comma separated files and xls Templates now. So this will be an update, and I have good stuff for those working strictly in python, as well as those wanting to load the data is SPSS.
So first, when downloading the small geographies from the Census’s FTP site, they have a ton of files. See this page, which contains the 5 year estimates for 2014 for New York block groups and tracts. Now instead of downloading each zip file one by one, we can write a quick python script to download all the files.
import urllib, os
downFold = r'C:\Users\axw161530\Dropbox\Documents\BLOG\ACS_Python_SPSS\Data'
base = r'http://www2.census.gov/programs-surveys/acs/summary_file/2014/data/5_year_seq_by_state/NewYork/Tracts_Block_Groups_Only/'
for i in range(1,5): #change range(1,5) to range(1,122) to download all zip files
file = "20145ny0" + str(i).zfill(3) + "000.zip"
urllib.urlretrieve(base + file, os.path.join(downFold,file))
#also download the geography file
urllib.urlretrieve(base + "g20145ny.csv", os.path.join(downFold,"g20145ny.csv"))
The downFold
string is where the files will be downloaded to (so change that to a place on your local machine), and the base
string ends up being the base URL for that particular set of files. The files go from 1 to 121 in that example, but just to keep the time down I only download tables one through four. The second urlib.urlretrieve
line downloads the geography csv file (we won’t be using the other geography file, which is the same data but in tab delimited format).
Now we can go and download the meta data excel file shells. For this dataset they are located here. Here we want the 5 year templates. Once that data is downloaded, then unzip all of the files. You could technically do this in python as well, but I just use 7zip, as that has a handy dialog to unzip multiple files to the same place.
So the way the files work, there are a set of estimate and margin of error text files that are comma delimited that have the demographic characteristics. (Here for all block groups and census tracts in New York.) The xls excel files contain the variable names, along with a brief description for the variables.
If you are a hipster and only do data analysis in python, here is a function that takes the location to a xls template file and the corresponding data file and reads it into a pandas data frame.
#this reads in american community survey data
import xlrd
import pandas as pd
def readACS(Template,Data):
book = xlrd.open_workbook(Template) #first open the xls workbook
sh = book.sheet_by_index(0)
vars = [i.value for i in sh.row(0)] #names on the first row
labs = [i.value for i in sh.row(1)] #labels on the second
#this rewrites duplicate 'BLANK' names, mangle dups not working for me
n = 0
vars2 = []
for i in range(len(vars)):
if vars[i] == 'BLANK':
n += 1
vars2.append('BLANK.' + str(n))
else:
vars2.append(vars[i])
#check for if geo file or data file
if vars2[1] == 'FILETYPE':
df = pd.read_csv(Data,names=vars2,dtype={'FILETYPE':'object'})
else:
df = pd.read_csv(Data,names=vars2)
return df,zip(vars2,labs)
In a nutshell, it reads the metadata column names and labels from the excel spreadsheet, then reads in the csv file with the data. It returns two objects, the one on the left is a pandas dataframe, and the one on the right is a zipped up list of the variable names and the variable labels. This would be a bit simpler, except that the format for the geo dataset is a little different than all the data files and contains multiple “BLANK” fields (the mangle_dupe_cols
option in read_csv
is not behaving like I expect it to). For the non-geographic file, I need to tell python the filetype column is a string, else it interprets the “e” in the estimate files as a scientific number (e.g. 1e5 = 100,000).
So here is an example of using this function to grab the second table. When I unzipped the excel templates, it nested the data templates in another subfolder, hence the TemplateFold
string.
TemplateFold = downFold + r'\seq'
Tab002,Meta002 = readACS(TemplateFold + r'\Seq2.xls',downFold + r'\e20145ny0002000.txt')
If you want to check out all the variable labels, you can then do:
for i in Meta002:
print i
Or if you want to turn that into a dictionary you can simply do dict(Meta002)
. If you wanted to import all 121 tables and merge them you should be able to figure that out in a simple loop from here (note the “x.zfill(n)” function to pad the integers with leading zeroes). But I typically only work with a select few tables and variables at a time, so I won’t worry about that here.
The function works the same with the geographic data and its template. (Which that metadata template is not nested in the further down seq
folder.)
GeoDat,MetaGeo = readACS(downFold + r'\2014_SFGeoFileTemplate.xls',downFold + r'\g20145ny.csv')
Note if you are working with both the estimates and the margin of error files, you may want to put somewhere in the code to change the variable names to signify that, such as by putting a suffix of “e” or “m”. If you just work with the estimates though you don’t need to worry about that.
Reading ACS data into SPSS
For those working in SPSS, I’ve shown previously how to turn python data into SPSS data. I’ve started working on a function to make this simpler with pandas dataframes, but I will hold off on that for now (need to figure out datetimes and NaN’s). So what I did here was grab the meta-data from the template xls file (same as before), but from that build the necessary DATA LIST
command in SPSS, and then just submit the command. SPSS has the added benefit of having native meta-data fields, so I can also add in the variable labels. Also, this only uses the xlrd library, in case you do not have access to pandas. (I point SPSS to Anaconda, instead of worrying about using pip with the native SPSS python install.)
So in SPSS, you would first define this function
*This just builds the necessary SPSS program to read in the american community survey data.
BEGIN PROGRAM Python.
#creating my own function to read in data
import xlrd, spss
def OpenACS(Template,Data):
book = xlrd.open_workbook(Template)
sh = book.sheet_by_index(0)
vars = [i.value for i in sh.row(0)]
labs = [i.value for i in sh.row(1)]
#this rewrites duplicate 'BLANK' names, mangle dups not working for me
n = 0
vars2 = []
for i in range(len(vars)):
if vars[i] == 'BLANK':
n += 1
vars2.append('BLANK.' + str(n))
else:
vars2.append(vars[i])
#check for if geo file or data file
if vars2[1] == 'FILETYPE': #regular data
ncols = sh.ncols - 6 #need the end of the number of variables
ext = ' (' + str(ncols) + 'F7.0)'
v1 = ' /FILEID FILETYPE (2A6) STUSAB (A2) CHARITER (A3) SEQUENCE (A4) LOGRECNO (A7) '
v2 = '\n '.join(vars2[6:])
Tab = Data[-10:-7] #Names the dataset based on the table number
else: #geo data file
ncols = sh.ncols
ext = ' (' + str(ncols) + 'A255)' #255 should be big enough to fit whatever str
v1 = " / "
v2 = '\n '.join(vars2)
Tab = "Geo"
#this sets errors off, implicit missing data for block groups
spss.Submit("PRESERVE.")
spss.Submit("SET RESULTS OFF ERRORS OFF.")
#now creating the import program to read in the data
begin = "DATA LIST LIST(',') FILE = '%s'" % (Data)
full_str = begin + v1 + v2 + ext + "\n."
#now reading in the dataset
spss.Submit(full_str)
#assigning a dataset name
datName = "DATASET NAME Table" + Tab + "."
spss.Submit(datName)
#now adding in the variable labels
for i,j in zip(vars2,labs):
#replaces double quotes with single quotes in label
strVal = """VARIABLE LABELS %s "%s".""" % (i,j.replace('"',"'"))
spss.Submit(strVal)
if Tab == "Geo":
spss.Submit("ALTER TYPE ALL (A = AMIN).")
spss.Submit("RESTORE.")
END PROGRAM.
Again this is much shorter if I only needed to worry about the data files and not the geo file, but that slight formatting difference is a bit of a pain. Here I use the errors off trick to suppress the data list errors for missing data (which is intended, as not all of the data is available at the block group level). But you will still get error messages in the SPSS syntax bottom section. They can be ignored if it is the “insufficient data” warning.
Here is an example of using this python function now to read the data into SPSS. This automatically assigns a dataset name, either based on the Table number, or “Geo” for the geographic data.
*Now reading in the data files I want.
BEGIN PROGRAM Python.
downFold = r'C:\Users\axw161530\Dropbox\Documents\BLOG\ACS_Python_SPSS\Data'
TemplateFold = downFold + r'\seq'
#reading in Data file, table number 2
OpenACS(TemplateFold + r'\Seq2.xls',downFold + r'\e20145ny0002000.txt')
#reading in Geo file
OpenACS(downFold + r'\2014_SFGeoFileTemplate.xls',downFold + r'\g20145ny.csv')
END PROGRAM.
EXECUTE.
And Voila, there is your small area American Community Survey data in SPSS. This will produce two different datasets, “Table002” and “TableGeo” that can be merged together via MATCH FILES
.
Let me know in the comments if you have already worked out a function to turn pandas dataframes into SPSS datasets.
2 Comments