All posts tagged scraping

I scraped the Crime solutions site

Before I get to the main gist, I am going to talk about another site. The National Institute of Justice (NIJ) paid RTI over $10 million over the past 5 years to develop a forensic technology center of excellence. While this effort involved more than just a website, the only things that live in perpetuity for others to learn from are the resources the center provides on its website.

Once funding was pulled, this is what RTI did with those resources:

The website is not even up anymore (it is probably a good domain to snatch up if no one owns it anymore), but you can see what it looked like on the internet archive. It likely had 1,000+ videos and pages of material.

I have many friends at RTI. It is hard for me to articulate how distasteful I find this. I understand RTI is upset with the federal government cuts, but simply leaving the website up costs very little (and is likely worth it to RTI just for the SEO links to other RTI work).

Imagine you paid someone $1 million for something. They build it, and then later say “for $1 million more, I can do more”. You say ok, then after you have disbursed $500,000 you say “I am not going to spend more”. In response, the creator destroys all the material. This is what RTI did, except they had been paid $11 million and were still to be paid another $1 million. Going forward, if anyone from NIJ is listening, government contracts to build external resources should be licensed in a way that prevents that from happening.

And this brings me to the current topic, CrimeSolutions.gov. It is a bit of a different scenario, as NIJ controls this website. But recently they cut funding to the program, which was administered by DSG.

Crime Solutions is a website where they have collected independent ratings of research on criminal justice topics. To date they have something like 800 ratings on the website. I have participated in quite a few, and I think these are high quality.

To prevent someone (for whatever reason) simply turning off the lights, I scraped the site and posted the results to github. It is a PHP site under the hood, but changing everything to run as a static HTML site did not work out too badly.

So for now, you can view the material at the original website. But if that goes down, you have a nearly functionally identical site mirrored at https://apwheele.github.io/crime-solutions/index.html. So at least those 800-some reviews will not be lost.

What is the long term solution? I could be a butthead and take down my github page tomorrow (so clone it locally), so my scraping the site is not really a solution as much as a stopgap.

Ultimately we want a long term, public, storage solution that is not controlled by a single actor. The best solution we have now is ArDrive via the folks from Arweave. For a one time upfront purchase, Arweave guarantees the data will last a minimum of 200 years (they fund an endowment to continually pay for upkeep and storage costs). If you want to learn more, stay tuned, as Scott Jacques and I are working on migrating much of the CrimRXiv and CrimConsortium work to this more permanent solution.

Downloading Police Employment Trends from the FBI Data Explorer

The other day on the IACA forums, an analyst asked about comparing her agency’s per-capita rate of sworn/non-sworn employees to other agencies. This data is available via the FBI’s Crime Data Explorer. Specifically, they have released a dataset of employment rates, broken down by agency, over time.

The Crime Data Explorer to me is a bit difficult to navigate, so this post is going to show using the API to query the data in python (maybe it is easier to get via direct downloads, I am not sure). So first, go to that link above and sign up for a free API key.

Now, in python: the API works by asking for a specific agency’s ORI, as well as a date range. (You can query national and statewide totals as well, but I would rarely want those levels of aggregation.) So first we are just going to grab all of the agencies across the 50 states plus DC. This runs fairly fast, taking only a few minutes:

import pandas as pd
import requests

key = 'Insert your key here'

state_list = ("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
              "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
              "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
              "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI",
              "SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY","DC")

# Looping over states, getting all of the ORIs
fin_data = []
for s in state_list:
    url = f'https://api.usa.gov/crime/fbi/cde/agency/byStateAbbr/{s}?API_KEY={key}'
    data = requests.get(url)
    fin_data.append(pd.DataFrame(data.json()))

agency = pd.concat(fin_data,axis=0).reset_index(drop=True)

And the agency dataframe has just shy of 19k ORIs listed. Unfortunately this does not have much else associated with the agencies (such as the most recent population). It would be nice if this list had population counts (so you could compare yourself to other similar size agencies), but alas it does not. So the second part here, scraping all 18,000+ agencies, takes a while (let it run overnight).

# Now grabbing the full employment data
ystart = 1960   # some have data going back to 1960
yend = 2022
emp_data = []

# try/catch, as some of these can fail
for i,o in enumerate(agency['ori']):
    print(f'Getting agency {i+1} out of {agency.shape[0]}')
    url = ('https://api.usa.gov/crime/fbi/cde/pe/agency/'
          f'{o}/byYearRange?from={ystart}&to={yend}&API_KEY={key}')
    try:
        data = requests.get(url)
        emp_data.append(pd.DataFrame(data.json()))
    except Exception:
        print(f'Failed to query {o}')

emp_pd = pd.concat(emp_data).reset_index(drop=True)
emp_pd.to_csv('EmployeePoliceData.csv',index=False)

And that will get you 100% of the employee data on the FBI data explorer, including data for 2022.

To plug my consulting firm here, this is something that takes a bit of work. I pared this code example down to be quite minimal, but for longer running scraping jobs you want to periodically save results and have the code be able to resume from the last save point. So if you scrape 1000 agencies and your internet goes out, you don’t have to start from 0; you pick up where you left off.
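To make the save-point idea concrete, here is a minimal sketch of one way to do it. The names `scrape_resumable`, `fetch`, and the JSON checkpoint file are my own illustration, not part of the FBI API; it assumes the ids are strings (ORIs are) and that `fetch` returns something JSON-serializable.

```python
import json
import os

def _save(results, out_path):
    # Write to a temp file first, so a crash mid-write cannot corrupt the checkpoint
    tmp = out_path + '.tmp'
    with open(tmp, 'w') as f:
        json.dump(results, f)
    os.replace(tmp, out_path)

def scrape_resumable(ids, fetch, out_path, save_every=100):
    """Call fetch(id) for each id, checkpointing results to out_path as JSON."""
    # Reload earlier progress (if any) so a rerun skips finished ids
    results = {}
    if os.path.exists(out_path):
        with open(out_path) as f:
            results = json.load(f)
    new_since_save = 0
    for i in ids:
        if i in results:
            continue  # already scraped on a prior run
        try:
            results[i] = fetch(i)
        except Exception as e:
            print(f'Failed to query {i}: {e}')
            continue  # not stored, so a rerun will retry it
        new_since_save += 1
        if new_since_save >= save_every:
            _save(results, out_path)
            new_since_save = 0
    _save(results, out_path)  # final save
    return results
```

Here `fetch` would be a small function that builds the url for one ORI and returns the parsed JSON. If the job dies partway, rerunning the same call picks up from the last checkpoint, and any ids that failed get tried again.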

If interested in other tutorials like this, I suggest you check out two of my books:

Each can be purchased in either paperback or epub versions worldwide from my Crime De-Coder store.

If that is something you need, it makes sense to send me an email to see if I can help. For that and more, check out my website, crimede-coder.com:

by Andy Wheeler on July 29, 2023 • Permalink

Posted in Crime Analysis, Criminal Justice, Python

Tagged scraping

https://andrewpwheeler.com/2023/07/29/downloading-police-employment-trends-from-the-fbi-data-explorer/

Web scraping police data using selenium and python

So I have a few posts in the past on scraping data. One shows downloading and parsing structured PDFs, but almost all of the rest use either JSON API backends or just grab the HTML data directly. These are fairly straightforward to deal with in python. You generate the url directly, use requests, and then just parse the returned HTML however you want.

Came across a situation recently though where I needed to interact with the webpage. I figured a blog post to illustrate the process would be good. (For both myself and others!) So here I will illustrate entering data into San Antonio’s historical calls for service asp application (which I have seen several PDs use in the past).

It is tough for me to give general advice about scraping, since it involves digging into the source code for a website. Here if you click on the Historical Calls button, the url stays the same, but presents you with a new form page to insert your search parameters:

This is a bit of a red herring though. It turns out the entire page is embedded in what is called an iframe, so the host URL stays the same, but the window inside the webpage changes. On the prior opening page, if you hover over the link for Historical Calls you can see it points to https://webapp3.sanantonio.gov/policecalls/Reports.aspx, so that is the page we really need to pay attention to.

So for general advice: using Chrome to view a web page’s source HTML, you can right-click and select view-source:

And you can also go into the Developer tools to check out all the items in a page as well.

Typically before worrying about selenium, I study the network tab in here. You want to pay attention to the items that take the longest/have the most data. Typically I am looking for JSON or text files here if I can’t scrape the data directly from the HTML. (Example blog posts grabbing an entire dump of data here, and another finding a hidden/undocumented JSON api using this approach.) Here is an example network call when inputting the search into the San Antonio web-app.

The data is all being transmitted inside of the aspx application, not via JSON or other plain text files (don’t take my terminology here as authoritative, I really know near 0% about servers). So we will need to use selenium here. Using python you can install the selenium library, but you also need to download a driver (here I use chrome), and then add the location where you saved that exe file to your PATH environment variable.

Now you are ready for the python part.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
import pandas as pd

# Setting Chrome Options
chrome_options = Options()
#chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1080")
chrome_options.add_argument("log-level=3")

# Getting the base page
base_url = "https://webapp3.sanantonio.gov/policecalls/Reports.aspx"
driver = webdriver.Chrome(options=chrome_options)
driver.get(base_url)

Once you run this code, you will see a new browser pop-up. This is great for debugging, but once you get your script finalized, you can uncomment the line I commented out to run in headless mode (so it doesn’t bug you by flashing up the browser on your screen).

Now typically what I do is look at the HTML source (like I showed earlier), and then search for the input buttons in HTML. We are trying to figure out the elements we need to insert the data for us to submit a search. Here is the first input for an item we care about, the begin date of the search.

Now we can insert our own date by grabbing the element from the web-page. I grab it here by the “id” attribute in the HTML (many tutorials use xpath, which I am not as familiar with, but at least for these aspx apps what I show works fine). For dates that have a validation stage, you need to not only .send_keys, but to also submit to get past the date validation.

# Inserting date field for begin date
from_date = driver.find_element("id", "txtStart")
from_date.send_keys("10/01/2022")
from_date.submit()

Once you run that code you can actually view the web-page, and see that your date is entered! Now we need to do the same thing for the end date. Then we can put in a plain text zipcode. Since this does not have validation, we do not need to submit it.

# Now for end date
end_date = driver.find_element("id", "txtEndDate")
end_date.send_keys("10/02/2022")
end_date.submit()

# Now inserting text for zipcode
# (named zip_box so it does not shadow python's built-in zip)
zip_box = driver.find_element("id", "txtZipcode")
zip_box.send_keys("78207")
# Sometimes need to clear, zip_box.clear()

I have a note there on clearing a text box. Sometimes websites have pre-filled options. Sometimes web-sites also do not like .clear(), and you can simulate backspace keystrokes directly. This website does not like it if you clear a date-field for example.

Now the last part, I am going to select a drop-down. If you go into the HTML source again, you can see the list of options.

And now we can use the Select function I imported at the beginning to select a particular element of that drop-down. Here I select the crimes against persons.

# Now selecting dropdown
crime_cat = driver.find_element("id", "ddlCategory")
crime_sel = Select(crime_cat)
crime_sel.select_by_visible_text("Crimes Against Person Calls")

Many of these applications have rate limits, so you need to limit the search to tiny windows and subsets, and then loop over the different sets to grab all of the data. (Be nice and use time.sleep() between calls to get the results.)
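As a rough sketch of the tiny-windows idea, here is a small generator that chops a date range into consecutive windows, formatted the way the date fields in this form expect. The function name `date_windows` is my own, and I am assuming the windows should be inclusive and non-overlapping; check how a given site treats the end date before relying on that.

```python
from datetime import date, timedelta

def date_windows(start, end, days=1):
    """Yield (begin, finish) strings in MM/DD/YYYY covering start..end inclusive."""
    cur = start
    while cur <= end:
        # Each window spans `days` days, clipped so it never runs past `end`
        fin = min(cur + timedelta(days=days - 1), end)
        yield cur.strftime('%m/%d/%Y'), fin.strftime('%m/%d/%Y')
        cur = fin + timedelta(days=1)

# Sketch of how the loop would look:
# for begin, finish in date_windows(date(2022, 10, 1), date(2022, 10, 31), days=2):
#     ... fill in the form with begin/finish, submit, parse the table ...
#     time.sleep(5)  # pause between searches to be nice to the server
```

The pause length is a judgment call; I just pick something that keeps the scrape from hammering the site.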

Now we are ready to submit the query. The same way you can enter in text into input forms, buttons you can click are also labeled as inputs in the HTML. Here I find the submit button, and then .click() that button. (If there is a direct button to download CSV or some other format, it may make sense to click that button.)

# Now can find the View Data button and submit
view_data = driver.find_element("id", "btnSearch")
view_data.click()

Now that we have our web-page, we can get the HTML source directly and then parse that. Pandas has a nice method to grab tables, and this application is actually very nicely formatted. (I tend to not use this, as many webpages have some very bespoke tables that are hard to grab directly like this). This method grabs all the tables in the web-page by default, here I just want the calls for service table, which has an id of "gvCFS", which I can pass into the pandas .read_html function.

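# Pandas has a nice option to read tables directly
# (newer pandas versions want a file-like object, not a raw HTML string)
from io import StringIO
html = driver.page_source
cfs = pd.read_html(StringIO(html), attrs={"id":"gvCFS"})[0]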

And that shows grabbing a single result. Of course to scrape, you will need to loop over many days (and here different search selections), depending on what data you want to grab. Most of these applications have search limits, so if you do too large a search, it will only return the first, say, 500 results. San Antonio’s is nice because it returns everything as a single table in the web-page. For most you need to page through the results as well, which takes further interaction with the page. So it is more painful whenever you need to resort to selenium.

Sometimes pages will point to PDF files, and you can set Chrome’s options to download to a particular location in that scenario (and then use os.rename to name the PDF whatever you want after it is downloaded). You can basically do anything in selenium that you can do manually; it is often just a tricky set of steps to replicate in code.
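Here is a hedged sketch of that PDF scenario. The two Chrome preferences shown are ones commonly used for this purpose; the folder name and the helper `rename_new_download` are hypothetical names of my own. The helper only uses the standard library: it waits for a file that was not in the folder before the click and is not a partial `.crdownload` file, then renames it.

```python
import os
import time

# Preferences you would pass to Chrome before creating the driver, e.g.
# chrome_options.add_experimental_option("prefs", prefs)
download_dir = os.path.abspath("downloads")  # hypothetical folder name
prefs = {"download.default_directory": download_dir,
         "plugins.always_open_pdf_externally": True}  # save PDFs rather than preview

def rename_new_download(folder, before, new_name, timeout=30):
    """Wait for a finished file not listed in `before`, then rename it."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Chrome marks in-progress downloads with a .crdownload suffix
        new = [f for f in os.listdir(folder)
               if f not in before and not f.endswith('.crdownload')]
        if new:
            target = os.path.join(folder, new_name)
            os.rename(os.path.join(folder, new[0]), target)
            return target
        time.sleep(0.5)
    raise TimeoutError(f'No finished download found in {folder}')

# Usage sketch:
# before = set(os.listdir(download_dir))
# pdf_link.click()   # whatever element triggers the download
# rename_new_download(download_dir, before, 'calls_2022_10_01.pdf')
```

Taking the `before` snapshot just ahead of the click is what lets this pick out the new file even when earlier downloads are still sitting in the folder.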

Downloading geo files from Census FTP using python

I was working with some health data that only has MSA identifiers the other day. Not many people seem to know about the US Census’s FTP data site. Over the years they have had various terrible GUIs to download data, but I almost always just go to the FTP site directly.

For geo data, check out https://www2.census.gov/geo/tiger/TIGER2019/ for example. Pandas/geopandas in python also has the nicety that you can point to a url (even a url of a zip file), and load the data into memory. So getting the MSA areas was very simple:

# Example download MSA
import geopandas as gpd
from matplotlib import pyplot as plt

url_msa = r'https://www2.census.gov/geo/tiger/TIGER2019/CBSA/tl_2019_us_cbsa.zip'
msa = gpd.read_file(url_msa)
msa.plot()
plt.show()

Sometimes the census has files spread across multiple states. So here is an example of doing some simple scraping to get all of the census tracts in the US. You can combine the geopandas dataframes the same as pandas dataframes using pd.concat:

# Example scraping all of the zip urls on a page
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests
from urllib.parse import urljoin

def get_zip(url):
    # verify=False skips SSL certificate checks
    front_page = requests.get(url,verify=False)
    soup = BeautifulSoup(front_page.content,'html.parser')
    # all links with zip in the href
    zf = soup.find_all("a",href=re.compile(r"zip"))
    # urljoin builds the full url from the page url and the relative href
    zl = [urljoin(url,i['href']) for i in zf]
    return zl

base_url = r'https://www2.census.gov/geo/tiger/TIGER2019/TRACT/'
res = get_zip(base_url)

geo_tract = []
for surl in res:
    geo_tract.append(gpd.read_file(surl))

geo_full = pd.concat(geo_tract)

# See State FIPS codes
# https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696

geo_full[geo_full['STATEFP'] == '01'].plot()
plt.show()

Unfortunately for the census data tables, such as https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/5_year_seq_by_state/Alabama/Tracts_Block_Groups_Only/, those zip files contain two files (an estimate file and a margin of error file), so you cannot just do pd.read_csv(url) for those tables. But for the shapefile zip files this appears to work just fine and dandy.
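For zip files with more than one file inside, one workaround is to download the zip into memory and pull out the member you want before handing it to pandas. Below is a minimal sketch using only the standard library; the helper name `open_zip_member` is my own, and selecting by a filename prefix (e.g. `'e'` for estimates vs `'m'` for margins of error) is an assumption about how the files are named that you should check against the actual zip contents.

```python
import io
import zipfile

def open_zip_member(zip_bytes, prefix):
    """Return a file-like object for the first member whose filename starts with prefix."""
    zf = zipfile.ZipFile(io.BytesIO(zip_bytes))
    for name in zf.namelist():
        # Compare against the bare filename, ignoring any folder inside the zip
        if name.rsplit('/', 1)[-1].startswith(prefix):
            return io.BytesIO(zf.read(name))
    raise KeyError(f'No member starting with {prefix!r} in {zf.namelist()}')

# Usage sketch (zip_url is whichever table zip you are after):
# content = requests.get(zip_url).content
# est = pd.read_csv(open_zip_member(content, 'e'))
# moe = pd.read_csv(open_zip_member(content, 'm'))
```

This keeps everything in memory, the same as pointing geopandas at a url, so nothing needs to be unzipped to disk.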

I am currently working on a project at work (but Gainwell has given me the thumbs up to open source it) to build tables to create the CDC’s Social Vulnerability Index, which I can build for multiple geographies in combo with the census data. So hopefully in the next few weeks I will be able to share that work.

by Andy Wheeler on February 28, 2022 • Permalink

Posted in Mapping, Python

Tagged census, geopandas, scraping

https://andrewpwheeler.com/2022/02/28/downloading-geo-files-from-census-ftp-using-python/


Andrew Wheeler
