Datascience Portfolio

I created this page to display my open source work. Many people in data science embellish the technical achievements on their resumes; here I list projects where you can go and see the work directly.

Python

Focus on network statistics, linear programming, and machine learning

Using pytorch to build latent class mixture models
Synthetic control in python: Opioid death increases in Oregon and Washington, using custom Lasso implementation and conformal inference
Predicting algae blooms via satellite data, 2nd place in DrivenData competition ($9,000 in winnings)
Using linear programming to create optimal allocation with network spillovers

R

Focus on data visualization, spatial statistics, and causal inference

ptools package (CRAN, github), short for Poisson tools
Managing R environments using conda
Age, period, cohort graphs for suicide and drug overdoses using ggplot

SQL

Focus on metrics and testing

Github

Building CICD and automation

Caching huggingface models in github actions
Dallas crime dashboard and Raleigh pothole chart automated in github actions run on a crontab schedule
Setting up pyspark to run SQL tests in github actions

Tableau

Tutorials for creating advanced metrics relevant to monitoring crime trends

Temporal analysis (Seasonal Charts, Weekly time series with error bars)
Example Crime Analysis Dashboard

Javascript and PHP

Some examples of server side (PHP) and client side (Javascript) web apps I deployed on my Crime De-Coder consulting website:

Network prioritization tool runs all client side in javascript (due to sensitive data)
Sworn dashboard built using D3.js and Supabase backend
PHP + google sheets backend for a custom survey, can use query encoding to route to custom surveys (S1, S2)

Selected Publications Demonstrating Different AI/Statistic/Machine-Learning Skills

Causal Inference/Policy Analysis

Wheeler, AP & Ratcliffe, J (2018) A simple weighted displacement difference test to evaluate place based crime interventions. Crime Science 7:11. (Replication Code)

I develop a simple test to conduct difference-in-difference analysis for count data when you only have pre/post counts based on the Poisson distribution. I show the test has correct coverage even for fairly low count data using simulations. The technique is intended to help crime analysts, and I provide an excel spreadsheet that can implement the technique.
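As a rough illustration of the idea, here is a minimal sketch of a difference-in-differences on pre/post counts with a Poisson-based standard error. It is a simplified version that assumes only one treated and one control area (no displacement areas), and the function name and example counts are hypothetical, not taken from the paper or the spreadsheet.

```python
import math

def poisson_did(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences on raw counts with a Poisson-based standard error."""
    # Change in the treated area minus the change in the control area
    estimate = (treat_post - treat_pre) - (control_post - control_pre)
    # For independent Poisson counts the variance equals the mean, so the
    # variance of the estimate is approximated by the sum of the four counts
    std_error = math.sqrt(treat_pre + treat_post + control_pre + control_post)
    z_score = estimate / std_error
    return estimate, std_error, z_score

# Hypothetical counts: crime falls from 30 to 20 in the treated area
# and stays at 30 in the control area
est, se, z = poisson_did(treat_pre=30, treat_post=20, control_pre=30, control_post=30)
print(f"estimate={est}, se={se:.2f}, z={z:.2f}")
```

With counts this low the drop of 10 crimes is well within the noise, which is the kind of situation the test is meant to flag. The sketch assumes at least one non-zero count, since the standard error is otherwise zero.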

Circo, G & Wheeler, AP (2023) Using Every Door Direct Mail Web Push Surveys and Multi-level modelling with Post Stratification to estimate Perceptions of Police at Small Geographies. (Code repo)

This is the winning solution to the NIJ Challenge on Innovations in Measuring Community Attitudes for the non-probability sampling approach. We suggest sending a QR code on a mailer via every-door-direct-mail, and then using multi-level regression with post-stratification to correct for differential response bias and small samples. This approach is very cost effective compared to in-person canvassers, and is more geographically targeted than online approaches. Total winnings of $25,000.

Supervised/Unsupervised Machine Learning

Circo, G & Wheeler, AP (2021) National Institute of Justice Recidivism Forecasting Challenge Team “MCHawks” Performance Analysis. (Code Repo)

This technical report describes our submission to the NIJ Recidivism Forecasting Challenge, in which our predictive solution placed on the leaderboard in 7 different categories and we collected just under $40,000 in prizes. We discuss our solution to meeting the racial fairness constraints, and suggest alternative metrics for future competitions that are likely to be less volatile to optimize.

Wheeler, AP & S Reuter (2021) Redrawing hot spots of crime in Dallas, Texas. Police Quarterly 24(2): 159-184. (Preprint, R replication code)

I use an unsupervised clustering technique (DBSCAN) to identify cost of responding to crime hot spots. I find that, compared to the current hot spot areas implemented by Dallas PD, my identified areas are much smaller and capture the cost of crime at much higher densities. One hot spot I identified has over a million dollars of crime cost per year, suggesting a hot spot policing strategy is likely to have an appreciable return on investment for Dallas PD.

Wheeler, AP & W Steenbeek (2021) Mapping the risk terrain for crime using machine learning. Journal of Quantitative Criminology 37(2): 445-480. (Preprint, R replication code)

I spatially forecast long term robbery hot spots in Dallas using random forests and several other common techniques. I find random forests are much more accurate than other state-of-the-art techniques (e.g. RTM). I also use interpretable machine learning summaries to evaluate several different criminological theories and provide local summaries for individual hot spots.

Operations Research/Linear Programming/Network Algorithms

Wheeler, AP (2020) Allocating police resources while limiting racial inequality. Justice Quarterly 37(5): 842-868. (Preprint, Replication Code in python)

I tackle the problem of how hot spots policing exacerbates disproportionate minority contact, and construct a linear program intended to balance police targeting of hot spots while constraining the number of minorities likely to be stopped by the police.

Wheeler, AP, SJ McLean, KJ Becker, & RE Worden (2019) Choosing representatives to deliver the message in a group violence intervention. Justice Evaluation Journal 2(2): 93-117. (Preprint, python replication code)

I create a greedy social network algorithm to identify individuals who should be targeted for a gang intervention, where the motivation is to spread the deterrence message to the remaining gang members. I use simulations to show it often finds the minimal dominant set for networks of the typical size and density of gang networks.
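To give a flavor of the approach, here is a minimal sketch of a generic greedy heuristic for this kind of problem: repeatedly pick the person who reaches the most people not yet reached, until everyone is either picked or connected to someone picked. This is a textbook-style version and not necessarily the exact algorithm in the paper; the function name and the example network are hypothetical.

```python
def greedy_dominating_set(adjacency):
    """Greedily pick nodes until every node is chosen or adjacent to a chosen node."""
    uncovered = set(adjacency)
    chosen = []
    while uncovered:
        # Pick the node whose closed neighborhood (itself plus its neighbors)
        # covers the most uncovered nodes; ties go to the first node in sorted order
        best = max(
            sorted(adjacency),
            key=lambda node: len(({node} | set(adjacency[node])) & uncovered),
        )
        chosen.append(best)
        uncovered -= {best} | set(adjacency[best])
    return chosen

# Hypothetical undirected network, stored as node -> list of neighbors
network = {
    "A": ["B", "C", "D"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A", "E"],
    "E": ["D", "F"],
    "F": ["E"],
}
print(greedy_dominating_set(network))
```

A greedy pass like this is fast but is not guaranteed to return the smallest possible set, which is why checking it against simulated networks matters.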

Wheeler, AP (2019) Creating optimal patrol areas using the P-median model. Policing: An International Journal 42(3): 318-333. (Preprint, Replication code in python)

I formulate an integer linear program, with constraints on workload equality, to re-draw patrol beats for the Carrollton, TX police department. I find the redrawn beats are likely to be over 20% more efficient in reducing drive time to calls for service compared to the current beat layout.
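For readers unfamiliar with the P-median objective, here is a minimal sketch that evaluates it by brute-force enumeration on a toy example. It assumes each area is assigned to its nearest center and omits the workload equality constraints entirely, so it is not the integer linear program from the paper. Realistic problem sizes need a proper solver, and the function name, distances, and call counts are all hypothetical.

```python
from itertools import combinations

def p_median_brute_force(distance, calls, p):
    """Enumerate every set of p centers and return the set with the lowest
    call-weighted total distance, assigning each area to its nearest center."""
    n = len(distance)
    best_cost, best_centers = None, None
    for centers in combinations(range(n), p):
        # Each area travels to its closest chosen center, weighted by its calls
        cost = sum(calls[i] * min(distance[i][j] for j in centers) for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_centers, best_cost

# Hypothetical toy data: four areas along a line at positions 0, 1, 5, and 6
positions = [0, 1, 5, 6]
distance = [[abs(a - b) for b in positions] for a in positions]
calls = [10, 20, 5, 15]  # calls for service per area

centers, cost = p_median_brute_force(distance, calls, p=2)
print(centers, cost)
```

Enumeration grows combinatorially with the number of areas, which is one reason to state the problem as an integer linear program instead.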

I have additional papers on time series analysis, spatial analysis, developing custom small sample statistics, and custom data visualizations. See my google scholar page for my papers.

Andrew Wheeler
