Stop Teaching R. Teach Python.

There has been a gradual transition in social science teaching over the ~15+ years I have been a student and professor. In the aughts, it was still common to teach students legacy, closed source statistical software (SPSS, SAS, and Stata). When I was a PhD student in criminal justice at SUNY Albany, we had a specific class to learn SPSS, although most of the rest of the quantitative courses used Stata.

The R programming language has largely supplanted the closed source packages in social science education since the aughts though. (I do not have hard data, but that is my impression from what colleagues are using and what they teach in classes.)

I am familiar with all of the major statistical programs (I have written an R package, and you can see this blog for many examples of SPSS and a few for Stata). If the goal of coursework is to teach students skills that help them get a job, academics at social science institutions should teach their students Python. The current job market for quantitative work is dominated by Python positions.

To be clear, I am not fundamentally opposed to closed source programming languages (I have seen scenarios where SPSS/SAS make more sense than Hadoop systems; also, if you are a GIS analyst you should learn ESRI tools). This is simply an observation given the current private sector job market – focusing primarily on Python makes the most sense for social science students.

As an experiment, I went onto LinkedIn and did a search for “data scientist”. Your results will differ (mine are tailored to the Raleigh area, and also include more senior positions), but here is a table of the positions that came up on the first page, with a quick summary of the tech stacks they require. While this is not a systematic sample, it gives a reasonable snapshot of current expectations.

| Company             | Job Title                           | Tech Stack                                           | URL                                            |
|---------------------|-------------------------------------|------------------------------------------------------|----------------------------------------------- |
| Google              | Data Scientist (Google Voice 2)     | Python, R, SQL                                       | https://www.linkedin.com/jobs/view/4387751995/ |
| Deloitte            | AI Specialist                       | None specified                                       | https://www.linkedin.com/jobs/view/4376183670/ |
| Ascensus            | Principal Analytics                 | R, Python, SQL, GenAI/LLM                            | https://www.linkedin.com/jobs/view/4380164400/ |
| EY                  | AI Lead Engineer                    | Python, C#, R, GenAI/LLM                             | https://www.linkedin.com/jobs/view/4385954762/ |
| PwC                 | GenAI Python Systems Engineer (2)   | Python, SQL, Cloud Platforms, GenAI/LLM              | https://www.linkedin.com/jobs/view/4373604638/ |
| Affirm              | Senior Machine Learning Engineer    | Python, Spark/Ray                                    | https://www.linkedin.com/jobs/view/4326673670/ |
| Lexis Nexis         | Lead Data Scientist                 | Cloud Platforms, GenAI/LLM                           | https://www.linkedin.com/jobs/view/4316327742/ |
| EY                  | AI Finance                          | SQL, Python, Azure, GenAI/LLM                        | https://www.linkedin.com/jobs/view/4385085950/ |
| Korn Ferry          | Sr. Data Scientist                  | Python, R, Spark, AWS, GenAI/LLM                     | https://www.linkedin.com/jobs/view/4387433496/ |
| Deloitte            | Data Science Manager                | Python, Cloud                                        | https://www.linkedin.com/jobs/view/4304674642/ |
| First Citizens Bank | Senior Quant Model Developer        | Python, SAS, SQL                                     | https://www.linkedin.com/jobs/view/4365378242/ |
| First Citizens Bank | Senior Manager Quant Analysis       | Python, SAS, Tableau                                 | https://www.linkedin.com/jobs/view/4388131284/ |
| Jobot               | ML Solution Architect               | Python, Scala, Spark, AWS, Snowflake                 | https://www.linkedin.com/jobs/view/4384023540/ |
| Affirm              | Analyst II                          | SQL, Python, R, CPLEX/Gurobi, Databricks/Snowflake   | https://www.linkedin.com/jobs/view/4373303038/ |
| Red Hat             | Sr Machine Learning Engineer (vLLM) | Python, GenAI/LLM                                    | https://www.linkedin.com/jobs/view/4354827922/ |
| Alliance Health     | Director AI                         | Python (TensorFlow/PyTorch), Office Products, GenAI  | https://www.linkedin.com/jobs/view/4383011480/ |
| Nubank              | ML Data Engineer                    | Python, Ray/Spark                                    | https://www.linkedin.com/jobs/view/4376815752/ |
| Target RWE          | Senior Quant Data Scientist         | R                                                    | https://www.linkedin.com/jobs/view/4385293724/ |
| Siemens             | Senior Data Analytics               | SQL, Python, R, Tableau/PowerBI                      | https://www.linkedin.com/jobs/view/4377969531/ |
| Red Hat             | Sr Machine Learning Engineer        | Python, GenAI/LLM                                    | https://www.linkedin.com/jobs/view/4302769773/ |
| Lexis Nexis         | Director Data Sciences              | Python, R, GenAI/LLM                                 | https://www.linkedin.com/jobs/view/4387335028/ |
| Cigna               | Data Science Senior Advisor         | Python, SQL                                          | https://www.linkedin.com/jobs/view/4381766145/ |
| Thermo Fisher       | Senior Manager Data Engineering     | Fabric, PowerBI, Python, Databricks, Tableau, SAS    | https://www.linkedin.com/jobs/view/4372684009/ |

Of the positions:

  • 9/25 roles included R, but only one required R exclusively. The other 8 were Python/SQL/R
  • 22/25 included Python
  • 11/25 had a focus on Generative AI or LLMs

Python dominates R in the current job market for data science positions. Professors are doing their students a disservice by teaching R, the same way they would be doing a disservice by teaching their students to code in Fortran.

Another thing I noticed: not all that long ago, analyst-type jobs really only expected Excel (and maybe SQL). Now even the majority of analyst jobs expect Python (even more so than dashboard tools like PowerBI in this sample).

For individuals on the job market, I suggest running your own LinkedIn search like this to see the tech skills you need to at least get your foot in the door for an interview. I expected GenAI to be more popular (only 11/25), but a few other technologies showed up often enough that it may be worth becoming familiar with them to widen your potential pool (Cloud and Spark – I am surprised Databricks was not listed more often).

If you’re looking to build Python skills from scratch, I cover this in my book: Data Science for Crime Analysis with Python (available in paperback or epub at my store).

If also interested in learning about generative AI, see my book Large Language Models for Mortals: A Practical Guide for Analysts with Python.

You can use the coupon TWOFOR1 to get $30 off when purchasing multiple books from my store.

Using Claude Code to help me write

Using LLMs to help you write is understandably a touchy subject for many. There is quite a bit of AI slop coming out now, as it is really easy to just have the LLM tools think for you and write superficially OK but ultimately garbage prose.

For my recent book, LLMs for Mortals, I used Sonnet 4.1 to write the initial draft (for around $5). My prior book took around a year, whereas I finished this one in around two months. I definitely did a ton of copy-editing (maybe 20-30 hours per chapter on average), but I believe around 50% of the book material is the original Sonnet-generated prose.

LLMs are a tool – they can be used poorly, but I think they can be used quite well. Pangram, a tool used to detect AI writing, does not flag any of the passages in LLMs for Mortals as AI generated.

This blog post goes over my notes on how I used Claude Code to help me write (although it really is applicable to any of the current coding tools, like Codex or Gemini as well). As a meta-reference, this blog post is 100% written by myself directly, but I will link to a draft written using Claude Code later in the post for a frame of reference.

Copy Editing

First, even if you do not agree with having an LLM write for you directly, there is a use case that should be relatively uncontroversial – having an LLM take a copy-edit pass on your work.

Here is a recent example: the blog post on Crime De-Coder that goes over the benefits of using an API vs local LLMs. In that conversation, you can see my original draft and the suggestions that Claude’s desktop tool (the free version) gave.

Again, this is not really specific to Claude (it would have worked fine in ChatGPT as well). LLMs are good not only for spelling errors, but also for grammatical issues that spell check will not catch, as well as more general copy-editing advice on the content.

One point of this – to replicate my setup, you need to write in plain text. Most of the things I write are in some form of markdown (plain markdown for blog posts, and Quarto for longer reports/books/etc). This makes it much easier to use the tools, especially the command line interface (CLI) tools like Claude Code.

Writing New Content

There are two big issues currently with LLM writing:

  • it is potentially wrong
  • current LLM writing has a particular style that is itself becoming noticeable

For the first bullet, you need to review what it writes. It is much easier to have it write on content you are an expert in, so you can review it and spot errors. (It is the same problem with using the tools to help you write computer code – they are a boon for senior developers, but can write a ton of slop in which neophyte coders have a hard time spotting the issues.)

The second bullet, having the style mimic your own, is what I am going to discuss here. It is worth understanding at a high level how generative LLMs work – if you ask “answer question X” vs “here is a book, …, answer question X” the LLM will generate a different response. The “here is a book, …” part of the latter prompt is what is referred to as the context. Current models have context windows (the maximum size of the input) of around 500,000 words (technically around 1 million tokens, but one word is often multiple tokens).

You generally do not want to fill the context window 100%, but 500,000 words is a very large number – in plain text, that is multiple books. Another common prompting technique is called k-shot examples. It typically goes like:

example input1: ...text... expected_output: ...blah...
example input2: ...diff text... expected_output: ...blah2...
....

This is what you place in the context window; then you submit your usual prompt and have the LLM generate the content. The prior examples guide the LLM toward what you expect the final output to look like. This works the same way with writing – give the LLM prior examples of your writing to help it mimic your style.
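The k-shot pattern above can be sketched in plain Python. Here `build_kshot_messages` is a hypothetical helper (not a library function), and the message format follows the common chat-messages convention (role/content dicts) used by the major APIs:

```python
# Sketch of building a k-shot prompt as a chat-style messages list.
# build_kshot_messages is a made-up helper name for illustration.

def build_kshot_messages(examples, prompt):
    """examples is a list of (input_text, expected_output) pairs."""
    messages = []
    for inp, out in examples:
        messages.append({"role": "user", "content": inp})
        messages.append({"role": "assistant", "content": out})
    # the actual task goes last, after all the prior examples
    messages.append({"role": "user", "content": prompt})
    return messages

examples = [
    ("Summarize: the cat sat on the mat.", "A cat sat on a mat."),
    ("Summarize: rain fell all day in town.", "It rained all day."),
]
msgs = build_kshot_messages(examples, "Summarize: the dog barked at night.")
```

The resulting list then goes into whatever chat completion call you use; only the final user message is the actual task, everything before it is the style/format guidance.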

To keep it simple, I have created an example on GitHub to follow along. Basically, just have your prior writing (in text!), and then ask Claude Code something like:

review my prior blog posts in folder /blogposts, I am going to have you write a new blog post on topic X given the outline *after* you review the text

Then after your prior work is in the context window, feed the LLM an outline for what you want to write. In this example, I put the outline in an actual text file and said:

In the ClaudeWritingPost folder, review the outline.txt, then create a new md 
file, called ClaudeWritingExample.md, filling in the sections based on the 
outline

Claude Code will then go review the text file with the outline and write the post. In the GitHub repo I have my original outline for this same post, so you can compare side-by-side.

You can technically write custom commands and skills with Claude Code (or the other CLI tools) to save the steps of typing two prompts, but to keep it simple for folks I am just showing the two steps manually. It is really just those two steps – get your prior examples into the context window, and then feed an outline for what else to write.

In the GitHub repo you can see some additional CLAUDE.md files – these contain extra instructions. A common one I include is “do not include emojis”. LLM writing also tends to be verbose with excessive lists, so I have instructions to avoid those as well.

The generated blog post is not bad – I would suggest reading it as a proof of concept (I exported the session; you can see it cost around fifty cents). Part of the reason I do not typically use this for blog posts is that I often add or change things in the process of writing, so my personally written post is longer and has a few more elements.

So when would you use it? For technical writing, like Python tutorials, it works very well. Hence I could have it write the first pass of my LLM book and keep 50% of the content. I may use it for blog posts in the future (if I felt compelled to write something every day), but I will not take that plunge for now.

For longer pieces, like an entire paper or a book, I suggest not only making a detailed outline, but also having the LLM write it in smaller sections. This helps with reviewing the content, and keeps the LLM on track if you make edits or changes as you go. (In longer conversations it is more likely to degrade and make repeated errors.)

An Extra Note About Citations

I am not writing academic papers much anymore, but another fundamental problem with LLM writing is hallucinating citations. If you write in text markdown files, my suggestion is fairly simple – have the papers you want to cite in a bibtex file, and in-line in markdown, only cite papers in the form:

Citation, @item1 says blah [@item1; @item2]. For a specific page quote [@item1 p. 34-35].
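As a guardrail against hallucinated keys, you can check that every key cited in the markdown actually exists in the bibtex file. A minimal sketch, assuming simple `@article{key,`-style bibtex entries (`check_citations` is a made-up helper, not a standard tool):

```python
import re

def check_citations(markdown_text, bibtex_text):
    """Return the set of cited keys that are not defined in the bibtex file."""
    # keys cited in-line, e.g. @item1 or [@item1; @item2]
    cited = set(re.findall(r"@([A-Za-z0-9_:-]+)", markdown_text))
    # keys defined in bibtex, e.g. @article{item1,
    defined = set(re.findall(r"@\w+\{([^,\s]+),", bibtex_text))
    return cited - defined

md = "Citation, @item1 says blah [@item1; @item2]."
bib = "@article{item1,\n  title={A}}\n@book{item3,\n  title={B}}"
missing = check_citations(md, bib)
# missing contains 'item2', which is cited but not defined
```

Running a check like this after each generated section catches any citation the LLM invented before it makes it into a draft.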

The way I write my outlines, an item is typically “write a paragraph about X, cite papers a, b, c”. So my personal style of progressively filling in an outline works well with LLMs.

This presumes you already have a list of papers (and are not using the LLM to dynamically write your lit review based on papers you have not read). The next time I actually need to write an academic paper, I may write an MCP tool to query Semantic Scholar’s API and create a nice bibtex file.

But the solution here, again, is that you need to review the output for accuracy. Even without these tools, people are lazy and cite things they have not read, so that will continue to happen (the tools just make it easier). Those who figure out how to use the tools appropriately, though, can be much more productive writers.

Some notes on the unreliability of LLM APIs

Because my book, LLMs for Mortals, was created with Quarto, the code runs when I compile the book. It uses cached versions when the code has not changed, but for the parts that have a grey input and a following green output, it is guaranteed to be valid code that executed and generated the results.

I try to use temperature zero for most of the book, but some parts are stochastic. With reasoning models you cannot set the temperature, so some elements of Chapter 3 introducing the models, and basically all of the Chapter 6 section on agents, are stochastic. This actually gave me a better appreciation of the unreliability of these models: some instances would fail outright, and others I needed to recompile because the output was poor.

Under the hood, Jupyter caching keeps a separate cache for the epub and for the LaTeX document (used for the print version). So you technically get a slightly different book when you purchase epub vs paperback. When you have 60+ failure points per chapter (doubled when compiling to both epub and PDF), you get to glimpse a few of the warts of the API models.

These are also short snippets, without error catching or more robust JSON parsing – some of these issues I have effectively programmed away in production systems at work and would not even notice. I figured my notes may still be useful for others trying to rely on these systems for large-volume API calls.

OpenAI

All the models were generally reliable, but one of the stochastic-output examples with OpenAI gave me fits – I asked it to analyze a blog post on my Crime De-Coder site and pull information from the post. This is a bit tricky, as the reasoning model needs to recognize that the data is not in the post directly, but in an image.

On January 24th, though, this became totally unreliable. It would often fail to download the additional image, and even when it did, it was inconsistent in giving an accurate answer.

But now I can run the code below and it returns fine and dandy nearly every time. Here is a loop I ran 5 times that gives the correct answer (around 160, Tuesday at 4 AM).

from openai import OpenAI
import time

client = OpenAI()

prompt = """
Search <https://crimede-coder.com/blogposts/2024/Aoristic>, what is 
the maximum number of commercial burglaries in the chart and on what
day and hour? Do not use shorthand, give an actual number.

If you need to, download additional materials to answer the question.

Be concise in your output.
"""

for _ in range(5):
    # minimal reasoning with responses API
    response = client.responses.create(
        model="gpt-5.2",
        reasoning={'effort': 'low'},
        tools=[{"type": "web_search"}],
        input=prompt,
    )
    time.sleep(20) # to prevent going over my limit
    print(response.output_text)
    print('-------------')

My only guess is there was some downgrade in model capability, and the reasoning requests were routed behind the scenes to some less capable model. (Just on January 24th though!)

Otherwise, the stochastic examples in the book using OpenAI were pretty reliable.

Anthropic

In the structured outputs chapter, I go through examples of parsing JSON vs progressively building on Pydantic outputs. I actually give examples where Pydantic schemas can cause filling in of data you do not want (if a field should be null and you use k-shot examples, the model will often fill it in from your last example).

So this chapter really is a ton of advice on prompt engineering for structured outputs. One example I show is using stop sequences when generating JSON and parsing the text (not really necessary where Pydantic schemas are supported and best practice, but I still use this with AWS Bedrock, since it does not support them yet).

This code works fine; what is inconsistent is that on very rare occasions, Anthropic’s API returns the closing bracket at the end of the call. That subsequently generates an error in this code, as the result is invalid JSON with an extra bracket.
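A defensive parse along these lines handles it – this is a sketch of the general idea, not the exact code from the book:

```python
import json

def parse_json_lenient(text):
    """Parse model output as JSON, tolerating one stray trailing bracket."""
    text = text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # if the stop-sequence bracket leaked back into the output,
        # drop one trailing brace/bracket and retry before giving up
        if text and text[-1] in "}]":
            return json.loads(text[:-1])
        raise

good = parse_json_lenient('{"a": 1}')
fixed = parse_json_lenient('{"a": 1}}')  # extra closing brace tolerated
```

In a production system you would likely also add a retry of the API call itself, but for rare one-character glitches like this, a lenient parse is the cheaper fix.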

Production systems at the day job use AWS, and I wrote the text parsing in a way that I would not even see this error (so I am not sure whether it also happens with AWS). It was quite rare with Anthropic; I just compiled the book enough times to notice it happen on a few occasions.

Google

In the book I show off Google Maps grounding, since it is a unique capability of Google’s – and it was very unreliable. Not unreliable in the sense of returning an error and being unavailable, but unreliable as in “I cannot find any Google Maps data right now”. So this would compile; I just needed to look at the output and make sure it actually returned something useful.

You can see I switched to the Vertex API for this example – I cannot confidently say whether Vertex was more reliable than the Gemini API here. I experienced issues with both (maybe slightly fewer with Vertex).

The Anthropic error is not so bad – it causes an actual error in the system. The Google case, where the LLM outputs something but it is not good, troubles me more. We are really just piloting agentic systems at the day gig now with a small number of users – they have not been stress tested by a large number of users. I don’t even want to think about how I would monitor Maps grounding in production given my experience.

AWS

On AWS I had only one example that did not consistently work – calling the DeepSeek model.

The prior code calling Anthropic models via Bedrock, and the later-chapter examples with Mistral and different embedding models (Cohere and Amazon’s Titan), were all fine. Just this single DeepSeek example would randomly not work. By “not work” I mean the API would return a response, but the content would be empty – so the error occurred at the final print statement, accessing text that did not exist.
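The workaround in production is to guard before accessing the text. A sketch of that kind of check – the dict shape here is illustrative, not necessarily Bedrock’s exact response schema:

```python
def extract_text(response):
    """Return the model text, or None if the response came back empty.
    The nested dict shape here is illustrative, not an exact API schema."""
    content = response.get("output", {}).get("message", {}).get("content", [])
    if not content or "text" not in content[0]:
        return None  # empty response: caller can retry or skip
    return content[0]["text"]

ok = extract_text({"output": {"message": {"content": [{"text": "hi"}]}}})
empty = extract_text({"output": {"message": {"content": []}}})
```

Checking for `None` (and retrying or logging) rather than indexing directly turns a random crash into a handled case.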

For most of my work, even if DeepSeek is cheaper, I need to consider caching, and with caching Haiku is pretty competitive with the other models. So I do not have much experience on Bedrock with models besides Anthropic’s.

My biggest gripe with AWS is that the IAM permissions are too difficult (and have changed over the past year). I was able to reasonably figure out how to use S3 Vectors and batch inference (both discussed in the book). I was able to figure out Knowledge Bases, but took it out of the book (it is too expensive for hobby projects to keep the search endpoint running). OpenAI’s vector store is super easy though, so I will definitely consider that for traditional RAG applications moving forward.

Buy the book!

Use promo code LLMDEVS for 50% off the epub, or if you prefer, purchase the paperback.

Large Language Models for Mortals book

I have published a new book, Large Language Models for Mortals: A Practical Guide for Analysts with Python. The book is available to purchase in my store, either as a paperback (for $59.99) or an epub (for $49.99).

The book is a tutorial on using Python with all the major LLM foundation model providers (OpenAI, Anthropic, Google, and AWS Bedrock). It goes through the basics of API calls, structured outputs, RAG applications, and tool-calling/MCP/agents. It also has a chapter on LLM coding tools, with example walkthroughs for GitHub Copilot, Claude Code (including how to set it up via AWS Bedrock), and Google’s Antigravity editor. (There are a few examples of local models as well – Chapter 2 discusses them before going on to the APIs in Chapter 3.)

You can review the first 60-some pages (PDF link here if on an iPhone).

While many of the examples in the book are criminology focused, such as extracting out crime elements from incident narratives, or summarizing time series charts, the lessons are more general and are relevant to anyone looking to learn the LLM APIs. I say “analyst” in the title, but this is really relevant to:

  • traditional data scientists looking to expand into LLM applications
  • PhD students (in all fields) who would like to use LLM applications in their work
  • analysts looking to process large amounts of unstructured textual data

Basically, for anyone who wants to build LLM applications, this is the book to help you get started.

I wrote this book partially out of fear – the rapid pace of LLM development has really upended my work as a data scientist. In just the past year or two, LLMs have become the most important skill set (more so than traditional predictive machine learning). This book is the one I wish I had several years ago, and it will give analysts a firm grounding in using LLMs in realistic applications.

Again, the book is available for purchase worldwide. Below are all the sections in the book – whether you are an AWS or Google shop, want to learn the different database alternatives for RAG, or want self-contained examples of agents with Python code for OpenAI, Anthropic, or Google, this should be a resource you highly consider purchasing.

Several more blog posts are coming in the near future: how I set up Claude Code to help me write (and not sound like a robot), how to use conformal inference and logprobs to set false positive rates for classification with LLMs, and some pain points of compiling a Quarto book with stochastic outputs (and the varying reliability of each of the models).

But for now, just go and purchase the book!


Below is the table of contents to review – the print version is over 350 pages (letter paper), with over 250 Python code snippets and over 80 screenshots.

Large Language Models for Mortals: A Practical Guide for Analysts with Python
by Andrew Wheeler
TABLE OF CONTENTS
Preface
Are LLMs worth all the hype?
Is this book more AI Slop?
Who this book is for
Why write this book?
What this book covers
What this book is not
My background
Materials for the book
Feedback on the book
Thank you
1 Basics of Large Language Models
1.1 What is a language model?
1.2 A simple language model in PyTorch
1.3 Defining the neural network
1.4 Training the model
1.5 Testing the model
1.6 Recapping what we just built
2 Running Local Models from Hugging Face
2.1 Installing required libraries
2.2 Downloading and using Hugging Face models
2.3 Generating embeddings with sentence transformers
2.4 Named entity recognition with GLiNER
2.5 Text Generation
2.6 Practical limitations of local models
3 Calling External APIs
3.1 GUI applications vs API access
3.2 Major API providers
3.3 Calling the OpenAI API
3.4 Controlling the Output via Temperature
3.5 Reasoning
3.6 Multi-turn conversations
3.7 Understanding the internals of responses
3.8 Embeddings
3.9 Inputting different file types
3.10 Different providers, same API
3.11 Calling the Anthropic API
3.12 Using extended thinking with Claude
3.13 Inputting Documents and Citations
3.14 Calling the Google Gemini API
3.15 Long Context with Gemini
3.16 Grounding in Google Maps
3.17 Audio Diarization
3.18 Video Understanding
3.19 Calling the AWS Bedrock API
3.20 Calculating costs
4 Structured Output Generation
4.1 Prompt Engineering
4.2 OpenAI with JSON parsing
4.3 Assistant Messages and Stop Sequences
4.4 Ensuring Schema Matching Using Pydantic
4.5 Batch Processing For Structured Data Extraction using OpenAI
4.6 Anthropic Batch API
4.7 Google Gemini Batch
4.8 AWS Bedrock Batch Inference
4.9 Testing
4.10 Confidence in Classification using LogProbs
4.11 Alternative inputs and outputs using XML and YAML
4.12 Structured Workflows with Structured Outputs
5 Retrieval-Augmented Generation (RAG)
5.1 Understanding embeddings
5.2 Generating Embeddings using OpenAI
5.3 Example Calculating Cosine similarity and L2 distance
5.4 Building a simple RAG system
5.5 Re-ranking for improved results
5.6 Semantic vs Keyword Search
5.7 In-memory vector stores
5.8 Persistent vector databases
5.9 Chunking text from PDFs
5.10 Semantic Chunking
5.11 OpenAI Vector Store
5.12 AWS S3 Vectors
5.13 Gemini and BigQuery SQL with Vectors
5.14 Evaluating retrieval quality
5.15 Do you need RAG at all?
6 Tool Calling, Model Context Protocol (MCP), and Agents
6.1 Understanding tool calling
6.2 Tool calling with OpenAI
6.3 Multiple tools and complex workflows
6.4 Tool calling with Gemini
6.5 Returning images from tools
6.6 Using the Google Maps tool
6.7 Tool calling with Anthropic
6.8 Error handling and model retry
6.9 Tool Calling with AWS Bedrock
6.10 Introduction to Model Context Protocol (MCP)
6.11 Connecting Claude Desktop to MCP servers
6.12 Examples of Using the Crime Analysis Server in Claude Desktop
6.13 What are Agents anyway?
6.14 Using Multiple Tools with the OpenAI Agents SDK
6.15 Composing and Sequencing Agents with the Google Agents SDK
6.16 MCP and file searching using the Claude Agents SDK
6.17 LLM as a Judge
7 Coding Tools and AI-Assisted Development
7.1 Keeping it real with vibe coding
7.2 VS Code and GitHub Install
7.3 GitHub Copilot
7.4 Claude Code Setup
7.5 Configuring API access
7.6 Using Claude Code to Edit Files
7.7 Project context with CLAUDE.md
7.8 Using an MCP Server
7.9 Custom Commands and Skills
7.10 Session Management
7.11 Hooks for Testing
7.12 Claude Headless Mode
7.13 Google Antigravity
7.14 Best practices for AI-assisted coding
8 Where to next?
8.1 Staying current
8.2 What to learn next?
8.3 Forecasting the near future of foundation models
8.4 Final thoughts

Part time product design positions to help with AI companies

Recently on the Crime Analysis sub-reddit an individual posted about working with an AI product company developing a tool for detectives or investigators.

The Mercor platform has many opportunities that may be of interest to my network, so I am sharing them here. These include positions not only for investigators, but also GIS analysts, writers, community health workers, etc. (For the eligibility interviewer roles, I think anyone who has worked in government services would likely qualify – it is just reviewing questions.)

All are part time (a minimum of 15 hours per week), remote, and can be based in the US, Canada, or UK. (They cannot support H-1B or OPT visas in the US.)

Additionally, for professionals looking to get into the tech job market, see these two resources:

I actually just hired my first employee at Crime De-Coder. Always feel free to reach out if you think you would be a good fit for the types of applications I am working on (Python, GIS, crime analysis experience). I will put you on the list to reach out to when new opportunities are available.


Detectives and Criminal Investigators

Referral Link

$65-$115 hourly

Mercor is recruiting Detectives and Criminal Investigators to work on a research project for one of the world’s top AI companies. This project involves using your professional experience to design questions related to your occupation as a Detective and Criminal Investigator. Applicants must:

  • Have 4+ years full-time work experience in this occupation;
  • Be based in the US, UK, or Canada
  • minimum of 15 hours per week

Community Health Workers

Referral Link

$60-$80 hourly

Mercor is recruiting Community Health Workers to work on a research project for one of the world’s top AI companies. This project involves using your professional experience to design questions related to your occupation as a Community Health Worker. Applicants must:

  • Have 4+ years full-time work experience in this occupation;
  • Be based in the US, UK, or Canada
  • minimum 15 hours per week

Writers and Authors

Referral Link

$60-$95 hourly

Mercor is recruiting Writers and Authors to work on a research project for one of the world’s top AI companies. This project involves using your professional experience to design questions related to your occupation as a Writer and Author.

Applicants must:

  • Have 4+ years full-time work experience in this occupation;
  • Be based in the US, UK, or Canada
  • minimum 15 hours per week

Eligibility Interviewers, Government Programs

Referral Link

$60-$80 hourly

Mercor is recruiting Eligibility Interviewers, Government Programs to work on a research project for one of the world’s top AI companies. This project involves using your professional experience to design questions related to your occupation as an Eligibility Interviewer, Government Programs. Applicants must:

  • Have 4+ years full-time work experience in this occupation;
  • Be based in the US, UK, or Canada
  • minimum 15 hours per week

Cartographers and Photogrammetrists

Referral Link

$60-$105 hourly

Mercor is recruiting Cartographers and Photogrammetrists to work on a research project for one of the world’s top AI companies. This project involves using your professional experience to design questions related to your occupation as a Cartographer and Photogrammetrist. Applicants must:

  • Have 4+ years of full-time work experience in this occupation;
  • Be based in the US, UK, or Canada;
  • Commit a minimum of 15 hours per week.

Geoscientists, Except Hydrologists and Geographers

Referral Link

$85-$100 hourly

Mercor is recruiting Geoscientists, Except Hydrologists and Geographers to work on a research project for one of the world’s top AI companies. This project involves using your professional experience to design questions related to your occupation as a Geoscientist, Except Hydrologists and Geographers. Applicants must:

  • Have 4+ years of full-time work experience in this occupation;
  • Be based in the US, UK, or Canada;
  • Commit a minimum of 15 hours per week.

Advice for crime analyst to break into data science

I recently received a question from a crime analyst looking to break into data science. Figured it would be a good topic for a blog post. I have written many resources over the years targeting recent PhDs, but the advice for crime analysts is not all that different: you need to pick up some programming, and likely some more advanced tech skills.

For background, the individual had SQL + Excel skills (many analysts have only Excel). For the vast majority of analyst roles you should be quite adept at SQL, but SQL alone is not sufficient for even an entry-level data science role.


For entry-level data science, you will need to demonstrate competency in at least one programming language. The majority of positions will want you to have Python skills. (I wrote an entry-level Python book exactly for someone in your position.)

You will likely also need to demonstrate competency in machine learning or in using large language models for data science roles. It used to be that Andrew Ng’s courses were the best recommendation (I see he has a spin-off, DeepLearning.AI, now), though that is secondhand; I have not personally taken them. LLMs are more popular now, so prioritizing learning how to call those APIs, build RAG systems, and do prompt engineering will, I think, make you slightly more marketable than traditional machine learning.

I have personally never hired anyone in a data science role without a master’s. That said, I would not have a problem with it if you had a good portfolio. (Nice website, GitHub contributions, etc.)

You should likely start looking at and applying to “analyst” roles now. Don’t worry if they ask for programming experience you do not have; just apply. For many roles the posting is clearly wrong or has totally unrealistic expectations.

At larger companies, analyst roles can have a better career ladder, so you may just decide to stay in that role. If not, you can continue pursuing additional learning opportunities toward a data science career.

Remote is more difficult than in person, but I would start by identifying companies that are crime-analysis adjacent (LexisNexis, ESRI, Axon) and applying to their current open analyst positions.

For additional resources I have written over the years:

The alt-ac newsletter has various programming and job search tips. The 2023 blog post goes through different positions (if you want, it may be easier to break into project management than data science, though you have a good background to get senior analyst positions), and the 2025 blog post goes over how to have a portfolio of work.

Cover page, Data Science for Crime Analysis with Python

What to show in your tech resume?

Jason Brinkley on LinkedIn the other day had a comment on the common look of resumes – I disagree with his point in part but it is worth a blog post to say why:

So first, when giving advice I try to be clear about what are just my idiosyncratic positions vs advice that I feel is likely to generalize. So when I say you should apply to many positions, because your probability of landing any single position is small, that is quite general advice. But here, I have personal opinions about what I want to see in a resume, and I do not really know what others want to see. Resumes, when cold applying, probably have to go through at least two layers (HR/recruiter and the hiring manager), each of whom will want to see different things.

People who have different-colored resumes, or resumes in different formats (sometimes with a sidebar), I do not remember at all. I only care about the content. So what do I want to see in your resume? (I am interviewing for mostly data scientist positions.) I want to see some type of external verification that you actually know how to code. Talk is cheap; it is easy to list “I know these 20 python libraries” or “I saved our company 1 million buckaroos”.

So things I personally like seeing in a resume are:

  • code on github that is not a homework assignment (it is OK if unfinished)
  • technical blog posts
  • your thesis! (or other papers you were first/solo author)

Very few people have these things, so if you do and you land in my stack, you are already at the like 95th percentile (if not higher) for resumes I review for jobs.

The reason you need outside verification that you actually know what you are doing is because people are liars. For our tech round, our first question is “write a python hello world program and execute it from the command line” – around half of the people we interview fail this test. These are all people who list that they are experts in machine learning and large language models, with years of experience in python, etc.
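For reference, a passing answer to that screen is literally just this (using `python3`, which is the typical command on Linux/Mac; on Windows it is plain `python`):

```shell
# write the program
echo 'print("hello world")' > hello.py
# and execute it from the command line
python3 hello.py
```

That is the whole bar, and half of self-described experts cannot clear it.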

My resume is excessive, but I try to practice what I preach (HTML version, PDF version).

I added some color, but have had recruiters ask me to take it off the resume before. So how many people actually click all those links when I apply to positions? Probably few if any – but that is personally what I want to see.

There are really only two pieces of advice I have seen repeatedly about resumes that I think are reasonable, but it is advice not a hard rule:

  • I have had recruiters ask for specific libraries/technologies at the top of the resume
  • Many people want to hear about results for project experience, not “I used library X”

So while I dislike the glut of people listing 20 libraries, I understand it from the point of a recruiter – they have no clue, so are just trying to match the tech skills as best they can. (The matching at this stage I feel may be worse than random, in that liars are incentivized, hence my insistence on showing actual skills in some capacity.) It is infuriating when you have a recruiter not understand some idiosyncratic piece of tech is totally exchangeable with what you did, or that it is trivial to learn on the job given your prior experience, but that is not going to go away anytime soon.

I’d note at Gainwell we have no ATS or HR filtering like this (the only filtering is for geographic location and citizenship status). I actually would rather see technical blog posts or personal GitHub code than a claim like “I saved the company 1 million dollars” in many circumstances, as that is just as likely to be embellished as the technical skills. For less technical hiring managers, though, it is probably a good idea to translate technical specs into plainer business implications.

I translated my book for $7 using openai

The other day an officer from the French Gendarmerie commented that they use my python for crime analysis book. I asked that individual, and he stated they all speak English. But given my book is written in plain text markdown and compiled using Quarto, it is not that difficult to pipe the text through a tool to translate it into other languages. (Knowing that epubs under the hood are just html, it would not surprise me if there is some epub reader that can use google translate.)

So you can see now I have available in the Crime De-Coder store four new books:

ebook versions are normally $39.99, and print is $49.99 (both available worldwide). For the next few weeks, you can use promo code translate25 (until 11/15/2025) to purchase epub versions for $19.99.

If you want to see a preview of the books first two chapters, here are the PDFs:

And here I added a page on my crimede-coder site with testimonials.

As the title says, this in the end cost (less than) $7 to convert to French (and ditto to convert to Spanish).

Here is code demo’ing the conversion. It uses OpenAI’s GPT-5 model, but likely smaller and cheaper models would work just fine if you did not want to fork out $7. It ended up being a quite simple afternoon project (parsing the markdown ended up being the bigger pain).

So the markdown for the book in plain text looks like this:

Because markdown uses blank lines to denote different sections, those end up being fairly natural breaks for translation. These GenAI tools cannot repeat back very long sequences, but a paragraph is a good length: long enough to have additional context, but short enough for the machine to not go off the rails when trying to just return the text you input. Then I just have extra logic to not parse code sections (which start/end with three backticks). I don’t even bother to parse out the other sections (like LaTeX or HTML); I just include in the prompt an instruction to not modify those.

So I just read in the quarto document, split by blank lines, then feed the text sections into OpenAI. I did not test this very much, just used the current default gpt-5 model with medium reasoning. (It is quite possible a non-reasoning smaller model will do just as well. I suspect the open models will do fine.)
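A minimal sketch of that loop is below. The helper names, the prompt wording, and the file handling are my own illustration, not the exact script; the split-and-skip-code logic is the part that matters.

```python
# Sketch of the translate-by-paragraph loop. Helper names, the prompt,
# and file handling are illustrative, not the exact script from the post.
try:
    from openai import OpenAI  # pip install openai
except ImportError:  # the parsing logic below still works without the SDK
    OpenAI = None

FENCE = "`" * 3  # three backticks, start/end of a markdown code block

def split_blocks(md_text):
    """Split markdown into (is_code, text) blocks.

    Prose blocks are separated by blank lines; fenced code blocks
    are kept whole so they are never sent for translation.
    """
    blocks, buf, in_code = [], [], False
    for line in md_text.splitlines():
        if line.startswith(FENCE):
            if not in_code and buf:  # flush pending prose
                blocks.append((False, "\n".join(buf)))
                buf = []
            buf.append(line)
            if in_code:  # closing fence: emit the code block whole
                blocks.append((True, "\n".join(buf)))
                buf = []
            in_code = not in_code
        elif line.strip() == "" and not in_code:
            if buf:
                blocks.append((False, "\n".join(buf)))
                buf = []
        else:
            buf.append(line)
    if buf:
        blocks.append((in_code, "\n".join(buf)))
    return blocks

PROMPT = ("Translate the following markdown from English to French. "
          "Return only the translated text, and do not modify any "
          "LaTeX or HTML markup.")

def translate(text, client, model="gpt-5"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": text}])
    return resp.choices[0].message.content

def translate_file(path_in, path_out):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path_in, encoding="utf-8") as f:
        md = f.read()
    out = [text if is_code else translate(text, client)
           for is_code, text in split_blocks(md)]
    with open(path_out, "w", encoding="utf-8") as f:
        f.write("\n\n".join(out))
```

Feeding one paragraph per API call is what keeps the model from drifting; the code blocks just pass straight through untouched.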

You will ultimately still want someone to spot check the results, and then do some light edits. For example, here is the French version when I am talking about running code in the REPL, first in English:

Running in the REPL

Now, we are going to run an interactive python session, sometimes people call this the REPL, read-eval-print-loop. Simply type python in the command prompt and hit enter. You will then be greeted with this screen, and you will be inside of a python session.

And then in French:

Exécution dans le REPL

Maintenant, nous allons lancer une session Python interactive, que certains appellent le REPL, boucle lire-évaluer-afficher. Tapez simplement python dans l’invite de commande et appuyez sur Entrée. Vous verrez alors cet écran et vous serez dans une session Python.

So the acronym is carried forward, but the description of the acronym is not. (And I went and edited that for the versions on my website.) But look at this section in the intro talking about GIS:

There are situations when paid for tools are appropriate as well. Statistical programs like SPSS and SAS do not store their entire dataset in memory, so can be very convenient for some large data tasks. ESRI’s GIS (Geographic Information System) tools can be more convenient for specific mapping tasks (such as calculating network distances or geocoding) than many of the open source solutions. (And ESRI’s tools you can automate by using python code as well, so it is not mutually exclusive.) But that being said, I can leverage python for nearly 100% of my day to day tasks. This is especially important for public sector crime analysts, as you may not have a budget to purchase closed source programs. Python is 100% free and open source.

And here in French:

Il existe également des situations où les outils payants sont appropriés. Les logiciels statistiques comme SPSS et SAS ne stockent pas l’intégralité de leur jeu de données en mémoire, ils peuvent donc être très pratiques pour certaines tâches impliquant de grands volumes de données. Les outils SIG d’ESRI (Système d’information géographique) peuvent être plus pratiques que de nombreuses solutions open source pour des tâches cartographiques spécifiques (comme le calcul des distances sur un réseau ou le géocodage). (Et les outils d’ESRI peuvent également être automatisés à l’aide de code Python, ce qui n’est pas mutuellement exclusif.) Cela dit, je peux m’appuyer sur Python pour près de 100 % de mes tâches quotidiennes. C’est particulièrement important pour les analystes de la criminalité du secteur public, car vous n’avez peut‑être pas de budget pour acheter des logiciels propriétaires. Python est 100 % gratuit et open source.

So it translated GIS to SIG in French (Système d’information géographique). Which seems quite reasonable to me.

I paid an individual to review the Spanish translation (if any readers are interested in giving me a quote for the French version copy-edits, I would appreciate it). She stated it is overall very readable, but has many minor things. Here is a sample of her suggestions:

The total number of edits she suggested was 77 (out of 310 pages).

If you are interested in another language just let me know. I am not sure about translation quality for the Asian languages, but I imagine it works OK out of the box for most languages derived from Latin. Another benefit of self-publishing: I can just have the French version available now, and if I am able to find someone to help with the copy-edits I will update the draft after I get their feedback.

I scraped the CrimeSolutions site

Before I get to the main gist, I am going to talk about another site. The National Institute of Justice (NIJ) paid RTI over $10 million to develop a forensic technology center of excellence over the past 5 years. While this effort involved more than just a website, the only thing that lives in perpetuity for others to learn from the center of excellence is the resources they provide on the website.

Once funding was pulled, this is what RTI did with those resources:

The website is not even up anymore (it is probably a good domain to snatch up if no one owns it anymore), but you can see what it looked like on the internet archive. It likely had over 1,000 videos and pages of material.

I have many friends at RTI. It is hard for me to articulate how distasteful I find this. I understand RTI is upset with the federal government cuts, but to just simply leave the website up is a minimal cost (and likely worth it to RTI just for the SEO links to other RTI work).

Imagine you paid someone $1 million for something. They build it, and later say “for $1 million more, I can do more”. You say OK, then after you have disbursed $500,000 you say “I am not going to spend more”. In response, the creator destroys all the material. This is what RTI did, except they had already been paid $11 million and were still to be paid another $1 million. Going forward, if anyone from NIJ is listening, government contracts to build external resources should be licensed in a way that prevents this from happening.

And this brings me to the current topic, CrimeSolutions.gov. It is a bit of a different scenario, as NIJ controls this website. But recently they cut funding to the program, which was administered by DSG.

Crime Solutions is a website where they have collected independent ratings of research on criminal justice topics. To date they have something like 800 ratings on the website. I have participated in quite a few, and I think these are high quality.

To prevent someone (for whatever reason) simply turning off the lights, I scraped the site and posted the results to github. It is a PHP site under the hood, but converting everything to run as a static HTML site did not work out too badly.
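The scrape was along these lines. This is a simplified sketch: the URL scheme, the file-naming rules, and the crawl logic here are my own illustration, not the actual CrimeSolutions layout or my exact script.

```python
# Simplified sketch of mirroring a small PHP site as static HTML.
# URL scheme and file-naming rules are illustrative assumptions.
import os
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

def local_name(url):
    """Map a PHP URL to a flat static filename, e.g.
    /program.php?id=42 -> program-42.html (hypothetical scheme)."""
    p = urlparse(url)
    page = p.path.strip("/").replace(".php", "") or "index"
    suffix = "-" + p.query.split("=")[-1] if p.query else ""
    return page.replace("/", "-") + suffix + ".html"

def rewrite_links(html, base_url):
    """Point internal .php links at their local static copies."""
    def repl(m):
        return 'href="%s"' % local_name(urljoin(base_url, m.group(1)))
    return re.sub(r'href="([^"]+\.php[^"]*)"', repl, html)

def mirror(start_url, out_dir="site"):
    """Breadth-first crawl of same-host pages, saving rewritten copies."""
    os.makedirs(out_dir, exist_ok=True)
    host = urlparse(start_url).netloc
    seen, todo = set(), [start_url]
    while todo:
        url = todo.pop(0)
        if url in seen or urlparse(url).netloc != host:
            continue  # skip duplicates and external links
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue
        for href in re.findall(r'href="([^"]+\.php[^"]*)"', html):
            todo.append(urljoin(url, href))
        with open(os.path.join(out_dir, local_name(url)), "w",
                  encoding="utf-8") as f:
            f.write(rewrite_links(html, url))
```

The whole trick is that saving every dynamic page under a deterministic filename, and rewriting the internal links to match, turns the PHP site into plain files that GitHub Pages can serve.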

So for now, you can view the material at the original website. But if that goes down, you have a close-to-functionally-identical site mirrored at https://apwheele.github.io/crime-solutions/index.html. So at least those 800-some reviews will not be lost.

What is the long term solution? I could be a butthead and take down my github page tomorrow (so clone it locally); me scraping the site is not really a solution so much as a stopgap.

Ultimately we want a long term, public, storage solution that is not controlled by a single actor. The best solution we have now is ArDrive via the folks from Arweave. For a one time upfront purchase, Arweave guarantees the data will last a minimum of 200 years (they fund an endowment to continually pay for upkeep and storage costs). If you want to learn more, stay tuned, as Scott Jacques and I are working on migrating much of the CrimRXiv and CrimConsortium work to this more permanent solution.

Recommend reading The Idea Factory, Docker python tips

A friend recently recommended The Idea Factory: Bell Labs and the Great Age of American Innovation by Jon Gertner. It is one of the best books I have read in a while, so I also want to recommend it to the readers of my blog.

I was vaguely familiar with Bell Labs given my interest in stats and computer science. John Tukey gets a few honorable mentions, but Claude Shannon is a central character of the book. What I did not realize is that almost all of modern computing can be traced back to innovations developed at Bell Labs. For a sample, these include:

  • the transistor
  • fiber optic cables (I did not even know fiber is made of very thin strands of glass)
  • the cellular network with smaller towers
  • satellite communication

And then you get a smattering of other discussions as well, such as the material science that goes into making underwater cables durable and shark resistant.

The backstory is that AT&T in the early 20th century had a monopoly on landline telephones. Similar to how most states now have a single electric provider – they were a private company, but blessed by the government to have that monopoly. AT&T intentionally maintained a massive research arm that it used to improve communications, but it also provided that research back to the public. Shannon was a pure mathematician; he was not under the gun to produce revenue.

Gertner basically goes through a series of characters who were instrumental in developing some of these ideas, and in creating and managing Bell Labs itself. It is a high-level recounting by Gertner, mostly from historical notebooks. One of the things I really want to understand is how institutions even tackle a project that lasts a decade – things I have been involved in at work that last a year are just dreadful due to transaction costs between so many groups. I can’t even imagine trying to keep on schedule for something so massive. I do not get that level of detail from the book; it is moreso that someone had an idea, developed a little tinker proof of concept, and then Bell Labs sank a decade and a small army of engineers into figuring out how to build it in an economical way.

This is not a critique of Gertner (his writing is wonderful, and really gives flavor to the characters). Maybe just sinking an army of engineers on a problem is the only reasonable answer to my question.

Most of the innovation in my field, criminal justice, is coming from the private sector. I wonder (or maybe hope and dream is a better description) if a company, like Axon, could build something like that for our field.


Part of the point for writing blog posts is that I do the same tasks over and over again. Having a nerd journal is convenient to reference.

One of the things I do not commonly have to do, but that seems to come up about once a year at my gig, is putzing around with Docker containers. As a note to myself: when building python apps, to get correct caching you want to install the libraries first, and then copy the app over.

So if you do this:

FROM python:3.11-slim
COPY . /app
RUN pip install --no-cache-dir -r /app/requirements.txt
CMD ["python", "main.py"]

Every time you change a single line of code, you need to re-install all of the libraries. This is painful. (For folks who like uv, it does not solve the problem, as with this approach you still need to download the libraries every time.)

A better workflow then is to copy over the single requirements.txt file (or .toml, whatever), install that, and then copy over your application.

FROM python:3.11-slim
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app
CMD ["python", "main.py"]

So now, only when I change the requirements.txt file will I need to redo that layer.

Now, I am a terrible person to ask about dev builds and testing code in this setup. I doubt I am doing things the way I should be. But most of the time I am just building this:

docker build -t py_app .

And then I will have logic in main.py (or swap out with a test.py) that logs whatever I need to the screen. Then you can either do:

docker run --rm py_app

Or if you want to bash into the container, you can do:

docker run -it --rm py_app bash

Then from in the container you can go into the python REPL, edit a file using vim if you need to, etc.

Part of the reason I want data scientists to be full stack is because at work, if I need another team to help me build and test code, it basically adds a minimum of 3 months to my project. Probably one of the most complicated things my team and I have done at the day job is figure out the correct magical incantations to properly build ODBC connections to various databases in Docker containers. If you can learn about boosted models, you can learn how to build Docker containers.