All posts tagged LLM

Using Claude Code to help me write

Using LLMs to help you write is understandably a touchy subject for many. There is quite a bit of AI slop coming out now, as it is really easy to just have the LLM tools think for you and write superficially OK but ultimately garbage prose.

My recent book, LLMs for Mortals, I used Sonnet 4.1 to write the initial draft of the book (for around $5). My prior book took around a year, whereas I was able to finish this book in around two months. I definitely did a ton of copy-editing (maybe around 20-30 hours per chapter on average), but I believe around 50% of the book material is the original Sonnet generated prose.

LLMs are a tool – they can be used poorly, but I think they can be used quite well. Pangram, a tool used to detect AI writing, does not flag any of the passages in LLMs for Mortals as AI generated.

This blog post goes over my notes on how I used Claude Code to help me write (although it really is applicable to any of the current coding tools, like Codex or Gemini as well). As a meta-reference, this blog post is 100% written by myself directly, but I will link to a draft written using Claude Code later in the post for a frame of reference.

Copy Editing

First, even if you do not agree with having an LLM write for you directly, there is a use case that should be relatively uncontroversial – having an LLM take a copy-edit pass on your work.

Here is an example I used this for recently, the blog post on Crime De-Coder goes over the benefits of using an API vs local LLMs. In this conversation, you can see my original draft, and the suggestions that Claude’s desktop tool (the free version) gave.

Again this is not really specific to Claude (this would have worked fine in ChatGPT as well). LLMs are good for not only spelling errors, but grammatical issues that spell check will not catch, as well as just more general copy-editing advice on the content.

One point of this – to replicate my setup, you need to write in plain text. Most of the things I write are in some form of markdown (plain markdown for blog posts, and Quarto for longer reports/books/etc). This makes it much easier to use the tools, especially the command line interface (CLI) tools like Claude Code.

Writing New Content

There are two big issues currently with LLM writing:

it is potentially wrong
current LLM writing has a particular style that is itself becoming noticeable

The first bullet, you need to review what it writes. It is much easier to have it write on content you are an expert in, so it is easier to review and spot errors. (It is the same current problem with using the tools to help you write computer code – they are boons for seniors but can write a ton of slop that more neophyte coders have a hard time spotting issues.)

The second bullet, having the style mimic your own, is what I am going to discuss here. It is worth understanding at a high level how generative AI LLMs work – if you ask “answer question X” vs “here is a book, …., answer question X” the LLM will generate a different response. The first part in the former prompt, “here is a book, …” is what is referred to the context. Current models have context windows (how large of a potential input) at around 500,000 words (technically they are around 1 million tokens, one word is often multiple tokens though).

You generally do not want to fill up the context window 100%, but 500,000 words is a very large number – just including text it would be multiple books. Another common prompting technique is what is called k-shot examples. It will typically go like

example input1: ...text... expected_output: ...blah...
example input2: ...diff text... expected_output: ...blah2...
....

This is what you place in the context window, then submit your usual prompt, and have the LLM generate the content. It is giving prior examples to help guide the LLM what you expect the final output to look like. This works the same way with writing – give the LLM prior examples of your writing to help it mimic your future style.

To keep it simple, I have created an example on github to follow along. Basically just have your prior writing (in text!), and then ask Claude Code something like:

review my prior blog posts in folder /blogposts, I am going to have you write a new blog post on topic X given the outline *after* you review the text

Then after your prior work is in the context window, feed the LLM an outline for what you want to write. In this example, I put the outline in an actual text file and said:

In the ClaudeWritingPost folder, review the outline.txt, then create a new md 
file, called ClaudeWritingExample.md, filling in the sections based on the 
outline

Claude Code will then go and review the text file with the outline and write the post. In the github repo I have my original outline for this same post, so you can see side-by-side.

You can technically write custom commands and skills with Claude Code (or the other CLI tools) to save the steps of typing two prompts, but to keep it simple for folks I am just showing the two steps manually. It is really just those two steps – get your prior examples into the context window, and then feed an outline for what else to write.

In the Github repo you can see some additional Claude.md files – these are files that include additional instructions. A common one I say is “do not include emojis”. LLM writing also tends to be verbose and have excessive lists. So I have instructions to avoid those as well.

The written blog post is not bad – I would suggest to go and read it as a proof of concept (I exported the session, can see it cost around fifty cents). Part of the reason I do not typically worry about blog posts is that I often add in things/change things in the process of writing. So you can see my personally written post is longer and has a few more elements.

So when would you use it? Technical writing, like writing tutorials in python, it works very well. Hence I could have it write the first pass on my LLM book and keep 50% of the content. I may use it for blog posts in the future (if I felt compelled to write something every day). But will not take that plunge for now.

For longer pieces, like an entire paper or a book, I suggest to not only make a detailed outline, but to also have the LLM write it in smaller sections. This both helps with reviewing the content, as well as to keep the LLM on track if you make edits/changes as you go. (Longer conversations it is more likely to degrade and make repeated errors.)

An Extra Note About Citations

I am not writing academic papers much anymore, but another fundamental problem with LLM writing is hallucinating citations. If you write in text markdown files, my suggestion is fairly simple – have the papers you want to cite in a bibtex file, and in-line in markdown, only cite papers in the form:

Citation, @item1 says blah [@item1; @item2]. For a specific page quote [@item1 p. 34-35].

The way I write my outlines, it typically is like write a paragraph about X, cite papers a,b,c. So my personal style of progressively filling in an outline works well with LLMs.

So this presumes you already have a list of papers (and are not using the LLM to dynamically write your lit review based on papers you have not read). Next time I actually need to write an academic paper, I may write up an MCP tool to query Semantic Scholar’s API and create a nice bibtex file.

But the solution here is again you need to review the output for accuracy. People without these tools are lazy and cite things they have not read already, so that will continue to happen (the tools just make it easier). Those that figure out how to use the tools appropriately though can be much more productive writing.

Some notes on the unreliability of LLM APIs

Because my book, LLMs for Mortals, was created with Quarto, it runs the code when I compile the book. It uses cached versions when no code changes, but it is guaranteed to be working code for the parts that have a grey input and a following green output, it is valid code that executed and generated the results.

I try to use temperature zero for most of the book, but some of the parts of the book are stochastic. Reasoning models you cannot set the temperature, so some elements of Chapter 3 introducing the models, and basically all of the section in chapter 6 on agents is stochastic. This actually gave me a better appreciation of some of the unreliability of these models, as for some instances it would fail, and others I needed to recompile because the output was poor.

The way jupyter caching works under the hood, it has a separate cache for the epub and the LaTeX document (that is used for the print version). So you technically get a slightly different book when you purchase epub vs paperback. When you have 60+ failure points per chapter (and that gets doubled when compiling to both epub and PDF), you get to glimpse a few of the warts of the API models.

These are also short snippets, so do not have error catching or more robust JSON parsing, so some of these issues I basically programmed away in production systems at work and did not even notice them. I figured my notes may be useful though in general for others trying to rely on these systems with large volume API calls.

OpenAI

All the models were generally reliable, but one of the examples of stochastic outputs in OpenAI gave me fits – I asked OpenAI to analyze a blog post on my Crime De-Coder site and get information from the post. Now this is a bit tricky, as the reasoning model needs to see the data is not available in the post directly, but in an image.

January 24th, at one point though this became totally unreliable in its output. It would often fail to download the additional image, and when it did, it was pretty inconsistent actually giving an accurate answer.

But now I can run below and this just returns fine and dandy near every time. Here is a loop I ran 5 times and it gives the correct answer (around 160 Tuesday at 4 AM).

from openai import OpenAI
import time

client = OpenAI()

prompt = """
Search <https://crimede-coder.com/blogposts/2024/Aoristic>, what is 
the maximum number of commercial burglaries in the chart and on what
day and hour? Do not use shorthand, give an actual number.

If you need to, download additional materials to answer the question.

Be concise in your output.
"""

for _ in range(5):
    # minimal reasoning with responses API
    response = client.responses.create(
        model="gpt-5.2",
        reasoning={'effort': 'low'},
        tools=[{"type": "web_search"}],
        input=prompt,
    )
    time.sleep(20) # to prevent going over my limit
    print(response.output_text)
    print('-------------')

My only guess is there was some downgrade in the model capabilities, and it routed behind the scenes for the reasoning models to some less capable model. (Just on January 24th though!)

Otherwise, the stochastic examples in the book using OpenAI were pretty reliable.

Anthropic

In the structured outputs chapter, I go through examples of parsing JSON vs progressively building on Pydantic outputs. I actually give examples where Pydantic schema’s can cause some filling in of data that you do not want (if the data should be null, and you use k-shot examples, it will often fill in from your last example).

So this chapter really is a ton of advice on prompt engineering for structured outputs. One example I show is using stop sequences when generating JSON and doing text parsing (which is really not necessary and best practice with Pydantic schemas, but I still use this with AWS Bedrock, since it does not support that yet).

This code works fine, what is inconsistent is that on very rare occasion, Anthropic’s API returns the bracket at the end of the call. This subsequently generates an error with this code, as it is invalid JSON with an extra bracket.

Production systems at the day job use AWS, and I wrote the text parsing in a way I would not even see this error (so not sure if it also happens with AWS). And it was quite rare with Anthropic, I just compiled the book enough times to notice this error happen on a few occasions.

Google

In the book I show off using Google Map grounding, since it is a unique capability of Googles – it was very unreliable. Not unreliable in the sense it would return an error and not be available, but unreliable in “I cannot find any google maps data right now”. So this would compile, I would just need to go look at the output and make sure it actually returned something useful.

You can see I switched to the Vertex API for this example – I cannot confidently say if Vertex was more reliable than the Gemini API for this. I experienced issues with both (maybe slightly fewer with Vertex).

The Anthropic error is not so bad – it causes an actual error in the system. The reasoning and LLM outputs something, but it is not good, troubles me more. We are really just piloting agentic systems at the day gig now with a small number of users – they have not gotten really stress tested by a large number of users. I don’t even want to think about how I would monitor maps grounding in production given my experience.

AWS

AWS I only had one example not consistently work – calling the DeepSeek API.

In the prior code calling Anthropic models via Bedrock, and later chapters I have an example of Mistral and different embedding models (Cohere and Amazon’s Titan), were all fine. Just this single example from DeepSeek would randomly not work. By not work the API would return a response, but the content would be empty. So the final print statement is where the error occurred, accessing text that did not exist.

Most of my work, even if DeepSeek is cheaper, I need to consider caching. So Haiku is pretty competitive with the other models. So I do not have much experience in Bedrock with any models besides Anthropic ones.

My biggest gripe with AWS is the IAM permissions are too difficult (and have changed over the past year). I was able to reasonably figure out how to use S3 Vectors and batch inference (which is discussed in the book). I was able to figure out Knowledge Bases, but I just took it out of the book (both too expensive for hobby projects to have the search endpoint). OpenAI’s vector search store is super easy though, so will definately consider that for traditional RAG applications moving forward.

Buy the book!

Use promo code LLMDEVS for 50% off of the epub. Or if you prefer purchase the paperback.

Deep research and open access

Most of the major LLM chatbot vendors are now offering a tool called deep research. These tools basically just scour the web given a question, and return a report. For academics conducting literature reviews, the parallel is obvious. We just tend to limit the review to peer reviewed research.

I started with testing out Google’s Gemini service. Using that, I noticed almost all of the sources cited were public materials. So I did a little test with a few prompts across the different tools. Below are some examples of those:

Google Gemini question on measuring stress in police officers (PDF, I cannot share this chat link it appears)
OpenAI Effectiveness of Gunshot detection (PDF, link to chat)
Perplexity convenience sample (PDF, Perplexity was one conversation)
Perplexity survey measures attitudes towards police (PDF, see chat link above)

The report on officer mental health measures was an area I was wholly unfamiliar. The other tests are areas where I am quite familiar, so I could evaluate how well I thought each tool did. OpenAI’s tool is the most irksome to work with, citations work out of the box for Google and Perplexity, but not with ChatGPT. I had to ask it to reformat things several times. Claude’s tool has no test here, as to use its deep research tool you need a paid account.

Offhand each of the tools did a passable job of reviewing the literature and writing reasonable summaries. I could nitpick things in both the Perplexity and the ChatGPT results, but overall they are good tools I would recommend people become familiar with. ChatGPT was more concise and more on-point. Perplexity got the right answer for the convenience sample question (use post-stratification), but also pulled in a large literature on propensity score matching (which is only relevant for X causes Y type questions, not overall distribution of Y). Again this is nit-picking for less than 5 minutes of work.

Overall these will not magically take over writing your literature review, but are useful (the same way that doing simpler searches in google scholar is useful). The issue with hallucinating citations is mostly solved (see the exception for ChatGPT here). You should consult the original sources and treat deep research reports like on-demand Wikipedia pages, but lets not kid ourselves – most people will not be that thorough.

For the Gemini report on officer mental health, I went through quickly and broke down the 77 citations across the publication type or whether the sources were in HTML or PDF. (Likely some errors here, I went by the text for the most part.) For the HTML vs PDF, 59 out of 77 (76%) are HTML web-sources. Here is the breakdown for my ad-hoc categories for types of publications:

Peer Review (open) – 39 (50%)
Peer review (just abstract) 10 (13% – these are all ResearchGate)
Open Reports 23 (30%)
Web pages 5 (6%)

For a quick rundown of these. Peer reviewed should be obvious, but sometimes the different tools cite papers that are not open access. In these cases, they are just using the abstract to madlib how Deep Research fills in its report. (I consider ResearchGate articles here as just abstract, they are a mix of really available, but you need to click a link to get to the PDF in those cases. Google is not indexing those PDFs behind a wall, but the abstract.) Open reports I reserve for think tank or other government groups. Web pages I reserve for blogs or private sector white papers.

I’d note as well that even though it does cite many peer review here, many of these are quite low quality (stuff in MDPI, or other what look to me pay to publish locations). Basically none of the citations are in major criminology journals! As I am not as familiar with this area this may be reasonable though, I don’t know if this material is often in different policing journals or Criminal Justice and Behavior and just not being picked up at all, or if that lit in those places just does not exist. I have a feeling it is missing a few of the traditional crim journal sources though (and picks up a few sources in different languages).

The OpenAI report largely hallucinated references in the final report it built (something that Gemini and Perplexity currently do not do). The references it made up were often portmanteaus of different papers. Of the 12 references it provided, 3 were supposedly peer reviewed articles. You can in the ChatGPT chat go and see the actual web-sources it used (actual links, not hallucinated). Of the 32 web links, here is the breakdown:

Pubmed 9
The Trace 5
Kansas City local news station website 4
Eric Piza’s wordpress website 3
Govtech website 3
NIJ 2

There are single links then to two different journals, and one to the Police Chief magazine. I’d note Eric’s site is not that old (first RSS feed started in February 2023), so Eric making a website where he simple shares his peer reviewed work greatly increased his exposure. His webpage in ChatGPT is more influential than NIJ and peer reviewed CJ journals combined.

I did not do the work to go through the Perplexity citations. But in large part they appear to me quite similar to Gemini on their face. They do cite pure PDF documents more often than I expected, but still we are talking about 24% in the Gemini example are PDFs.

The long story short advice here is that you should post your preprints or postprints publicly, preferably in HTML format. For criminologists, you should do this currently on CrimRXiv. In addition to this, just make a free webpage and post overviews of your work.

These tests were just simple prompts as well. I bet you could steer the tool to give better sources with some additional prompting, like “look at this specific journal”. (Design idea if anyone from Perplexity is listening, allow someone to be able to whitelist sources to specific domains.)

Other random pro-tip for using Gemini chats. They do not print well, and if they have quite a bit of markdown and/or mathematics, they do not convert to a google document very well. What I did in those circumstances was to do a bit of javascript hacking. So go into your dev console (in Chrome right click on the page and select “Inspect”, then in the new page that opens up go to the “Console” tab). And depending on the chat browser currently opened, can try entering this javascript:

// Example printing out Google Gemini Chat
var res = document.getElementsByTagName("extended-response-panel")[0];
var report = res.getElementsByTagName("message-content")[0];
var body = document.getElementsByTagName("body")[0];
let escapeHTMLPolicy = trustedTypes.createPolicy("escapeHTML", {
 createHTML: (string) => string
});
body.innerHTML = escapeHTMLPolicy.createHTML(report.innerHTML);
// Now you can go back to page, cannot scroll but
// Ctrl+P prints out nicely

Or this works for me when revisiting the page:

var report = document.getElementById("extended-response-message-content");
var body = document.getElementsByTagName("body")[0];
let escapeHTMLPolicy = trustedTypes.createPolicy("escapeHTML", {
 createHTML: (string) => string
});
body.innerHTML = escapeHTMLPolicy.createHTML(report.innerHTML);

This page scrolling does not work, but Ctrl + P to print the page does.

The idea behind this, I want to just get the report content, which ends up being hidden away in a mess of div tags, promoted out to the body of the page. This will likely break in the near future as well, but you just need to figure out the correct way to get the report content.

Here is an example of using Gemini’s Deep Research to help me make a practice study guide for my sons calculus course as an example.

Andrew Wheeler

All posts tagged LLM

Using Claude Code to help me write

Copy Editing

Writing New Content

An Extra Note About Citations

Some notes on the unreliability of LLM APIs

OpenAI

Anthropic

Google

AWS

Buy the book!

Deep research and open access

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange