Sentence length in academic articles

While reviewing a paper recently it struck me that the content was (very) good, but the writing was stereotypical academic. My first impression was that this was caused by monotonously long sentences. See this advice from Gary Provost (via Francis Diebold). Part of the reason why long sentences are undesirable is not only for aesthetic reasons though — longer sentences are harder to parse, hold in memory, and subsequently understand. See Steven Pinker’s The Sense of Style writing guide for discussion.

So I did some text analysis of the sentences. To do the text analysis I used the nltk library in python, and here is the IPython notebook to replicate for yourself if you care to do so (apparently Wakari is not a thing anymore, so here is the corpus for Huck Finn and for my Small Sample paper). In the notebook I have saved two text corpuses, one my finished draft of this article. I compared the sentence length to Mark Twain’s Huckleberry Finn (text via here).

For a simple example getting started with the library, here is an example of tokenizing a string into words and sentences:

#some tests for http://www.nltk.org/, nice book to follow along
import nltk
#nltk.download('punkt') #need to download this for the English sentence tokenizer files

#this splits up punctuation
test = """At eight o'clock on Thursday morning Arthur didn't feel very good. This is a second sentence."""
tokens = nltk.word_tokenize(test)
print tokens

ts = nltk.sent_tokenize(test)
print ts

The first prints out each individual word (plus punctuation in some circumstances) and the second marks individual sentences. I have the line #nltk.download('punkt') commented out, as I downloaded it once already. (Running once in Wakari I did not need to download it again – I presume it would work similarly on your local machine.)

So what I did was transfer the PDF document I was reviewing to a text file and then clean up things like the section headers (ditto for my academic articles I compare it to). In Huckleberry I took out the table of contents and the “CHAPTER ?” parts. I also started a list of variables that were parsed as words but that I did not want to count after the sentences and words were tokenized. For example, an inline cite such as (X, 1996) would be split into 4 words with the original tokenizer, (, X, 1996 and ). The “x96” is an en-dash. Below takes those instances out.

#Get the corpus
f = open('SmallSample_Corpus.txt')
raw = f.read()

#Count number of sentences
sent_tok = nltk.sent_tokenize(raw)
ns = len(sent_tok)

#Count number of words
word_tok = nltk.word_tokenize(raw) #need to take out commas plus other stuff
NoWord = [',','(',')',':',';','.','%','\x96','{','}','[',']','!','?',"''","``"]
word_tok2 = [i for i in word_tok if i not in NoWord]
nw = len(word_tok2)

#Average Sentence length are words divided by sentences
print float(nw)/ns

There are inevitably more instances of things that shouldn’t be counted as words, but that makes the sentences longer on average. For example, I spotted a few possessive 's that were listed as different words. (The nltk library is smart and lists contractions as seperate words.)

So someone may know a better way to count the words, but all the articles should have the same biases. In my tests, here are the average number of words per sentence:

  • article I was reviewing, 28
  • my small sample article, 27
  • my working article (that has not undergone review), 25
  • Huck Finn, 20

So the pot is calling the kettle black here – my writing is not much better. I looked at the difference between an in-print article and a working draft, as responses to reviewers I bet will make the sentences longer. Hedges in statements that academics love.

Looking at the academic article histograms they are fairly symmetric, confirming my impression about monotonous sentence length. To make the histograms I used the panda’s library, which has a nice simple method.

sent_len = []
for i in sent_tok:
    sent_w1 = nltk.word_tokenize(i)
    sent_w2 = [i for i in sent_w1 if i not in NoWord]
    sent_len.append(len(sent_w2))

import pandas as pd

dfh = pd.DataFrame(sent_len)
dfh.hist(bins = 50);

Here is the histogram for my small sample paper:

And here it is for Huck Finn

(I’m not much of an exemplar for making graphs in python – forgive the laziness in the figures.) Apparently analyzing sentence length has a long history, see a paper by G. Udny Yule in 1939! From a quick perusal the long right tail is more usual for analyzing texts. The symmetry I see for this sample of academic articles is not the norm.

There could be more innocuous reasons for this. Huck Finn has dialogue with shorter sentences, and the academic articles have numbers and citations. (Although I think it is reasonable to count those things towards sentence complexity, “1” or “one” should have the same complexity.)

I will have to keep this in mind in the future (maybe I should write my articles in poem form)!

Music and distractions in the workplace

I was recently re-reading Zen and the Art of Motorcycle Maintenance, and it re-reminded me of why I do not like to listen to music in the workplace. The thesis in Pirsig’s book (in regards to listening to music) is simple; you can’t concentrate entirely on the task at hand if you have music distracting you. So those who value their work tend to not have idle distractions like music playing (and be all engrossed in their work).

I have worked in various shared workspaces (cubicles and shared offices) for quite a while now, and I do have a knack for going off into space and ignoring all of the background noise around me. But I still do not like listening to music, even though I have learned to cope with the situation. At this point I prefer the open office workspace, as there at least is no illusion of privacy. When I worked at a cubicle someone coming behind me and scaring me was basically a daily thing.

Scott Adams, the artist of the Dilbert comic, had a recent blog post saying that music is the lesser evil compared to constant distractions via the internet (email, facebook, twitter, etc.) This I can understand as well, and sometimes I turn off the wi-fi to try to get work done without distraction. I don’t see how turning on music helps, but given its prevalence it may just be differences between myself and other people. I should probably turn off the wi-fi for all but an hour in the morning and an hour in the afternoon everyday, but I’m pretty addicted to the internet at this point.

It partly depends on the task I am currently working on though how easily I am distracted. Sometimes I can get really engrossed in a particular problem and become obsessed with it to the point you could probably set the office on fire and I wouldn’t notice. For example this programming problem dominated my thoughts for around two days, and I ended up thinking of the general solution while I did not have access to the computer (while I was waiting for my car to get inspected). Most of the time though I can only give that type of concentration for an hour or two a day though, and the rest of the time I am working in a state of easy distraction.

Background music I don’t like, and other ambient noises I can manage to drown out, but background TV drives me crazy. My family was watching videos (on TV and tablets) the other day while I was reading Zen and ironically I became angry, because I was really into the book and wanted to give it my full concentration. I know people who watch TV in bed to go to sleep, and it is giving me a headache just thinking about it while I am writing this blog post.

I highly recommend both Zen and the Art of Motorcycle Maintenance and Scott Adam’s blog. I’m glad I revisited Zen, as it is an excellent philosophical book on the logic of science that did not make much of an impression on me as an undergrad, but I have a much better grasp of it after having my PhD and reading some other philosophy texts (like Popper).

Solving problems as a metaphor for scientific writing

One analogy I hear in academics describing the process of writing a literature review is identifying the gaps in prior literature(s). I was reading Helping doctoral students write: Pedagogies for supervision recently, and Kamler & Thomson used this same analogy in describing the process of writing a literature review for a dissertation (although it is generally the same for shorter articles or books). Similar terminology Kamler & Thomson describe are blank spots and blind spots (see page 45). In that same chapter since Kamler & Thomson suggest the use of appropriate metaphors in describing the work of writing a literature review, I figured a critique of this one to be apropos.

I do not think the analogy is completely off base — but I do not like it as it does not jive with my personal experience of how I go about writing an article or thinking about research more generally. The first reason I do not like this terminology is that it has negative connotations for prior research. I think of building knowledge as a more cumulative endeavour as opposed to filling in between the lines of prior research.

For an analogy, say a researcher is attempting to improve the fuel efficiency of small combustible engines. It is likely they take mostly prior engineering knowledge about combustible engines and provide some modifications to slightly improve the design. Filling a gap implies to me an explicit design flaw in prior engines, when in reality it is more likely the researcher brings new knowledge to improve the design, and only in the context of the new research is the old design potentially described as inefficient. A social science example may be evaluating the costs and benefits to a particular policy in place by a public institution. The policy may be evidence based, and so an evaluation of the policy provides new information to that agency of whether it works as intended, or more general scientific knowledge about applying that policy in a real world setting. Neither seem to me filling in a gap, more so contributing and/or refining a set of knowledge already established.

I like the metaphor of the accumulation of knowledge, like a pyramid one brick at a time, better in terms of describing what I do when I write a literature review as opposed to identifying gaps. A convenient format for a literature review is to take a historical walk through the literature, and let the chronological order of previous findings be the guide for how you write the lit. review. But that metaphor is not sufficient to me either, as it implies a very linear structure, whereas prior research strikes me as more sphere-like — there is a base to which you add but the direction of the current research is not limited by the trajectory of the prior work. (A more accurate physical analogy may be an irregular growth of cells — they may meander in any particular direction but they always need to be connected to the prior work.) The scientific writer imposes a linear structure when describing prior work, but in reality the prior literatures are not that focused on whatever particular problem the current article is trying to address.

That is why I like the simple metaphor of identifying and solving a problem as a descriptor of what I do when I write a literature review – or even more broadly about describing the decisions I make in my research agenda. There are several reasons I prefer this analogy to either the accumulation of knowledge or identifying gaps. Identifying gaps implies you can read the prior literature and the gaps will be obvious — this is not the case. The prior literature is written in a particular context – the authors cannot anticipate future conditions or how that work will potentially be applied in the future. The gap does not exist in the current or prior literatures, you as a writer/researcher make the gap. I prefer problem solving as opposed to the accumulation of knowledge because it implies the focused nature of the endeavour. You do not simply write a paper to add a linear line of prior knowledge, you use that prior knowledge to solve a particular problem you have in your current context. It is your job as a researcher to basically say how the prior knowledge helps to solve that problem, and then advance the current knowledge to solve your particular problem. (This focus on giving the writer agency seems to be in line with most of Kamler & Thomson’s advice as well.)

This is how Popper described how knowledge actually accumulates — people have problems and they try to learn how to solve them. There is no prior divine truth to which future knowledge is added. We simply have problems, and some research may show a better solution to that problem than prior knowledge (be it whether the prior knowledge is well established or simply folklore). The analogy is not perfect, as many researchers would say they do not solve problems but are simply describe reality, but is a frame of reference I find useful to describe how I approach writing, describe my research, and in particular how I approach consuming the prior literature. It shows how I take the prior work and apply it to my interest, I am not a passive reader when trying to synthesize prior work.