Crossword Corpus

Short Answer Is

This post is part of a series on crossword puzzles, data, and language.

Answer repetition isn't necessarily a bad thing, and it's certainly not unique to crosswords. It may seem odd to have predictable answers in a test of general knowledge, but it happens all the time. Pub trivia or quiz bowl regulars probably know which keywords point towards Crécy vs. Agincourt, and fans of Jeopardy! will tell you to study up on Cincinnati slugger Pete Rose to help your chances (his nickname "Charlie Hustle" alone has come up seven times). But some crossword answers take things too far...

Frequent crossword answers

No matter which crossword publisher you like best, odds are that after you solve a handful of puzzles, you'll start noticing repeated answers. Probably they're short, vowel-heavy words like ERA, ORE, and ALOE. Here are the most commonly used crossword puzzle answers.

Repeated answers owe their popularity to their makeup as words. They get called names like "bad fill" and "crosswordese". It doesn't matter if they're linguistically or conceptually obscure as long as they're useful. In this post, we'll take a look at one of the most useful traits a word can have: brevity.

Linguistic laws

The first electronic collection of language data was published in 1961 as the "Brown University Standard Corpus of Present-Day American English," or the Brown Corpus. Long before then, linguists had been trying to identify, model, and explain patterns in observed language. But with computers at our disposal, it's gotten easier to double check, build upon, generalize, and refine those theories. Some of them turn out to apply across a surprising range of language collections (corpora) and contexts. Take Menzerath's Law. In 1928, Paul Menzerath suggested that longer German words are often made up of shorter syllables. The concept was refined and generalized to other linguistic phenomena (e.g. longer sentences comprise shorter clauses) by Menzerath himself in 1954 and Gabriel Altmann in 1980. Many such observations have been quantified and codified into similar models -- things like the Brevity Law, Zipf's Law, and Heaps' (aka Herdan's) Law. These so-called linguistic laws aren't the be-all and end-all of corpus analysis, but they do provide helpful starting points for digging into the data.

Word frequency

A number of quantitative linguistic observations relate to word frequency. After all, when you've got a big collection of language, one of the easiest ways to chop it up is by word. And one of the easiest things to do with a whole bunch of words is to count them! It's a lot harder to try and tag things like part of speech, harder still to quantify linguistic contexts. And then there's feature scale -- you could look at syllables or clauses or sentences or...yikes. Still, we shouldn't totally discount the usefulness of word frequency. After all, people have written whole books about it [1] [2]! And it just so happens that word -- or answer -- frequency is exactly what we're interested in!

Brevity law

The brevity law advises (qualitatively) that common words tend to be shorter. Check out the table up above and you'll see that the most common crossword answers are all three or four letters long. You can play around with the "minimum answer length" input to see how the frequency drops off as length goes up.

A common way to measure this kind of qualitative tendency is using a correlation test [3]. First, you plot every distinct answer's length vs its frequency:

The x-axis shows answer length. The y-axis shows the log (base 10) of word frequency (so 1 is really 10, 2 is really 100, etc.). The brighter a box's color, the more words have that length and that frequency (color is also on a log 10 scale). Right away it seems like we're on to something. As answer length goes up past three, the most common frequency for words of that length goes down. We can run a statistical test to see how monotonic (Spearman) or linearly related (Pearson) the data are -- in other words, how reliably does the graph point roughly down and to the right?

Just like we'd expect, both the Pearson and Spearman tests show a significant negative correlation between length and frequency.
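For anyone who wants to try this at home, here's a minimal sketch of the two correlation coefficients in plain Python. The `toy_answers` list is a made-up stand-in for the real corpus, and the helper names (`pearson`, `average_ranks`, `spearman`) are my own for illustration. It only computes the coefficients, not the significance; a stats library such as SciPy (`scipy.stats.pearsonr` and `scipy.stats.spearmanr`) would report p-values too.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient: how linearly related xs and ys are."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks (monotonicity)."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up toy data: one entry per answer occurrence (NOT the real corpus)
toy_answers = (
    ["ERA"] * 8 + ["ORE"] * 6 + ["ALOE"] * 5 + ["OLEO"] * 3
    + ["OCEAN"] * 2 + ["ANAGRAM"] + ["CROSSWORD"]
)

frequency = Counter(toy_answers)  # distinct answer -> number of occurrences
lengths = [len(answer) for answer in frequency]
log_freqs = [math.log10(count) for count in frequency.values()]

r = pearson(lengths, log_freqs)
rho = spearman(lengths, log_freqs)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Both numbers come out negative for the toy data, which is the "down and to the right" pattern described above.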

The probability mass function

I'm going to take a quick detour through the weeds. If you're not interested, feel free to jump forward to the next heading! As I was looking into word length and frequency, I came across a 2019 paper with a new suggestion about how to quantify the brevity law. It was written by Álvaro Corral and Isabel Serra at the Centre de Recerca Matemàtica in Barcelona [3]. They go through a pretty daunting (to me) analysis that culminates in a formulation of the brevity law in terms of conditional distributions of the probability mass function of type length and frequency. To break that down a little, (joint) probability mass is just the likelihood of choosing a word with a given length and frequency. So like, "what fraction of answers are five letters long and appear in 12 puzzles?" If that sounds like the histogram above, it should -- the histogram is basically a binned PMF. The "conditional" PMF in this case means the likelihood of choosing a word of a given frequency, where you're choosing from all words of a particular length. So like, "what fraction of five-letter words appear in 12 puzzles?" For crosswords, they look like this:

You can hide and show different lines using the legend. One of Corral and Serra's key observations is that these conditional probability distributions are roughly the same shape. If I'm understanding right, they use a scaling analysis to come up with a number that describes how the shapes relate to one another. That number is one way to quantify the brevity law. Really I ought to run a scaling analysis on the crossword data too, but just by eyeballing it, things look a little wrong. In particular, the low-frequency domain scales differently than the high. As length increases, the y-intercept goes up, while the x-intercept goes down. This makes a lot of sense to me, and I think maybe the reason it isn't addressed in the paper is that they largely ignore the < 10 frequency domain. Anyway, I'd love to dig into this analysis a little more but I think it's outside our scope for now. Do reach out if you have thoughts!
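If the joint vs. conditional distinction is still fuzzy, a tiny example might help. This is just a sketch of the definitions given above, using a made-up `toy_answers` list rather than the real corpus; it is not the scaling analysis from the paper, and the variable names are my own.

```python
from collections import Counter, defaultdict

# Made-up toy data: one entry per answer occurrence (NOT the real corpus)
toy_answers = (
    ["ERA"] * 3 + ["ORE"] * 3 + ["EEL"]
    + ["ALOE"] * 2 + ["OLEO"] + ["AREA"] + ["OCEAN"]
)

frequency = Counter(toy_answers)  # distinct answer (type) -> occurrences
n_types = len(frequency)

# Joint PMF: fraction of all types with a given (length, frequency) pair
joint_counts = Counter((len(answer), count) for answer, count in frequency.items())
joint_pmf = {pair: n / n_types for pair, n in joint_counts.items()}

# Conditional PMF: among types of one length, the fraction with each frequency
types_per_length = Counter(len(answer) for answer in frequency)
conditional_pmf = defaultdict(dict)
for (length, freq), n in joint_counts.items():
    conditional_pmf[length][freq] = n / types_per_length[length]

print("P(length=4, frequency=1) =", joint_pmf[(4, 1)])
print("P(frequency=1 | length=4) =", conditional_pmf[4][1])
```

The joint PMF sums to one over the whole table, while each conditional PMF sums to one within its own length. That's why the conditional curves for different lengths can be compared shape-to-shape.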

One last note: the authors also talk about the PMF's marginal distributions, which end up being pretty helpful for understanding what's going on with the brevity law. A marginal is the total probability for one variable, summed over every value of the other. The frequency marginal basically reduces to Zipf's Law, which you can read more about in its own post! The length marginal is the sum of the PMF across all frequencies for each length. In other words, how likely it is to pick an answer of a given length:

It turns out this is a pretty commonly examined distribution in its own right. But before we can talk about it, we need to define a couple of terms.

Tokens and types

Ok so when we talk about "picking an answer" from the corpus, it's not actually so simple. You can count answers two different ways -- either the total number of answers with length, say, four, OR the number of different distinct words with length four that show up as answers. Do you count OLEO 245 times or just once? You'll hear a range of terms for these concepts, like word "occurrences" (total) vs. "dictionary" words (distinct), or "tokens" (total) vs. "types" (distinct). Here's a graph of tokens and types:

Remember that the PMF earlier was the "probability mass function of type length and frequency"? It used types, distinct words. And hey, the length marginal distribution from earlier looks exactly like the "types" curve here (if it's hard to make out, try hiding the "tokens" curve by clicking on it in the legend). It says, for example, that there have been more distinct seven-letter answers than any other length. The "tokens" curve, on the other hand, counts duplicates multiple times. It peaks at four, meaning the NYT crossword asks you for a four-letter answer more often than any other length.
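Counting tokens and types by length takes only a few lines. Here's a sketch with a made-up `toy_answers` list (not the real corpus); dividing the types counts by the total number of types gives the length marginal mentioned earlier.

```python
from collections import Counter

# Made-up toy data: one entry per answer occurrence (NOT the real corpus)
toy_answers = ["OLEO", "ERA", "OLEO", "ALOE", "ERA", "OLEO", "OCEAN"]

# Tokens: every occurrence counts, so OLEO contributes three times
tokens_by_length = Counter(len(answer) for answer in toy_answers)

# Types: each distinct answer counts once, so OLEO contributes once
distinct_answers = set(toy_answers)
types_by_length = Counter(len(answer) for answer in distinct_answers)

# Length marginal over types: fraction of distinct answers with each length
length_marginal = {
    length: count / len(distinct_answers)
    for length, count in types_by_length.items()
}

for length in sorted(tokens_by_length):
    print(length, tokens_by_length[length], types_by_length[length])
```

In the toy data, length four has four tokens (OLEO three times plus ALOE) but only two types.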

These two curves show up a fair amount in the literature. According to some papers, they should both fit a lognormal distribution [4] [5], although others contest this [3]. Linguists were running this kind of analysis in 1958 [6] and still running it in 2012 [7]. Here's how crosswords compare to a reference English corpus:

(Aside: choosing a reference corpus is a delicate business. I'm using a frequency list of English Wikipedia words gathered in 2019 to represent written English, but there are a ton of other great resources. If anyone wants to buy me a COCA license, be my guest!)

The data line up pretty well! The "types" curves both have a primary peak around seven letters. The tokens look fairly similar too, with early peaks that fall off pretty quickly.

But we can recognize some characteristically "crossword-y" things. Words longer than seven letters are mostly underrepresented in the crossword corpus, which makes sense, since they can be harder to fit into a grid. There are relatively more types of lengths three to six, which might reflect that crosswords use a lot of abbreviations. The crossword tokens distribution is shifted towards the long end because crosswords generally don't have one- or two-letter answers, while one- and two-letter words are very common in normal English. There are noticeable spikes in the crossword curves at length 15, the width/height of a normal weekday puzzle, and 21, the width/height of a Sunday puzzle. There are also smaller spikes at 23 and 25, other semi-common grid sizes. As you'd imagine, puzzle creators like to use answers as wide/tall as the grid itself -- check out this page for some impressive 15-letter-loving puzzles!

Answer repetition

So what can we say about repeated answers in the crossword? Well, let's go back to the brevity law. One way to think about it goes like this. There are 26^3 = 17,576 possible words of length three (using a 26-letter alphabet) and 26^6 = 308,915,776 possible six-letter words. Let's say in a given corpus there are 1000 three-letter words and 1000 six-letter words. The 1000 three-letter words come from a smaller pool of possible three-letter words, so each word has a tendency to show up more often.
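The arithmetic is easy to check. This little sketch (the function name `possible_words` is just for illustration) computes the pool sizes and what share of each pool the hypothetical 1000 words would take up:

```python
ALPHABET_SIZE = 26
WORDS_IN_CORPUS = 1000  # the hypothetical count from the paragraph above


def possible_words(length, alphabet_size=ALPHABET_SIZE):
    """Number of distinct letter strings of the given length."""
    return alphabet_size ** length


for length in (3, 6):
    pool = possible_words(length)
    share = WORDS_IN_CORPUS / pool
    print(f"length {length}: {pool:,} possible strings, "
          f"1000 words would be {share:.6%} of them")
```

A thousand three-letter words would cover about 5.7% of all possible three-letter strings, while a thousand six-letter words would cover only about 0.0003% of the six-letter ones. Of course most letter strings aren't real words, so treat this as a rough intuition rather than a model.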

We can apply the same thinking to the crossword corpus. Take a look back at the tokens and types curves. Anywhere that the tokens curve is high, we're using a lot of words of that length. Anywhere that the types curve is low, there aren't all that many distinct words to choose from. The bigger the gap between the two lines, the more times some answer of that length must have been repeated. It doesn't tell us about the distribution -- OLEO could have shown up hundreds of thousands of times and all other four-letter answers just once -- but it's a start!
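Here's a quick sketch of that "tokens minus types" gap, using two made-up mini-corpora (the names `skewed`, `even`, and `repeat_gap` are mine, for illustration only). They have the same gap at length four even though the repeats are spread very differently, which is exactly the caveat above.

```python
from collections import Counter


def repeat_gap(answers):
    """Tokens minus types for each answer length: the number of repeat uses."""
    tokens_by_length = Counter(len(answer) for answer in answers)
    types_by_length = Counter(len(answer) for answer in set(answers))
    return {
        length: tokens_by_length[length] - types_by_length[length]
        for length in sorted(tokens_by_length)
    }


# Made-up toy corpora (NOT the real data), each with six tokens and three types
skewed = ["OLEO"] * 4 + ["ALOE", "AREA"]            # one answer hogs the repeats
even = ["OLEO"] * 2 + ["ALOE"] * 2 + ["AREA"] * 2   # repeats are shared equally

print(repeat_gap(skewed), max(Counter(skewed).values()))
print(repeat_gap(even), max(Counter(even).values()))
```

Both corpora report a gap of three at length four, but the most-repeated answer shows up four times in one and only twice in the other.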

Toggle on the "tokens - types" trace to look at one measure of repetition. It's highest at lengths three to five, which should come as no surprise given the table of common words at the beginning of this post. But if you zoom in to the long-answer end of the graph, you may notice that the difference doesn't actually hit zero until 21, which means some really long answers have actually shown up more than once! Here's a list of some oft-repeated long answers:

Almost all of these are phrases or proper nouns. It would seem that constructors, just like the rest of us, love Leonardo da Vinci, Arturo Toscanini, and, of course, Grover Cleveland. The BLANKBLANK... answers are from a 2015 puzzle with the word BLANK as a rebus in every square around the perimeter of the puzzle. Just goes to show that language is a little weird in crossword puzzles. Answers can be made up of many words or fake words. And don't forget that a lot of the phenomena we've seen arise partly from the NYT's rules, which specify things like a maximum word count (incentivizing longer words) and rotational symmetry (perhaps causing a more regular or characteristically shaped word length distribution).

Anyway, that's all for now, thanks for sticking around til the end!

References

[1] Popescu, I.I., G. Altmann, R. Köhler, P. Grzybek, and B.D. Jayaram. Word Frequency Studies. Mouton de Gruyter, 2009.

[2] Baayen, R. Harald. Word Frequency Distributions. Springer Netherlands, 2001.

[3] Corral, Álvaro; Serra, Isabel. 2020. "The Brevity Law as a Scaling Law, and a Possible Origin of Zipf’s Law for Word Frequencies" Entropy 22, no. 2: 224. https://doi.org/10.3390/e22020224

[4] Herdan, G. "The Relation Between the Dictionary Distribution and the Occurrence Distribution of Word Length and Its Importance for the Study of Quantitative Linguistics." Biometrika 45, no. 1-2 (1958): 222-228.

[5] Torre, Iván G., Bartolo Luque, Lucas Lacasa, Christopher T. Kello, and Antoni Hernández-Fernández. 2019. "On the physical origin of linguistic laws and lognormality in speech." R. Soc. Open Sci. 6: 191023.

[6] G.A. Miller, E.B. Newman, & E.A. Friedman (1958). Length-frequency statistics for written English. Information and Control, 1(4), 370-389.

[7] Reginald D. Smith. "Distinct word length frequencies: distributions and symbol entropies." (2012).