This post is part of a series on crossword puzzles, data, and language.
Just how common are the most common crossword answers? Find out here! A word on sources: I got a lot out of this article [1] by Dr. Steven T. Piantadosi, head of the computation and language lab at UC Berkeley. It gives a great breakdown of Zipf's law -- different formulations, fields of applicability, some theorized explanations from the literature, and the shortcomings of those explanations.
Certain words are especially useful when constructing a crossword, and they appear frequently in puzzles. Sometimes as solvers we'll complain that we see such "crosswordese" over and over again. There's no doubt that some answers show up more than others. But I think it's worth taking a look at just how common crosswordese really is.
In most languages, word frequency is distributed unevenly. A very small fraction of the words in the dictionary make up a very large fraction of the words you regularly speak or write. In a corpus, the most common words are often much, much more common than, say, the 100th most common word.
It turns out word frequency falls off according to a power law. Let's say you have a language corpus whose most common word is THE, which shows up 120 times. For most languages, the next most common word in the corpus would only appear 120⁄2 = 60 times. The third-place word will appear 120⁄3 = 40 times, fourth-place 120⁄4 = 30 times, and so on and so on.
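To make that arithmetic concrete, here's a minimal Python sketch. The function name `zipf_expected` and its `alpha` parameter are made up for illustration; with the exponent left at 1, it reproduces the halves, thirds, and quarters from the example.

```python
def zipf_expected(top_count, rank, alpha=1.0):
    """Idealized count for the word at `rank`, given the count of the rank-1 word.

    `alpha` is the power-law exponent (assumed to be 1 for the simple example).
    """
    return top_count / rank ** alpha


# The toy corpus from the text: THE appears 120 times.
for rank in range(1, 5):
    print(rank, zipf_expected(120, rank))
# prints:
# 1 120.0
# 2 60.0
# 3 40.0
# 4 30.0
```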
Or at least, that's what George Kingsley Zipf proposed in the 1930s. His idea, with some modifications, has not only stuck but spread to other applications. It's been used to describe city populations, wealth inequality, even the distribution of notes in music. Anything you can rank. The most frequent word in a corpus has rank one, the second-most frequent word has rank two, third-place has rank three, and so on. Zipf's law says that the rth most frequent word will have a frequency f that is a negative power law function of its rank r. Mathematically:
f(r) = k / r^α
for some normalizing constant k. For languages, you'd usually expect α to have a value around 1. Here's the rank-frequency graph for the top 1,000 most-frequent words in an English-language corpus of Wikipedia articles.
It looks like Zipf's law holds up pretty well. You can even see that the second-most common word, OF (72 million), is about 1⁄2 as common as the most frequent word, THE (152 million), just as the 100th-most common word, AMERICAN (1.5 million), is about 1⁄100th as common.
One useful thing about power laws is that if you take their logarithm, you get a linear equation. For the Zipf formula
f(r) = k / r^α
when you apply logs, you get
log(f(r)) = -α log(r) + log(k)
which is just a linear equation y = mx + b where y = log(f(r)), m = -α, x = log(r), and b = log(k).
All of which is to say that if you plot a rank-frequency graph on logarithmic axes, and if it follows Zipf's law, you should get a straight line. Here's that graph for the first 100,000 most-frequent words in the Wikipedia corpus.
This is a really good near-Zipfian distribution. The frequencies look almost linear using logarithmic axes. And by running a linear regression on the log of the data, you can fit them to a power law with an α close to one, just like GK Zipf predicted almost 100 years ago.
Of course, it's not that easy. You may have noticed that the fit line doesn't actually match the data very well for the first thousand or so words. In fact, a simple power law doesn't quite cut it. There have been a host of proposed distributions for a more refined version of Zipf's law, among them log-normal, Yule-Simon, and Mandelbrot's generalization of Zipf's law [1] [2]. Others have proposed fitting two separate power laws, one for the high-rank domain (ranks one, two, three, etc.) and one for the low-rank domain [3] [4].
You may also have noticed that the fit line matches the last 98,000 or so words suspiciously well. That's probably in part because of inherent correlation in rank-frequency graphs like this one [1]. The x-axis shows the rank of a word's frequency while the y-axis shows...well...the word's frequency. Frequency is actually determining rank, so the plot of frequency vs. rank is guaranteed to monotonically descend. How do you know if you're looking at a meaningful trend?
The answer is to decorrelate the data -- you need to compare two separate corpora against each other. For more on how decorrelation works, check out the statistical notes down below.
Here's the rank-frequency graph for the top 500,000 most-frequent words in the English Wikipedia corpus, only this time decorrelated and with an adjusted fit.
This is just what we'd expect from a near-Zipfian distribution. The high-rank domain on the left fits a power law very well, with α near one. The low-rank domain on the right fits a steeper power law function, and begins to fan out due to statistical uncertainty.
So does any of this apply to crossword puzzles? Let's take a look at the rank-frequency distribution of puzzle answers.
Now, we're working with a lot less data, but even so it's a pretty bad fit to Zipf's law. Most notably, the high-rank domain falls off quite slowly, with an α far below one. There are two possible interpretations. Either crosswords have a lot more repetition in their mid-rank words, or the high-rank words aren't actually that frequent, relatively speaking.
Here are two things to consider. First, repeating an answer within a puzzle is frowned upon. In English, "the" could appear hundreds of times on a single page of a book, but crossword answers have a cap on usage density. Second, when you build a crossword, you're not trying to fill your grid with crosswordese. The way words and letters interact with the grid makes it hard to avoid sometimes, but with effort you can keep it to a minimum.
I favor the interpretation that high-rank words have relatively low frequencies compared to Zipf's law. If crosswords were more like normal English, answers of ERA, AREA, and ERE would actually be way, way more common. So the next time you're feeling
On the other hand, even though repetition in crosswords happens less often than repetition in English, it's somehow more noticeable. It rankles. I think gilding common answers with flexible cluing can go a long way to reducing the unpleasantness of repetition. It's the "spoonful of sugar" that helps crosswordese go down (or across). For more on cluing and feel-bad answers, check out the post on keyness.
It seems pretty likely that word meaning plays a role in Zipf's law. Think about it this way: some words are so vague that they aren't actually all that useful day-to-day. You don't go around saying "why did the entity cross the travelled way?". On the other hand, highly specific words, though occasionally very useful, are often not relevant. Words that apply in a medium range of contexts are more likely to be used often [5]. In part, you can think of Zipf's law as an observation about how often you need to communicate certain concepts.
One way to see how semantics affects frequency is to look at groups of very similar words -- words whose only difference is their meaning -- and compare their frequencies. A great example is number words ("one", "two", "three"...). It sounds kind of crazy but in many languages, frequency vs. cardinality follows a power law with α ≈ 2 [6].
The pattern is pretty clear for the English data. Numbers "two" through "nine" follow a power law very closely. There's a trough until "nineteen" because, come on, who writes out the teens? But Zipf's law picks back up again at "twenty" and holds reasonably well. Some exceptions are the nice round numbers "ten", "fifty", and "hundred". You might also notice "one" looking a little low -- I wonder if that's because its meaning is shared with "an" and "a"?
Looking at the crossword data, things are more all over the place. But there's one interesting feature: a clear power law from ONE to FIVE. That includes THREE, which is relatively longer than the others, and FIVE, with its tough-to-cross V. You might expect answer usage to be determined entirely by things like word length and vowel count, but meaning definitely plays a role too! Just think about words like THIS, THEIR, and BEING. They're relatively uncommon in crosswords partly because they're so darn bad to clue. A good puzzle, like good trivia, will use semantic specificity to its advantage.
There's a lot of writing out there about Zipf's law. I wanted to mention a couple details that didn't fit in the main post above.
First, a fun tie-in. It turns out you can formulate Zipf's law as the probability mass function of frequency [3] [7]:
Q(j) = k j^(-β)
where Q(j) is the probability a given word is in the corpus j times, k is some constant, and β is about 2. This is exactly the same as the frequency marginal distribution discussed in the post on the brevity law!
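As a rough illustration of what Q(j) measures, here's a small Python sketch that tabulates it empirically for a list of tokens. The helper name `frequency_spectrum` and the toy sentence are assumptions for the example; a corpus this tiny obviously won't show a power law, it just shows the bookkeeping.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Return {j: fraction of distinct words that occur exactly j times}."""
    word_counts = Counter(tokens)                   # word -> occurrences
    words_per_count = Counter(word_counts.values())  # j -> number of distinct words
    n_types = len(word_counts)
    return {j: n / n_types for j, n in sorted(words_per_count.items())}


tokens = "the cat and the dog and the bird".split()
print(frequency_spectrum(tokens))
# prints: {1: 0.6, 2: 0.2, 3: 0.2}
```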
Second, an excuse. I opted for using the two-domain approach to fitting Zipf's law because it was easy with the regression library I'm using. Worse still, I totally hand-waved choosing the domain split rank. In theory you could fit the Zipf-Mandelbrot equation by running regressions for multiple trial values of β and maximizing some fit or likelihood function. If you want to learn more, you could check out [8] and [9].
Third, I wanted to explain a little more about how to decorrelate rank-frequency. Decorrelated data can only come from two separate corpora. If you plot rank from corpus A against frequency from corpus B, you're no longer guaranteed to get a monotonically descending rank-frequency. If you've only got one corpus, you need to randomly split it in two. In practice, for each word, you can use its original frequency as the characteristic n value of a binomial distribution with p = 0.5, then sample that distribution to get a value V. The word's new frequencies for your split corpora A and B are V and original frequency - V. It amounts to the same thing as running through each instance of the word and assigning it to one or the other split corpus on a coin flip [1].
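The binomial split described above can be sketched in a few lines of standard-library Python. The function names `split_counts` and `decorrelated_points` and the toy counts are hypothetical. To sample Binomial(n, 0.5) without extra dependencies, the sketch counts the 1-bits in n random bits, which is just n coin flips at once; for corpus-sized counts a dedicated binomial sampler would be a better choice.

```python
import random


def split_counts(word_counts, seed=None):
    """Randomly split each word's count between two pseudo-corpora A and B."""
    rng = random.Random(seed)
    corpus_a, corpus_b = {}, {}
    for word, n in word_counts.items():
        # Number of 1-bits among n fair random bits ~ Binomial(n, 0.5).
        v = bin(rng.getrandbits(n)).count("1") if n > 0 else 0
        corpus_a[word] = v
        corpus_b[word] = n - v
    return corpus_a, corpus_b


def decorrelated_points(word_counts, seed=None):
    """Return (rank, frequency) pairs: rank taken from A, frequency from B."""
    corpus_a, corpus_b = split_counts(word_counts, seed)
    ranked = sorted(corpus_a, key=corpus_a.get, reverse=True)
    return [(rank, corpus_b[word]) for rank, word in enumerate(ranked, start=1)]


# Toy counts, made up for the demo.
counts = {"the": 120, "of": 60, "and": 40, "to": 30}
for rank, freq in decorrelated_points(counts, seed=0):
    print(rank, freq)
```

Because rank and frequency now come from different halves, the plotted frequencies are free to go up as well as down from one rank to the next.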
Thanks for sticking with me all the way to the end! That's it for this post.
[2] Baayen, R. Harald. Word frequency distributions. Vol. 18. Springer Science & Business Media, 2002.