Crossword Corpus

Check The Answer Key

This post is part of a series on crossword puzzles, data, and language.

In this post, we'll be using keyness to differentiate crosswordese from bad fill. For more information on keyness, check out Costas Gabrielatos's phenomenal chapter [1] in Corpus Approaches to Discourse [2], edited by Charlotte Taylor and Anna Marchi. I also got a lot of help from this log-likelihood wizard put out by Paul Rayson and UCREL at Lancaster University, as well as this chapter from Alvin Chen's National Taiwan Normal University course on corpus linguistics.

Relative frequency

Sometimes it feels like crosswordese plagues every puzzle you solve. But what is crosswordese? What makes certain answers feel a little cheesy or unfair?

Answers aren't bad just because they show up pretty often. What matters is whether or not you'd expect to see them in normal English. A word's relative frequency is how many times it shows up in a corpus, divided by the total number of words in that corpus. Crosswordese happens when there's a big difference between a word's relative frequency in English and its relative frequency in crosswords. Here's a table showing words with large absolute differences in relative frequencies.

A negative value in the "difference in relative frequency" column means the word is relatively more common in English than it is in the crossword. Right now, all of the words in the table have a negative difference.

These are grammatically useful words like articles, conjunctions, and prepositions. Makes sense. They're words that show up all over the place in normal English, but don't make very good crossword answers. Crosswords don't usually use linking words the same way English does, and anyway a lot of these would be really hard to clue.

To find crosswordese, what we care about are words that have a positive difference, which you can see by checking the "only positive values" box and reloading the data. A group of likely crosswordese suspects pops right up: ERE, ORE, ERIE, ALOE... all three or four letters long and vowel-heavy. If you regularly do the NYT crossword, you'll have seen these answers every week.
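To make the arithmetic concrete, here's a minimal Python sketch of relative frequency difference. The function names and the tiny token lists are made up for illustration; this isn't necessarily the code behind the table.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def relative_frequency_difference(target_tokens, reference_tokens):
    """Return {word: target rel. frequency minus reference rel. frequency}."""
    target = relative_frequencies(target_tokens)
    reference = relative_frequencies(reference_tokens)
    words = set(target) | set(reference)
    # A word missing from one corpus has a relative frequency of 0 there.
    return {w: target.get(w, 0.0) - reference.get(w, 0.0) for w in words}


# Toy data, invented for illustration only.
crossword = ["ERA", "ORE", "ERA", "ALOE"]
english = ["THE", "ERA", "OF", "THE", "THE", "AND", "THE", "OF"]

diffs = relative_frequency_difference(crossword, english)

# Largest absolute differences first, like the table's default view.
for word, diff in sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{word:5s} {diff:+.3f}")
```

In this toy data THE comes out negative (more common in the reference corpus) and ERA comes out positive (more common in the crossword), which is the same split described above.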

(A note about the reference corpus: I'm using a frequency list of English words from Wikipedia articles scraped in 2019. It'll produce different results than, say, a corpus of newspaper articles or of science fiction novels.)

(Also, as an aside, you might think words like ERA that show up semi-medium-often-ish in normal English -- and possibly even more in a reference corpus of Wikipedia articles -- would have a lower relative frequency difference. So why is ERA topping the list? It's because word frequency falls off nonlinearly with rank. The most frequent words are disproportionately more common than less frequent words. If ERA is the most common crossword answer but only, say, the 1,000th most common English word, its crossword relative frequency will dwarf its English relative frequency. Check out the post on Zipf's law for more!)
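Here's a quick sketch of that nonlinearity, assuming an idealized Zipf distribution where a word's frequency is proportional to 1 / rank. The function name, the exponent of 1, and the 50,000-word vocabulary are all assumptions for illustration; real corpora only follow this pattern roughly.

```python
def zipf_relative_frequency(rank, vocab_size, exponent=1.0):
    """Relative frequency of the word at `rank` under an idealized Zipf law."""
    # The normalizer makes the relative frequencies of all ranks sum to 1.
    normalizer = sum(1 / r**exponent for r in range(1, vocab_size + 1))
    return (1 / rank**exponent) / normalizer


vocab_size = 50_000  # hypothetical vocabulary size
top = zipf_relative_frequency(1, vocab_size)
thousandth = zipf_relative_frequency(1000, vocab_size)

print(f"rank 1:    {top:.6f}")
print(f"rank 1000: {thousandth:.6f}")
print(f"ratio:     {top / thousandth:.0f}x")
```

Under that assumption the top-ranked word is 1,000 times as frequent as the 1,000th-ranked word, whatever the vocabulary size.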

Keyness

One thing to be aware of is that relative frequency difference favors high-frequency words. Imagine that some word X makes up 6% of the words in corpus A and 3% of the words in corpus B. Some other word Y makes up 2% of the words in corpus A and 1% of the words in corpus B.

metric	word X	word Y
rel. frequency in corpus A	6%	2%
rel. frequency in corpus B	3%	1%
rel. frequency difference	6% - 3% = 3 points	2% - 1% = 1 point
rel. frequency ratio	6% / 3% = 2x	2% / 1% = 2x

Both word X and word Y are exactly twice as common in corpus A as in corpus B, but their relative frequency differences diverge. Even though the ratio of frequencies is the same, the difference is greater for the high-frequency word. That's why relative frequency difference does a good job of picking out the most common crosswordese answers.

On the other hand, sometimes we want to see past those high-frequency words. Nouns and adjectives are almost always less common than articles and conjunctions, but they tell us a lot about what differentiates one corpus from another. As a result, metrics like relative frequency ratio are often used to help get a sense of which words are "key" to understanding what a corpus is about.

The concept of "keyness" combines effect-size measures with statistical measures. Effect-size measures tell us how big the difference is in a word's frequency between two corpora. Relative frequency ratio is an effect-size measure. Statistical measures tell us how reliable the difference is. The chi-squared test, for example, is a statistical measure.

Putting it all together, you can use a statistical metric to select only statistically significant results, and then sort them based on an effect-size metric. Here are the "key answers" from the crossword corpus. Answers in this table aren't necessarily common, but they are very unusual compared to English.

Here I'm using "bayes factor" and "log likelihood G2" as statistical metrics. A strong Bayes factor (or BIC) is around 6, and 10 is overwhelming. A strong G² value is around 19, and a value of 23 is overwhelming. We're filtering out anything with a Bayes factor of 2 or less. For effect size, I'm using the base-two logarithm of the relative frequency ratio, or log ratio. For every point of log ratio, a word appears (relatively) twice as many times in one corpus or the other. A log ratio of +14 means a word is about 16,000 times as common in the crossword as in English (2^14 = 16,384). If you switch over to using "relative frequency difference" for the effect-size metric, you should see the same high-frequency crosswordese answers as before.
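The filter-then-sort recipe can be sketched in Python like this. I'm assuming the two-cell log-likelihood formula and the BIC approximation (G² minus the log of the combined corpus size, for one degree of freedom) as I understand them from the UCREL wizard, plus the common trick of swapping a zero count for 0.5 before taking the log ratio. The function names and all of the counts are hypothetical, and none of this is guaranteed to match the code behind the table.

```python
import math


def log_likelihood_g2(count_a, total_a, count_b, total_b):
    """Two-cell G2: 2 * sum(observed * ln(observed / expected))."""
    pooled_rate = (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    for observed, total in ((count_a, total_a), (count_b, total_b)):
        expected = total * pooled_rate
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def bayes_factor_bic(g2, total_a, total_b, dof=1):
    """BIC approximation (assumed): G2 - dof * ln(combined corpus size)."""
    return g2 - dof * math.log(total_a + total_b)


def log_ratio(count_a, total_a, count_b, total_b, zero_adjustment=0.5):
    """Base-two log of the relative frequency ratio, corpus A over corpus B."""
    count_a = count_a or zero_adjustment  # assumed handling of zero counts
    count_b = count_b or zero_adjustment
    return math.log2((count_a / total_a) / (count_b / total_b))


def key_answers(counts_a, total_a, counts_b, total_b, min_bic=2.0):
    """Keep answers with BIC above min_bic, sorted by log ratio (largest first)."""
    rows = []
    for word, count_a in counts_a.items():
        count_b = counts_b.get(word, 0)
        g2 = log_likelihood_g2(count_a, total_a, count_b, total_b)
        bic = bayes_factor_bic(g2, total_a, total_b)
        if bic > min_bic:
            rows.append((word, log_ratio(count_a, total_a, count_b, total_b), g2, bic))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical counts, invented for illustration only.
crossword_counts = {"ABOIL": 40, "ERA": 600, "TREE": 100}
english_counts = {"ERA": 3000, "TREE": 1000}
crossword_total = 1_000_000
english_total = 10_000_000

results = key_answers(crossword_counts, crossword_total, english_counts, english_total)
for word, ratio, g2, bic in results:
    print(f"{word:6s} log ratio {ratio:+.2f}  G2 {g2:.1f}  BIC {bic:.1f}")
```

With these made-up numbers, TREE has the same relative frequency in both corpora, so it's filtered out. ABOIL never appears in the reference counts, so it lands above ERA even though ERA is far more frequent.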

Using log ratio/keyness, what comes out is a very different list of words. In fact, they aren't really English words at all -- not a single one appears in the Wikipedia reference corpus. Most of these key answers are actually phrases like EATAT and ORNOT. A handful like ENURE and EMEER exploit variant spellings. Several tack bizarre prefixes or suffixes onto normal words (ABOIL, EELED, ATILT), while others remove parts of otherwise conventional words (ACERB, -EROO). Many combine two or more such cheats, like the abbreviated phrase SDAK (for South Dakota) and the foreign-language phrase ADUE. Even if you increase the "minimum english corpus frequency" a little, so that the answers have at least shown up in the reference corpus, you'll still find a lot of phrases (ATEASE), variants (ODIST, someone who writes odes), and jargon (ALIENEE).

Looking back, we've identified two different categories of answer. Relative frequency difference showed us repetitive crosswordese. I think log ratio/keyness is showing us some of the more egregious bad fill. Answers where you go "that's not a word!" or "EVENER!? Come on!". These, to me, are some of the true feel-bad answers.

Cluing key answers

One of the main differences between crosswordese and bad fill is how they're clued. Because they tend to be used more often in English, crosswordese answers can be clued in a bunch of creative ways. ALA can mean "in the style of", it can be short for Alabama, or it can fit into a host of everyday phrases (a la mode, a la carte...). It seems like key answers, on the other hand...well, check out the clues for ALIENEE.

Pretty much word-for-word the same since its first appearance in 1996. Try looking up ELEE or RETAG.

Sometimes repetitive cluing can be fun. It's satisfying to look at a clue and jump to conclusions: "'Church part' -- okay, it's gotta be NAVE or APSE!". But maybe when cluing gets really restrictive, it's a sign of bad fill.

Keyness over time

There's one last sort of neat thing we can do with keyness. By analyzing each year's puzzles against the whole crossword corpus, we can look at which answers were key in a given year.

Notice that the statistical metrics are a lot lower, which makes sense because the sample sizes are a lot smaller. So it's not exactly a statistical fact that ANAGRAMS were super trendy in 2014 or that 2016 got really into ROMCOMs. But it's a fun way to look at cultural trends and to track the spread of neologisms. For more, check out the post on language change!
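The year-against-whole-corpus comparison could be sketched like this in Python. It only computes the effect size (log ratio); the function name and the two tiny "years" of answers are invented for illustration, and a real version would also apply a statistical filter.

```python
import math
from collections import Counter


def yearly_log_ratios(answers_by_year):
    """Return {year: {answer: log2 ratio of that year vs. the whole corpus}}."""
    whole = Counter()
    for answers in answers_by_year.values():
        whole.update(answers)
    whole_total = sum(whole.values())

    result = {}
    for year, answers in answers_by_year.items():
        year_counts = Counter(answers)
        year_total = len(answers)
        # The whole corpus includes this year, so whole[answer] is never zero.
        result[year] = {
            answer: math.log2((count / year_total) / (whole[answer] / whole_total))
            for answer, count in year_counts.items()
        }
    return result


# Toy data, invented for illustration only.
toy_corpus = {
    2014: ["ANAGRAMS", "ERA", "ANAGRAMS", "ORE"],
    2016: ["ROMCOM", "ERA", "ORE", "ERA"],
}

ratios = yearly_log_ratios(toy_corpus)
for year, scores in ratios.items():
    top_answer = max(scores, key=scores.get)
    print(year, top_answer, f"{scores[top_answer]:+.2f}")
```

Because each year is part of the whole corpus it's compared against, a year's log ratios stay fairly small compared to the crossword-versus-English numbers.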

The culprit

At this point you might be wondering why we have to put up with repetitive crosswordese and disappointing bad fill. It turns out that the way answers interlock on the grid makes certain letters and words extra valuable for construction. That's why many crosswordese answers start and end with vowels. It's also where those weirdly prefixed bad fill answers like ANEAR come from. For more, check out the post on letter distributions!

Caveats

One thing I think would be pretty cool but haven't done yet is a "sameness" analysis. You can look at log ratios close to zero to find words with similar relative frequencies in crosswords and English. It's possible things like semantic specificity would come into play there?
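A sameness check might look something like the sketch below. The function name, the 0.1 tolerance, and the counts are all hypothetical, and it only considers words that appear in both corpora.

```python
import math


def similar_words(counts_a, total_a, counts_b, total_b, tolerance=0.1):
    """Words in both corpora whose absolute log2 ratio is within `tolerance`."""
    scored = {}
    for word in counts_a.keys() & counts_b.keys():
        rel_a = counts_a[word] / total_a
        rel_b = counts_b[word] / total_b
        ratio = math.log2(rel_a / rel_b)
        if abs(ratio) <= tolerance:
            scored[word] = ratio
    # Closest to zero first.
    return sorted(scored, key=lambda word: abs(scored[word]))


# Hypothetical counts, invented for illustration only.
crossword_counts = {"TREE": 10, "ERA": 80}
english_counts = {"TREE": 100, "ERA": 50}

same = similar_words(crossword_counts, 1_000, english_counts, 10_000)
print(same)
```

Here TREE makes the cut because it has the same relative frequency in both toy corpora, while ERA doesn't because it's 16 times as common in the crossword counts.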

Another thing I'd really love to do with keyness is to factor in part of speech. I wonder if certain answers like EASE and ECHO get clued more often in their verb or noun form, and how that compares to English!

References

[1] Gabrielatos, Costas. "Chapter 12: Keyness Analysis: Nature, Metrics and Techniques." (2017).

[2] Taylor, C., and A. Marchi. "Corpus Approaches to Discourse: A Critical Review." Taylor & Francis, 2018.