Crossword Corpus

Changing Times

This post is part of a series on crossword puzzles, data, and language.

This post is all about how vocabulary can change over the years. Does the crossword puzzle have more variety than it used to? How often do new answers appear? Which answers are gaining popularity and which ones haven't been used for decades?

Words come and go

When new words are minted and enter the language, it usually happens in one of a handful of common ways, including:

Some words arise from a combination or sequence of transformations. If you take the Greek word for something that gets imitated, "mimēma", then rework it to evoke the evolutionary implications of a "gene", you get a "meme". Or think about the word "tase". It's a verb back-formed from "taser", ostensibly a loose acronym of "Tom Swift and His Electric Rifle" shaped by phonetic analogy with "laser".

Word death is a little more nuanced. Some words fade into complete obsolescence -- a modern reader probably wouldn't recognize "ludibrious" as a bygone synonym for "ridiculous". Other words, however, when they fall out of fashion, hang around in our cultural memory. We may not use "forswear" or "jerkin" on a day-to-day basis, but we know what they mean. They're still part of the world we at least occasionally interact with, and in that sense they remain on the periphery of our language. These merely archaic words are the realm of trivia, and they show up all over crossword puzzles. Think about answers like ERE or ANON. Is it fair to assume solvers will still be familiar with them? In that sense, word death is a somewhat more editorial process than word birth. But we can be on the lookout for a few things:

Crossword puzzles experience language change too. There are a few extra reasons why one answer might edge out another, and the whole language is highly curated, which is a little unusual. But we can find examples of almost all of these same mechanisms of word birth and death over our 30 years of NYT crosswords.

Answer births

Many new words and phrases fade away before ever seeing regular usage. A select few, however, take root and eventually enter the language. Here are some of the most recent popular answers to have made their debut in the crossword.

These are the 100 most recent answer "births", showing for each answer: what year it was first clued ("year introduced") and how many times it has ever appeared in an NYT puzzle ("occurrences"). When looking at births, we only care about answers that have actually caught on. One-offs can't really be said to have "entered the language", so we ignore them and focus on answers that have shown up at least a few times. Increasing the "minimum occurrences" will raise this requirement, imposing a stricter definition of what it means to be part of the language.
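For the curious, here's a rough sketch of how a table like that could be built. It assumes a hypothetical clues table (one row per clued answer, with columns answer and date) loaded from a made-up file called nyt_clues.csv -- this isn't the code behind the actual table, just the idea:

```python
import pandas as pd

# Hypothetical corpus: one row per clued answer, with the puzzle's publication date.
clues = pd.read_csv("nyt_clues.csv", parse_dates=["date"])  # columns: answer, date

MIN_OCCURRENCES = 3  # the "minimum occurrences" control

# For each answer: the year it was first clued and its total appearances.
stats = (
    clues.assign(year=clues["date"].dt.year)
         .groupby("answer")
         .agg(year_introduced=("year", "min"),
              occurrences=("year", "size"))
         .reset_index()
)

# Keep only answers that have caught on, then list the most recent debuts.
births = (
    stats[stats["occurrences"] >= MIN_OCCURRENCES]
        .sort_values("year_introduced", ascending=False)
        .head(100)
)
```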

Here's another way to look at the same data. This table picks out, for each year, the ten answers introduced that year that would go on to have the highest all-time occurrences (filtering out answers that have appeared fewer than three times total). Kind of like a list of the "top ten best-selling albums released in 1959".
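The same stats table from the sketch above could be sliced up that way too, something like:

```python
# For each introduction year, the ten answers that went on to appear most often
# (ignoring answers seen fewer than three times in total).
eligible = stats[stats["occurrences"] >= 3]
top_ten_by_year = (
    eligible.sort_values("occurrences", ascending=False)
            .groupby("year_introduced")
            .head(10)
            .sort_values(["year_introduced", "occurrences"],
                         ascending=[True, False])
)
```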

If you're wondering about EMPTYEMPTYEMPTY, there was a puzzle in 2020 whose theme answers were supposed to be left blank. In the actual file for the puzzle, each blank square contained a rebus reading "EMPTY".

Anyway, remember that list of ways words get invented? They're all here:

One thing missing from the list is famous names. It seems like names make up a big part of new crossword answers, but as we'll see they have a tendency to AGEOUT of the puzzle too.

Answer deaths

Just as new words enter a language, old words also die out. Here's a list of once-common answers that disappeared some time ago.

These are the 100 answers that were retired from the puzzle longest ago. Each has to have been used at least the "minimum occurrences" number of times and then never again. The "year disappeared" shows the last time each answer was used in the puzzle, while "year introduced" shows the first.
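A sketch of how those retirements could be found, again using the hypothetical clues table from earlier:

```python
# For each answer: first year, last year, and total appearances.
lifespan = (
    clues.assign(year=clues["date"].dt.year)
         .groupby("answer")
         .agg(year_introduced=("year", "min"),
              year_disappeared=("year", "max"),
              occurrences=("year", "size"))
         .reset_index()
)

# Answers used at least MIN_OCCURRENCES times whose last appearance is furthest
# in the past -- i.e. the answers retired longest ago.
deaths = (
    lifespan[lifespan["occurrences"] >= MIN_OCCURRENCES]
        .sort_values("year_disappeared")
        .head(100)
)
```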

Let's look back at our list of ways words can become obsolete. Most of them are very slow processes that won't really be visible on the mere decades timescale of our data. Still, we can find a few examples!

Cultural decline in the form of obscure trivia makes up a big part of answer death. Obsolescing words like AERODROME can actually stick around for a long time in the puzzle, because esoteric knowledge is part of its appeal. The truly arcane, though, are forgotten by constructors or perhaps discouraged by editors. We haven't seen the pope's ORALE vestment, the MEUSE river, RIVA Ridge the racehorse, young eels called ELVERS, or the BRANT goose in a good long while. Famous names probably disappear for the same reason. Evidently Virna LISI, PERLE Mesta, and the SNEADS (golfers Sam and J.C.) aren't as well-known as they used to be.

One extra category of answer death mirrors a similar mechanism of answer birth: word derivation. It's often tempting when constructing a puzzle to use a normal word with an irregular prefix or suffix tacked on to help fit an awkward section of grid. This practice certainly hasn't gone away, and in fact helps explain some of the keywords that differentiate crossword answers from normal English. But mercifully, it seems like there's also an effort to phase out certain offenders. We haven't had to deal with TOTER, SMILER, or RESAY since 2006.

Try increasing the "minimum occurrences" to check answers that used to be even more popular. Many of them should fall into these categories!

Lifespans

We can compare an answer's birth and death to get its lifespan, the amount of time it saw active use. Here's a 2D histogram that counts (on a log scale) how many answers first appeared in year X and last appeared in year Y.
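Something like this could produce a similar plot, building on the lifespan table from the answer-deaths sketch (the matplotlib calls are real; the data layout is my assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# One bin per year on each axis.
years = np.arange(lifespan["year_introduced"].min(),
                  lifespan["year_disappeared"].max() + 2)

plt.hist2d(lifespan["year_introduced"], lifespan["year_disappeared"],
           bins=[years, years], norm=LogNorm())
plt.xlabel("year introduced")
plt.ylabel("year disappeared")
plt.colorbar(label="number of answers (log scale)")
plt.show()
```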



The left-most column represents answers from the first year of the corpus, 1993. Going up the column, the box at, say, y = 2010 shows how many answers first seen in 1993 were last seen in 2010. One thing to note: if that first column looks a little different from the rest, it's because we only have data from part of that year, so the totals for the whole column are a little low.

The straight line of high values running diagonally across the graph marks the overwhelming number of answers that have only ever appeared once -- they were born and died in the same year. If you increase the "minimum occurrences" to two, filtering out one-off answers, this line should darken significantly. On the other end of the spectrum, the bright spot in the top left indicates a lexicon of useful answers that have been a steady presence in the crossword corpus since its earliest days. You can decrease the "maximum occurrences" to remove some of these most common answers.

Unfortunately, the rest is a little hard to read. To get a better look, we can visualize their lifespan statistics.




These are box-and-whisker plots that roughly represent the distribution of lifespans for answers introduced each year. The "minimum occurrences" is already set to two, so the one-off answers are filtered out. We can use this plot to get a general sense of, for example, which years produced short- and long-lived words. If you increase the "minimum occurrences" to 10 and turn on outliers, you can see some "flash in the pan" answers that were used many times, but over a shorter lifespan than normal. RIAA appeared 12 times between 2011 and 2018, but hasn't been seen since.
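A rough version of that plot, again assuming the lifespan table from earlier:

```python
import matplotlib.pyplot as plt

# Lifespan in years, one-off answers filtered out ("minimum occurrences" of 2).
repeats = lifespan[lifespan["occurrences"] >= 2].copy()
repeats["lifespan"] = repeats["year_disappeared"] - repeats["year_introduced"]

# One box-and-whisker per introduction year; outliers hidden by default.
repeats.boxplot(column="lifespan", by="year_introduced",
                showfliers=False, rot=90)
plt.ylabel("lifespan (years)")
plt.show()
```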

The dictionary

Of course, even seemingly short-lived or obscure answers may still make a comeback. New answers are regularly introduced, but constructors can also dredge up any answer from the past, so in a way solvers need to keep an ever-increasing list of possible answers in mind. This "crossword dictionary" is constantly growing.

This curve is a little reminiscent of Herdan's (or Heaps') law. In a 1960 book, Gustav Herdan described a common pattern that appears when analyzing corpora. Imagine going through our crossword puzzles one by one in chronological order. As you solve one puzzle, then the next, then the next, the number of unique answers you've seen -- the size of the "crossword dictionary" -- goes up. But as you keep solving puzzles, it's more and more likely that you'll come across repeat answers, so the dictionary starts growing more and more slowly. Quantitatively, it says:

V = kN^h

where V is the size of the vocabulary (the number of unique answers in the dictionary), N is the size of the corpus (the total number of answers you've solved), and k and h are experimental constants, where h is usually less than 1.
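Fitting that relationship to the crossword corpus is easy enough to sketch: take logs of both sides so that log V = log k + h · log N, then fit a straight line. Assuming the hypothetical clues table again:

```python
import numpy as np

# Walk the corpus in chronological order: N = answers seen so far,
# V = unique answers seen so far.
ordered = clues.sort_values("date")
N = np.arange(1, len(ordered) + 1)
V = (~ordered["answer"].duplicated()).cumsum().to_numpy()

# V = k * N**h becomes a straight line in log-log space.
h, log_k = np.polyfit(np.log(N), np.log(V), 1)
print(f"k ≈ {np.exp(log_k):.2f}, h ≈ {h:.2f}")
```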

What this tells us is that the dictionary of all crossword answers is growing at a slower and steadier rate as time goes by, but growing nonetheless. So if the number of possible answers is always increasing, has the variety of answers gone up with it? The NYT crossword is a daily publication with regular sizing and a maximum number of answers in any given puzzle, so its vocabulary shouldn't vary too wildly. Here's a graph showing the total number of unique answers used in each individual day/month/year.
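The counts behind that graph could be tallied along these lines (reusing the hypothetical clues table; swap the period alias to change timescales):

```python
import matplotlib.pyplot as plt

# Unique answers used in each calendar year; use "M" or "D" for months or days.
unique_per_period = (
    clues.groupby(clues["date"].dt.to_period("Y"))["answer"]
         .nunique()
)
unique_per_period.plot()
plt.ylabel("unique answers")
plt.show()
```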

[interactive graph; timescale: days / months / years]



The first thing to be aware of here is the y-axis scale. These values aren't too different from one another, so it's a little hard to tell how statistically significant the results are. That said, there really does seem to be a nonrandom correlation in dictionary size year-to-year -- it seems like puzzles from the early 2010s genuinely had a bit more variety. I'm not quite sure how to explain this. It could be a result of which constructors happened to be most active during that era, or perhaps some behind-the-scenes editorial decision. Or there might be external reasons. Studies show that natural languages respond to social, technological, and cultural factors -- they change faster during wartime, for example [2] [3].

One fun crossword artifact: if you look at the "days" time scale and zoom in, you can make out a spike every Sunday when the grid is bigger!

Why do we care about any of this? Well, for one thing, we can find out how often brand-new answers crop up in the puzzle. This is a graph of how many answers first appeared each day/month/year.
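Counting debuts per period is a similar exercise (again just a sketch on the hypothetical clues table):

```python
# Each answer's debut, then the number of debuts in each year.
debuts = clues.sort_values("date").drop_duplicates("answer", keep="first")
debuts_per_year = debuts.groupby(debuts["date"].dt.year).size()

# With a "minimum occurrences" of 2: only count debuts of answers that reappear.
counts = clues["answer"].value_counts()
repeat_debuts = debuts[debuts["answer"].map(counts) >= 2]
repeat_debuts_per_year = repeat_debuts.groupby(repeat_debuts["date"].dt.year).size()
```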


[interactive graph; timescale: days / months / years]



Turns out easily 5-10 answers make their NYT puzzle debut each day! If you increase the "minimum occurrences", you can even see that a couple of those will go on to be used again later on.

Caveats

A note about that last graph, answer births over time. It's a little hard to read because of Herdan's law -- births happen very quickly as you first start looking at puzzles, and then slower as repeats become more common. If we wanted to see past the steep curve and really look at how the birth rate has changed over time, we could use a technique called random shufflings [1]. If we were to shuffle up the puzzles, pretending they were published in a different order than they really were, we'd get a slightly different graph of answer births over time. We could compare our actual curve to the average shuffled curve to see which years really had more or fewer births. If you're interested, you can read more about this analysis here!
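In code, the shuffling idea itself is simple, even if the real analysis takes more care; here's a sketch on the hypothetical clues table:

```python
import numpy as np

rng = np.random.default_rng(0)

def births_curve(puzzle_answer_lists):
    """Never-before-seen answers contributed by each puzzle, in the given order."""
    seen, births = set(), []
    for answers in puzzle_answer_lists:
        new = [a for a in answers if a not in seen]
        births.append(len(new))
        seen.update(new)
    return np.array(births)

# One list of answers per puzzle, in publication order.
puzzles = [list(g) for _, g in clues.sort_values("date").groupby("date")["answer"]]
actual = births_curve(puzzles)

# Average the curve over many random orderings of the same puzzles.
shuffled = [births_curve([puzzles[i] for i in rng.permutation(len(puzzles))])
            for _ in range(100)]
baseline = np.mean(shuffled, axis=0)

# Where `actual` sits above `baseline`, those puzzles introduced more new
# answers than a chance ordering would suggest.
```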

Also, a couple notes about Herdan's law. First, you might think that the power law growth curve of tokens vs. types reflects the power law in Zipf's observation about word frequencies and ranks. In fact the relationship between frequency distribution and Herdan's law is pretty complicated, and the mathematical form of Herdan's law is itself contested [1] [5], so I may have oversimplified a bit here.

Second, it's worth noting that Herdan's law should apply to any subdivision of a corpus [1]. So for example if you grouped puzzles by constructor and graphed each constructor's total tokens vs. types, you should still get a Herdan-like curve.
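If the clues table also carried a (hypothetical) constructor column, that check would look something like this:

```python
import matplotlib.pyplot as plt

# Total answers (tokens) vs. unique answers (types) for each constructor.
by_constructor = clues.groupby("constructor")["answer"].agg(
    tokens="size", types="nunique"
)

plt.loglog(by_constructor["tokens"], by_constructor["types"], ".")
plt.xlabel("total answers (tokens)")
plt.ylabel("unique answers (types)")
plt.show()
```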

In general, there's plenty of other good reading out there about language change. One issue with the crossword corpus is its relatively short time scale. However, some papers are using volatile communities like online forums to explore just such condensed dynamics [4]. For the complete opposite -- a consideration of massive amounts of data over extremely long time scales -- take a look at this paper a bunch of researchers put together when Google Books was first released [6]!

References

[1] Chacoma, A. and Zanette, D. H. 2020. Heaps’ law and Heaps functions in tagged texts: evidences of their linguistic relevance. R. Soc. Open Sci. 7: 200008.

[2] Petersen, A., Tenenbaum, J., Havlin, S. et al. Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death. Sci Rep 2, 313 (2012). https://doi.org/10.1038/srep00313

[3] Piantadosi, Steven T. “Zipf's word frequency law in natural language: a critical review and future directions.” Psychonomic bulletin & review vol. 21,5 (2014): 1112-30. doi:10.3758/s13423-014-0585-6

[4] Altmann EG, Pierrehumbert JB, Motter AE (2011) Niche as a Determinant of Word Fate in Online Groups. PLoS ONE 6(5): e19009.

[5] Font-Clos, Francesc, and Álvaro Corral. "Log-Log Convexity of Type-Token Growth in Zipf’s Systems." Physical Review Letters 114, no. 23 (2015).

[6] Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. "Quantitative Analysis of Culture Using Millions of Digitized Books." Science 331, no. 6014 (2011): 176–182.