Crossword Corpus

Letter Boxes

This post is part of a series on crossword puzzles, data, and language.

Common letters are really important for making crossword puzzles fit together. But not just when you're finishing off a corner of the grid. From the very first clue you write, the letters you choose are secretly pushing your fill into predictable patterns.

Vowel placement

If you take a look at the most common crossword answers, it's obvious that vowels are pretty significant. All of the most frequently repeated answers are chock full of As, Es, Os, and Is. (Throughout this post, I'm going to be ignoring Y. I want to look at vowels as a factor in answer repetition, and Y doesn't really show up in the most common answers.) Here's a graph showing vowel usage, ignoring Y and controlling for length (which we know affects frequency).

This is a histogram showing how many answers have x number of vowels and y frequency. The highest-frequency answers usually have one or two more vowels than is typical for their length.

What actually makes these high-frequency answers so...voweluable? It has to do with how the grid works, specifically the "interlocking" stipulation. Some crossword traditions, namely the British style, encourage more spaced-out block squares. What you get is a grid where about half the letters in each answer are uncrossed by another answer, because they're sandwiched between two block squares. A seven-letter answer will only have three or four crosses. If you've filled in all four crosses and still can't guess the answer, well, you're out of luck. American-style crosswords, on the other hand, generally stipulate that every letter must be crossed. That way, you have to be stumped by two answers, and intersecting ones at that, before you're really stuck.

Imagine you're writing a puzzle, and you need to choose an answer for 1-Across, the top left of the grid. Maybe you choose EXHORT. In a British-style crossword, you'd block off the squares under the second, fourth, and sixth letters. Now you just have to cross E, H, and R. This is great news because you were really dreading coming up with a word that started with X!

In an American-style puzzle, the grid gives you fewer places to hide inconvenient letters. You're stuck using every V, Q, and J. Vowels are especially bad. You need them all over the place to form actual words, but often they make for inconvenient crosses. Take a look at the distribution of vowels in crossword answers.



Each data point is a vowel distribution. For example, --V represents words that have two consonants followed by a vowel, like PHO. The x-axis shows what fraction of possible three-letter answers have a --V distribution. In other words, it says how many --V words there are in the dictionary. The y-axis shows what fraction of actual answers had a --V distribution. That is, it says how often --V words showed up in the puzzle (it's our old friend types vs. tokens).
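If you want to play along at home, here's a minimal sketch of how an answer could be mapped to its vowel distribution. The function name vowel_pattern is my own invention for illustration, and I'm assuming plain A-Z answers with Y counted as a consonant, as everywhere else in this post.

```python
# Y is deliberately left out, matching the convention used in this post.
VOWELS = set("AEIOU")


def vowel_pattern(answer: str) -> str:
    """Map an answer to its vowel distribution, e.g. PHO -> '--V'.

    Hypothetical helper: 'V' marks a vowel, '-' marks anything else.
    """
    return "".join("V" if ch in VOWELS else "-" for ch in answer.upper())


if __name__ == "__main__":
    for word in ["PHO", "ORE", "AREA"]:
        print(word, vowel_pattern(word))
```

Counting how many distinct words share a pattern gives the x-axis (types), and counting how many times those words were used gives the y-axis (tokens).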

Let's say there are 100 three-letter words in the dictionary, and over the course of a year we use 1000 three-letter answers in our puzzles. Each three-letter word would be repeated on average ten times. Now look at -V-. The graph tells us that about a third of the 1000 three-letter puzzle answers have a consonant-vowel-consonant distribution. That's 333. It also says that, coincidentally, about a third of the 100 possible three-letter words to choose from have that same distribution. That's 33. We're choosing 333 answers from 33 possible words. So average repetition among -V- answers is 10, just what we'd expect if letter distribution didn't affect frequency.

But check out V-V. It looks like 18%, or 180, of our 1000 answers have a V-V distribution, but we're only choosing from 8%, or eight, of the 100 words in our dictionary. On average, those eight V-V answers have to get repeated about 22 times, more than double the baseline of ten!

In other words, distributions above the line have higher-than-average repetition. For every answer length, distributions that start and end with vowels are above the line.
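The arithmetic in that toy example is easy to check. Here's a small sketch, assuming the same made-up numbers (100 dictionary words, 1000 answers used); average_repetition is a hypothetical helper, not something from the actual analysis.

```python
def average_repetition(token_share, type_share, total_tokens=1000, total_types=100):
    """Average number of times each word with a given distribution is reused.

    token_share: fraction of puzzle answers with this distribution (y-axis).
    type_share:  fraction of dictionary words with this distribution (x-axis).
    The totals default to the toy numbers from the example in the text.
    """
    tokens = token_share * total_tokens  # answers actually used
    types = type_share * total_types     # distinct words available
    return tokens / types


if __name__ == "__main__":
    # -V- sits on the line: a third of the answers drawn from a third of the words.
    print(round(average_repetition(1 / 3, 1 / 3), 2))
    # V-V sits above the line: 18% of the answers drawn from 8% of the words.
    print(round(average_repetition(0.18, 0.08), 2))
```

Any distribution whose token share is bigger than its type share comes out above the baseline of ten, which is exactly what "above the line" means on the graph.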

What's going on here is that when you're constructing an American-style crossword and you put PAELLA in 1-Across, you've now got to come up with three crosses that start with vowels. Worse yet, when you put in your theme answer THERMOPYLAE a few rows below it, suddenly two downs need to start and end with vowels. But there are really only a handful of words like that to choose from! That's why 18 of the top 20 most common crossword answers start and end with A, E, I, or O. It's supply and demand!

Filling is hard

Here's another way to think about it. Again, you're building a puzzle and choosing an answer for 1-Across. How many vowels will the answer have? It depends on answer length, of course. Say you pick a five-letter answer.

According to the graph above, five-letter answers are most likely to have two vowels. Maybe you pick REORG. You've now committed yourself to finding two down answers that start with vowels. How many possible answers start with a vowel?

This is a graph showing the percent of distinct answers that have a vowel in any given letter position (assuming the answer is long enough). It should come as no surprise that relatively few answers start with a vowel. The E and the O in REORG will probably be tough to cross.
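A per-position vowel share like this one could be computed along the following lines. This is only a sketch under my own assumptions: the input is a plain list of answer strings, duplicates are collapsed so we're counting distinct answers, and vowel_share_by_position is a name I made up for the example.

```python
from collections import Counter

VOWELS = set("AEIOU")  # Y ignored, as in the rest of the post


def vowel_share_by_position(answers):
    """Fraction of distinct answers with a vowel at each letter position.

    Position i only considers answers long enough to have an i-th letter.
    """
    distinct = {a.upper() for a in answers}
    vowel_counts = Counter()
    eligible = Counter()
    for word in distinct:
        for i, ch in enumerate(word):
            eligible[i] += 1
            if ch in VOWELS:
                vowel_counts[i] += 1
    # Positions are contiguous from 0 up to the longest answer's length.
    return [vowel_counts[i] / eligible[i] for i in range(len(eligible))]


if __name__ == "__main__":
    sample = ["REORG", "TENOR", "AREA", "TENOR"]  # tiny made-up sample
    print(vowel_share_by_position(sample))
```

With a real word list in place of the tiny sample, the first entry of the result is the share of answers that start with a vowel.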

Now imagine you're picking another across answer, and you want to put it somewhere below 1-Across. If each answer has about a 60% chance of a vowel in the second position, and we treat the two picks as independent, there's about a 60% * 60% = 36% chance that 1-Across and the new answer both have vowels in the second position. If they do, suddenly your options for 2-Down are looking tough! Let's say you choose TENOR and you try to put it in the second row, just below REORG. Now 2-Down has to start with EE, which doesn't feel good. So maybe you put TENOR lower, in the fourth row. If you're aiming for four-letter downs, that would make 2-Down start and end with Es, which is also hard to fill! So you're left with no choice but to move TENOR up to row three. The two acrosses you've chosen, simply because of their vowel distributions, have locked in the fill and shape of the northwest corner.
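To see the REORG/TENOR squeeze concretely, here's a rough sketch that reads the down columns off a stack of left-aligned across answers. The helper down_patterns is hypothetical, rows that haven't been filled yet are passed as None, and unknown letters show up as "?".

```python
def down_patterns(rows):
    """Read each column top to bottom from left-aligned across answers.

    rows: one entry per grid row, either an answer string or None for a
    row that hasn't been filled yet. Unfilled squares appear as '?'.
    """
    width = min(len(r) for r in rows if r is not None)
    return [
        "".join(r[col] if r is not None else "?" for r in rows)
        for col in range(width)
    ]


if __name__ == "__main__":
    # TENOR directly under REORG: the second column starts with EE.
    print(down_patterns(["REORG", "TENOR"]))
    # TENOR in row four: a four-letter down in that column is E??E.
    print(down_patterns(["REORG", None, None, "TENOR"]))
```

The second column is the one to watch: it comes out as EE in the first case and E??E in the second, the two awkward shapes described above.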

The grid speaks

Don't believe me? Here's a heatmap of where vowels appear in 15x15 puzzles.

There's your northwest corner! We know that about 60% of answers in the dictionary have a vowel in the second letter. And the second letters of 1-Across and 1-Down have the highest vowel frequency in the grid. That's because almost every puzzle has 1-A and 1-D starting in the top left square. Here's a heatmap showing where answers start.

Of course answers start most consistently in the top row and left-most column. Of those, 1-A and 1-D probably get chosen first most of the time, and it affects the starting letters of all the other answers. Here's a table of answers that begin in the Nth square of the north and west edges.

Square 1 is the first square of the north and west edges. That's the top left square of the grid, which is usually the start of 1-A and 1-D. Square 2 means the second squares of the north and west edges. Those are usually the squares that start 2-D and 14-A, respectively.

Looking at the table, 1-A/1-D often start with a consonant followed by a vowel: PAPA, HAHA, CASH, BASH. Those vowels fall in the second edge squares, the ones that usually start 2-D/14-A. As a result, 2-D/14-A (square 2 in the table) almost all start with vowels. And because not many words start with vowels, 2-D/14-A answers are repeated way more often than answers in other grid positions.
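For the curious, a vowel heatmap like the one earlier in this section boils down to a per-square tally across puzzles. Here's a minimal sketch, assuming each solved grid is a list of equal-length strings and that block squares are marked with "#" (that marker, and the name vowel_heatmap, are my assumptions for illustration).

```python
VOWELS = set("AEIOU")  # Y ignored, as in the rest of the post


def vowel_heatmap(grids):
    """Per-square fraction of puzzles that have a vowel in that square.

    grids: list of solved grids with identical dimensions, each a list of
    row strings. Block squares (assumed to be '#') never count as vowels.
    """
    n_rows, n_cols = len(grids[0]), len(grids[0][0])
    counts = [[0] * n_cols for _ in range(n_rows)]
    for grid in grids:
        for r, row in enumerate(grid):
            for c, ch in enumerate(row):
                if ch.upper() in VOWELS:
                    counts[r][c] += 1
    return [[n / len(grids) for n in row] for row in counts]


if __name__ == "__main__":
    # Two tiny made-up 3x3 grids stand in for real 15x15 puzzles.
    grids = [["CAT", "A#E", "BEE"], ["PAL", "O#I", "TEN"]]
    for row in vowel_heatmap(grids):
        print(row)
```

Even in the toy grids, the second squares of the top row and left column light up, which is the same pattern the real heatmap shows in the northwest corner.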

So what can you do to avoid locking up the corners? Well, first of all, you can try to use fewer vowels on the edge of the grid. Second, if possible, try to pick answers that stagger vowel usage so they don't fall into common patterns. Third, consider anchoring your fill in the middle of the grid and putting a block square in the northwest to break the pattern in the corners.

Other valuable letters

Of course, vowels aren't the only letters to see heavy use in the crossword. Letter frequencies in English are well documented. Here's a comparison of letter usage in crosswords to English, where letters above the line are proportionally overused in crosswords.

Notably, E and A are above the line. But H, I, and T are all below. My guess is that extremely common English words like I and THE account for this discrepancy. In general, common crossword answers tend to use common letters. Here's Scrabble score plotted against frequency.

An answer's Scrabble score is a helpful metric because it takes into account answer length and letter "difficulty". We know that both affect answer frequency, and Scrabble score helps show how they interact. Still, it's not the full story. We just saw how hard vowels can be to accommodate in the grid, yet each one is worth only a single Scrabble point. What would a crossword-specific letter scoring system look like?
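For reference, an answer's Scrabble score is just the sum of its tile values. Here's a small sketch using the standard English-language tile values, ignoring blanks and board multipliers; scrabble_score is my own helper name.

```python
# Standard English Scrabble tile values, grouped by point value.
TILE_GROUPS = {
    1: "AEILNORSTU",
    2: "DG",
    3: "BCMP",
    4: "FHVWY",
    5: "K",
    8: "JX",
    10: "QZ",
}
TILE_VALUES = {ch: pts for pts, letters in TILE_GROUPS.items() for ch in letters}


def scrabble_score(answer: str) -> int:
    """Sum of tile values for an answer; non-letters score zero."""
    return sum(TILE_VALUES.get(ch, 0) for ch in answer.upper())


if __name__ == "__main__":
    for word in ["AREA", "EXHORT", "QUIZ"]:
        print(word, scrabble_score(word))
```

Because every A, E, I, O, and U is worth one point, a vowel-heavy answer like AREA scores about as low as a four-letter word can, even though its vowels may be the hardest part to cross.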

Caveats

Unlike most of the other posts in Crossword Corpus, I haven't based the analysis here on any literature to speak of. Intuition and interpretation flow largely unfiltered right from the data. As a result, I've skimped on error propagation and statistical significance checks. I didn't want to leave that grain of salt unmentioned, but hopefully the numbers here were interesting all the same!