Wordle opening strategy
Wordle background

Wordle is a popular free online word game, where the goal is to guess a five-letter word. It's essentially the 1970s board game Mastermind, but with words.

With each guess, you get feedback on which letters in your guess are in the answer in the same position, in the answer in a different position, or not in the answer at all. Using this feedback you can make better guesses, and you have six tries to reach the right answer.

For example, if the answer was "cigar" and I guessed "climb", then I get feedback that "c" is in the correct position, "i" is in the wrong position, and "l", "m", and "b" are not in the answer.
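
To make the feedback rules concrete, here's a minimal sketch of the scoring logic. This is my own illustration rather than Wordle's actual implementation, and it ignores the repeated-letter edge cases that the real game handles more carefully.

def feedback(guess, answer):
    # Simplified Wordle feedback; repeated letters aren't handled
    # exactly as in the real game
    result = []
    for g, a in zip(guess, answer):
        if g == a:
            result.append(g + ": correct position")
        elif g in answer:
            result.append(g + ": wrong position")
        else:
            result.append(g + ": not in answer")
    return result

feedback("climb", "cigar")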

The goal of the opening couple of guesses is to get as much feedback as possible about which letters are in the solution in order to make more educated guesses with the remaining four tries.

USA Today suggests using "adieu" as an opener, since it contains four vowels. The Sun, meanwhile, suggests a two-vowel opener followed by a word with popular consonants like "t" or "s". These are intuitively reasonable ideas, but neither newspaper seems to have performed any analysis to come up with optimal words.

Letter frequency of all words

An important concept in determining strategy for word games like this is letter frequency. That is, some letters crop up in English more often than others; for example, "e" occurs much more often than "z". We can get the counts of each letter from a list of English words. This analysis uses Mieliestronk's list of 58,000 words. It uses British spellings rather than American ones - my gut feeling is that this means slightly more "s"s and "u"s and slightly fewer "z"s, but it's representative enough for this analysis.
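
As a toy illustration of letter frequency before running the full analysis, Python's built-in collections.Counter can count letters in any string. The phrase here is just an arbitrary example, not the real word list.

from collections import Counter
# Count the letters in a short phrase and show the five most common
Counter("wordle is essentially mastermind with words".replace(" ", "")).most_common(5)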

The easiest way to retrieve this list of words is with the read_csv() function from pandas. This gives us a one-column DataFrame, which we can convert into a list. Although read_csv() can pull the data directly from the URL, I've downloaded the file locally to save repeatedly accessing the webpage.

We need the arguments header=None, so it doesn't treat the first word as a header line, and keep_default_na=False so it doesn't treat the word "null" as a missing value.

import pandas as pd
# Read the word list as a one-column DataFrame, then convert column 0 to a list
words = pd.read_csv(
    "corncob_lowercase.txt",
    header=None,
    keep_default_na=False
)[0].to_list()
words[0:5]

The next step is to convert each word into a list of letters. Here I've used a list comprehension.

letters_by_word = [list(word) for word in words]
letters_by_word[0:5]

This list of lists is awkward to work with, so we need to flatten it into a single list.

import itertools
letters = list(itertools.chain(*letters_by_word))
letters[0:10]
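
As an aside, itertools.chain.from_iterable(letters_by_word) is an equivalent spelling that avoids unpacking the whole list as function arguments.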

Next, we need to get counts of each letter and visualize them. It's easiest to do this by putting the letters back into a data frame. There's a bit of messing about to make the letters a categorical variable ordered by frequency: the categories are supplied in reverse order so that, after the coordinate flip in the plot, the most common letter appears at the top.

letter_counts = (
    pd.DataFrame({"letter": letters})
    .value_counts()
    .reset_index()
    .rename(columns={0: "n"})
)
letter_counts["letter"] = pd.Categorical(
    letter_counts["letter"],
    letter_counts["letter"][::-1]
)
letter_counts.head()

OK, now we're ready to visualize the letter counts. A bar plot is great for this.

import plotnine as p9
bar_letter_counts_all_words = p9.ggplot(letter_counts, p9.aes(x="letter", y="n")) + \
    p9.geom_col() + \
    p9.coord_flip() + \
    p9.labs(x=None, y=None)
bar_letter_counts_all_words.draw();

For two opening words, we need ten letters. The top ten most common letters are "e", "i", "s", "a", "r", "n", "t", "o", "l", and "c". So we just need to find two words that contain these ten letters between them: "close" and "train", for example.
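
As a quick sanity check, here's a sketch of my own that searches for such pairs directly. It assumes the words list loaded earlier, and looks for five-letter words with five distinct letters drawn from the top ten, paired so that their letter sets don't overlap.

# Find pairs of five-letter words that together cover the
# ten most common letters
top10 = set("eisarntolc")
candidates = [
    w for w in words
    if len(w) == 5 and len(set(w)) == 5 and set(w) <= top10
]
pairs = [
    (a, b)
    for i, a in enumerate(candidates)
    for b in candidates[i + 1:]
    if set(a).isdisjoint(b)
]
pairs[:5]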

Limiting to five-letter words

Of course, that analysis is based on every word in the English language, regardless of length. We can get a better answer by limiting to five-letter words. For example, common suffixes like "ing" or "tion" appear mostly in longer words and can't fit in many five-letter ones, so they will have skewed our previous analysis.

Let's rerun it on the five letter words. To save going through everything step by step again, let's wrap the code into functions.

def get_letter_counts(letters_by_word):
    letters = list(itertools.chain(*letters_by_word))
    letter_counts = (
        pd.DataFrame({"letter": letters})
        .value_counts()
        .reset_index()
        .rename(columns={0: "n"})
    )
    letter_counts["letter"] = pd.Categorical(
        letter_counts["letter"],
        letter_counts["letter"][::-1]
    )
    return letter_counts

def plot_letter_counts(letter_counts):
    return p9.ggplot(letter_counts, p9.aes(x="letter", y="n")) + \
        p9.geom_col() + \
        p9.coord_flip() + \
        p9.labs(x=None, y=None)

letters_by_word5 = [x for x in letters_by_word if len(x) == 5]
letter_counts5 = get_letter_counts(letters_by_word5)
bar_letter_counts5 = plot_letter_counts(letter_counts5)
bar_letter_counts5.draw();