My code can be found here and a blog post discussing the project in further detail is here.
This analysis is aimed at people who have done some crossword puzzles but want to take them more seriously and improve their game. Consistent practice is an obvious way to improve, but an analysis of a dataset containing all the clues and answers from NYT puzzles from 1993 to 2021 reveals trends in the words that commonly appear, so players can study them away from the puzzles and improve faster.
I begin with some descriptive analysis of the dataset in Python, including examining missing values and some feature engineering involving word length, day of the week, and the vowel ratio of each word. After briefly analyzing the most common clues, I make a list of the most commonly featured words in these puzzles. This list isn’t a great study tool, though, because many of these words are already known to the average English speaker.
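To make the feature engineering concrete, here is a minimal sketch in plain Python. It is not the project's actual code: the field names (`date`, `clue`, `answer`), the helper `add_features`, and the four toy rows are all assumptions made up for illustration, and the real analysis may well use a dataframe library instead.

```python
from collections import Counter
from datetime import date

# Toy rows standing in for the real dataset (hypothetical values and field names).
rows = [
    {"date": "2021-01-04", "clue": "Sandwich cookie", "answer": "OREO"},
    {"date": "2021-01-05", "clue": "Volcano in Sicily", "answer": "ETNA"},
    {"date": "2021-01-06", "clue": "Twist-apart cookie", "answer": "OREO"},
    {"date": "2021-01-07", "clue": None, "answer": "AREA"},  # a missing clue
]

VOWELS = set("AEIOU")
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def add_features(row):
    """Return a copy of the row with length, weekday, and vowel ratio added."""
    answer = row["answer"]
    puzzle_day = date.fromisoformat(row["date"])
    return {
        **row,
        "length": len(answer),
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        "weekday": DAY_NAMES[puzzle_day.weekday()],
        "vowel_ratio": sum(ch in VOWELS for ch in answer) / len(answer),
    }


featured = [add_features(r) for r in rows]

# Count rows whose clue is missing.
missing_clues = sum(r["clue"] is None for r in featured)

# Tally how often each answer appears across all puzzles.
answer_counts = Counter(r["answer"] for r in featured)
```

Calling `answer_counts.most_common(n)` then gives the `n` most frequent answers, which corresponds to the list of most commonly featured words described above.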
So I cross-reference this list with a second list of 5,050 of the most common words currently used in English. By eliminating these words from the puzzle list, I create a valuable study guide: a list of words that are common in the NYT Crossword but perhaps not known to the average person starting out.
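The cross-referencing step amounts to filtering one list against another. The sketch below shows one way to do it, assuming answers are stored in uppercase and the common-word list in lowercase. The function name `study_list`, the counts, and the tiny word set are hypothetical stand-ins, not the project's data.

```python
from collections import Counter

# Hypothetical answer counts; the real ones come from the full puzzle dataset.
puzzle_counts = Counter({"AREA": 5, "OREO": 4, "ETNA": 3, "ERA": 3, "ALOE": 2})

# Tiny stand-in for the 5,050-word list of common English words.
common_english = {"area", "era", "time", "people"}


def study_list(counts, known_words, top_n=None):
    """Return (word, count) pairs, most frequent first, excluding known words."""
    # Normalize case so "area" in the word list matches "AREA" in the puzzle data.
    known = {w.upper() for w in known_words}
    filtered = [
        (word, n) for word, n in counts.most_common()
        if word.upper() not in known
    ]
    # Slice after filtering so top_n counts only the words that survive.
    return filtered[:top_n]


guide = study_list(puzzle_counts, common_english)
```

Filtering before truncating matters here: taking the top `n` answers first and then removing the common words would leave a study list shorter than `n`.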