The Experiment: The New York Times Sunday Crossword, the ultimate test of trivia, wordplay, and "lateral thinking."
> THE CONTENDERS
Can GPT-4 with Vision (working from a screenshot of the grid) beat a fairly competent human (me) and a professional solver (my friend Alice)?
> ROUND 1: TRIVIA CLUES (THE EASY PART)
Computers know everything. This should be a slaughter.
Result: Tie. I know how to Google; the AI recalls the same facts from its training data. No advantage either way.
> ROUND 2: THE PUNS (THE HUMAN EDGE)
Crosswords are famous for trickery: "Flower" can clue "something that flows" (RIVER), not a plant.
Correct Answer: HORSE, because horses live in stables. It's a pun.
AI Failure: It took the clue literally and missed the double meaning of "stable."
> ROUND 3: THE LATERAL THINKING
The AI answered FLAG. Flagstone, sure. "Flagbox"? Is that a word? No. The answer turned out to be SOAP: soapstone, soapbox. The clue was "Lead-in to stone OR box."
That was when I realized reading the grid is hard for the model. The AI had hallucinated the clue number; it was actually answering 46-Across.
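For the curious, the mechanical check behind a "lead-in" clue is easy to express in code. Here is a minimal sketch in Python; the tiny inline word set stands in for a real dictionary file (an assumption I'm making to keep it self-contained):

```python
# A tiny inline word set stands in for a real dictionary
# (e.g. /usr/share/dict/words); purely illustrative.
WORDS = {"flagstone", "soapstone", "soapbox", "flagpole", "sandbox"}

def is_lead_in(candidate: str, *suffixes: str) -> bool:
    """True if candidate + every suffix forms a known word."""
    return all(candidate + suffix in WORDS for suffix in suffixes)

print(is_lead_in("flag", "stone", "box"))  # False: "flagbox" isn't a word
print(is_lead_in("soap", "stone", "box"))  # True: soapstone and soapbox
```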
> THE VISION PROBLEM
I fed the AI a screenshot of the grid. It struggled to correlate the numbers with the boxes.
AI: "I see a grid. But I cannot reliably tell which empty white box corresponds to 12 Down vs 13 Down."
I had to type the clues in manually, which defeats the purpose of using the AI for speed.
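For context on exactly what the model failed to do: crossword numbering follows a fixed rule. Scanning the grid in reading order, a white square gets the next number if it starts an across entry, a down entry, or both. A minimal sketch in Python (the grid encoding, with "#" for black squares, is my own convention):

```python
# Standard crossword numbering: scan the grid in reading order and give
# a white square the next number if it starts an across entry (no white
# square to its left, at least one to its right) or a down entry (no
# white square above, at least one below). "#" marks a black square.
def number_grid(grid: list[str]) -> dict[tuple[int, int], int]:
    rows, cols = len(grid), len(grid[0])

    def white(r: int, c: int) -> bool:
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] != "#"

    numbers: dict[tuple[int, int], int] = {}
    n = 0
    for r in range(rows):
        for c in range(cols):
            if not white(r, c):
                continue
            starts_across = not white(r, c - 1) and white(r, c + 1)
            starts_down = not white(r - 1, c) and white(r + 1, c)
            if starts_across or starts_down:
                n += 1
                numbers[(r, c)] = n
    return numbers

# Toy 3x3 grid; a real NYT Sunday grid is 21x21.
print(number_grid(["..#", "...", "#.."]))
# {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 2): 4, (2, 1): 5}
```

A few lines of arithmetic in code, but it demands exactly the kind of precise spatial bookkeeping the vision model couldn't manage from a screenshot.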
> THE VERDICT
If I type the clues into ChatGPT, it solves the puzzle in two minutes. It is a super-genius. It even figures out the puns eventually, if I give it the letter pattern ("_ _ _ S E").
But if I ask it to "Look at the puzzle and solve it," it fails completely. It cannot navigate the spatial geometry of the grid.
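That letter pattern is doing real work: it turns an open-ended guess into a constrained search. A minimal sketch of the same filter in Python; the candidate list is invented purely for illustration:

```python
import re

def pattern_to_regex(hint: str) -> re.Pattern:
    """Turn a hint like '_ _ _ S E' into a regex ('...SE$')."""
    body = "".join("." if ch == "_" else ch for ch in hint.split())
    return re.compile(body + "$")

pattern = pattern_to_regex("_ _ _ S E")
candidates = ["HORSE", "RIVER", "MOUSE", "STABLE"]  # invented examples
print([word for word in candidates if pattern.match(word)])
# ['HORSE', 'MOUSE']
```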
> CONCLUSION
Crosswords are safe... for now. The joy of a crossword is the "Aha!" moment. The AI doesn't have "Aha!" moments; it just runs a probability search. It's like playing chess against an engine. You can do it, but why? The struggle is the point.