It demonstrated that there was a hard limit on the complexity of a puzzle that LLMs could solve no matter how many tokens they threw at it (using a form of puzzle construction that it ensured that the LLM couldn't just refer to its training data to solve it).