Researchers from MIT and Harvard have made significant strides in improving AI agents’ ability to ask insightful questions. In a new study, they demonstrated that AI models can outperform humans in classic games like Battleship and Guess Who? by employing sophisticated questioning strategies.
The study, led by Gabriel Grand, a PhD student at MIT, and Jacob Andreas, an associate professor of electrical engineering and computer science at MIT, focused on enhancing the information-seeking capabilities of language models (LMs). These models, which typically excel at answering complex queries, often struggle with asking informative questions, a crucial skill in high-stakes settings like medical diagnosis and scientific discovery.
Collaborative Battleship: A new approach to testing AI’s questioning skills
The researchers introduced a novel twist to the classic game of Battleship, renaming it Collaborative Battleship. In this version, one participant acts as the captain, asking questions about the location of hidden ships, while the other plays the spotter, responding to these questions in real time. To build a benchmark for comparison, the team first had over 40 humans play the game, collecting their questions and yes-no answers to create the BattleshipQA dataset.
With this dataset in hand, the researchers tested state-of-the-art LMs like GPT-5 and smaller models like Llama 4 Scout on their game. They found that top LMs could beat humans at Battleship, completing the game in fewer turns. However, smaller systems were far less rational, struggling to come up with useful questions.
Monte Carlo inference strategies: A game-changer for AI questioning
To address this issue, the researchers implemented Monte Carlo inference strategies in the LMs. This approach allows the models to reason about potential guesses as individual particles, weighting them more heavily as they appear more valid with each answer from the spotter. By adopting this calculated, adaptive approach, the captain could make inquiries that extracted considerably more information from the spotter.
The results were striking. Llama 4 Scout, a relatively small LM, initially beat humans only 8 percent of the time. However, with refinements to its inference strategy, the model’s win rate soared to 82 percent against humans. Moreover, this careful and efficient style of asking questions enabled the model to outpace a frontier model, GPT-5, while operating at around 1 percent of its cost.
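To make the particle idea in this section concrete, here is a minimal sketch in Python. It is not the researchers' implementation: the 4x4 board, the single two-tile ship, the fixed menu of yes-no questions, and the choice to score each question by the entropy of its predicted answer are all assumptions made for illustration. Each particle is one guess about where the ship is; particles that contradict the spotter's answer lose their weight, and the captain asks whichever question the surviving particles disagree on most.

```python
import math
import random


def all_placements(size=4, ship_len=2):
    """Every horizontal or vertical placement of one ship, as a frozenset of (row, col)."""
    placements = []
    for r in range(size):
        for c in range(size):
            if c + ship_len <= size:
                placements.append(frozenset((r, c + i) for i in range(ship_len)))
            if r + ship_len <= size:
                placements.append(frozenset((r + i, c) for i in range(ship_len)))
    return placements


def make_questions(size=4):
    """A hypothetical fixed menu of yes-no questions, each a predicate over a placement."""
    qs = {}
    for r in range(size):
        qs[f"Is any part of the ship in row {r}?"] = lambda b, r=r: any(x[0] == r for x in b)
    for c in range(size):
        qs[f"Is any part of the ship in column {c}?"] = lambda b, c=c: any(x[1] == c for x in b)
    qs["Is the ship horizontal?"] = lambda b: len({x[0] for x in b}) == 1
    return qs


def binary_entropy(p):
    """Entropy in bits of a yes-no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_info_gain(question, particles, weights):
    """With a noise-free spotter, the expected gain equals the entropy of the answer."""
    total = sum(weights)
    p_yes = sum(w for b, w in zip(particles, weights) if question(b)) / total
    return binary_entropy(p_yes)


def reweight(question, answer, particles, weights):
    """Zero out particles that contradict the spotter's answer, then renormalize."""
    new = [w if question(b) == answer else 0.0 for b, w in zip(particles, weights)]
    total = sum(new)
    if total == 0.0:
        raise ValueError("no particle is consistent with the answer")
    return [w / total for w in new]


def play(truth, particles, questions, max_turns=20):
    """Ask the most informative unasked question until none can split the particles."""
    weights = [1.0 / len(particles)] * len(particles)
    asked = []
    for _ in range(max_turns):
        remaining = {n: q for n, q in questions.items() if n not in asked}
        if not remaining:
            break
        name = max(remaining, key=lambda n: expected_info_gain(remaining[n], particles, weights))
        if expected_info_gain(remaining[name], particles, weights) < 1e-9:
            break  # every surviving particle gives the same answer to everything
        answer = remaining[name](truth)  # the spotter answers from the true board
        weights = reweight(remaining[name], answer, particles, weights)
        asked.append(name)
    best = max(range(len(particles)), key=lambda i: weights[i])
    return particles[best], asked


rng = random.Random(0)
particles = rng.choices(all_placements(), k=200)  # Monte Carlo sample of guesses
truth = particles[0]  # assumption: the true board is among the sampled particles
QUESTIONS = make_questions()
guess, asked = play(truth, particles, QUESTIONS)
print("questions asked:", len(asked))
print("guessed correctly:", guess == truth)
```

On a board this small the hypotheses could simply be enumerated; sampling is shown because the space of full Battleship boards is far too large to list, which is where a particle approximation becomes useful.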
Python to the rescue: Improving AI spotters’ accuracy
The researchers also turned to the widely used programming language Python to bolster AI spotters’ accuracy. By converting each question the captain asked into an encoded command, the spotter LM could search the area in question and assess the game piece’s dimensions. This approach gave the model clear directions in a language it understands particularly well, leading to a significant boost in correct answers.
For instance, the lightweight system GPT-4o-mini saw a nearly 30 percent performance bump, while even the large model Claude 4 Opus jumped about eight points. This improvement in answering questions shrank the gap between humans and LMs, making the AI models more reliable spotters.
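The question-to-code idea described above can be sketched as follows. This is an illustrative assumption, not the study's actual prompt or code format: the board encoding, the `run_spotter` helper, and the three snippets (hand-written here to stand in for what a spotter LM might generate) are all hypothetical.

```python
# A hypothetical 4x4 board: "." is water, "R" is the red ship, "B" is the blue ship.
BOARD = [
    "....",
    ".RR.",
    "B...",
    "B...",
]

# Hand-written stand-ins for code a spotter LM might produce from each question.
# Each snippet defines answer(board) and returns True (yes) or False (no).
IS_RED_HORIZONTAL = '''
def answer(board):
    rows = {r for r, line in enumerate(board) for ch in line if ch == "R"}
    return len(rows) == 1
'''

ANY_SHIP_IN_TOP_HALF = '''
def answer(board):
    top = board[: len(board) // 2]
    return any(ch != "." for line in top for ch in line)
'''

IS_BLUE_LONGER_THAN_TWO = '''
def answer(board):
    return sum(line.count("B") for line in board) > 2
'''


def run_spotter(code, board):
    """Execute a generated snippet and return its yes-no answer for the board.

    Real model-written code should run in a sandbox; plain exec is used here
    only because the snippets above are trusted and hand-written.
    """
    namespace = {}
    exec(code, namespace)
    return bool(namespace["answer"](board))


for label, code in [
    ("Is the red ship horizontal?", IS_RED_HORIZONTAL),
    ("Is any ship in the top half of the board?", ANY_SHIP_IN_TOP_HALF),
    ("Is the blue ship longer than two tiles?", IS_BLUE_LONGER_THAN_TWO),
]:
    print(label, "yes" if run_spotter(code, BOARD) else "no")
```

The appeal of this design is that the answer comes from executing an explicit check against the board rather than from the model's impression of it, so a correct translation of the question yields a correct answer.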
Beyond Battleship: AI agents excel in Guess Who?
The researchers were curious about how their approach would fare in other board games. They tested their newly equipped LMs at Guess Who?, where large and small models skillfully narrowed down 100 options to correctly guess the hidden character. Llama 4 Scout saw a significant improvement, completing the task on over 72 percent of its runs, up from 30 percent. Meanwhile, GPT-4o leapt from 62 percent to 90 percent.
While AI models have made promising progress in both games, there’s still room for improvement. The models struggle with complex questions compared to humans. However, the researchers’ findings show that AI agents have untapped potential in needle-in-a-haystack discovery, navigating a massive space of options to find rare solutions to scientific challenges.
The future of AI agents: Collaborative learning and complex environments
The researchers plan to explore further collaborations between humans and AI models to study whether they work better together. They also aim to fine-tune the models on game simulations and enhance their inference capabilities with more computing power. This will enable LMs to predict how a game will evolve more accurately.
As AI systems become more agentic, the hardest problems turn out to be social ones: tracking common ground, resolving misunderstandings, and adapting to different partners over time. This work elegantly captures these phenomena in a controlled collaborative setting, making a compelling case that the real bottleneck for AI agents isn’t just the calculation of optimal questions but the pragmatic reasoning needed to make the most of their answers.


