GameBench is a benchmark to test how good large language models are at strategic reasoning by having them play games against each other and grading the results using an Elo-like rating system. We implemented 9 different games and tested a variety of models and scaffolding methods, as well as a human baseline.
Check out the project page here, the paper on arXiv, and the code on GitHub.
results
Here are the aggregated strategic reasoning scores for each tested model–scaffolding configuration, computed from match results using the Bradley–Terry model and normalized. The whiskers represent 90% confidence intervals from our bootstrapping process. cot stands for Chain-of-Thought and rap stands for Reasoning-via-Planning.
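For concreteness, here is a minimal sketch of this kind of computation: fitting Bradley–Terry strengths to pairwise match results with the standard minorization–maximization updates, then bootstrapping over matches to get confidence intervals. The data format and function names are illustrative assumptions, not the actual GameBench code, which also aggregates ratings across games.

```python
# Minimal sketch: fit Bradley–Terry strengths from (winner, loser) match
# records via the standard MM updates, then bootstrap over matches for
# confidence intervals. Data format and function names are illustrative,
# not the actual GameBench pipeline.
import random
from collections import defaultdict

def fit_bradley_terry(matches, n_iters=200):
    """matches: list of (winner, loser) pairs. Returns strengths that sum to 1."""
    players = sorted({p for match in matches for p in match})
    wins = defaultdict(float)                  # wins[(a, b)] = times a beat b
    for winner, loser in matches:
        wins[(winner, loser)] += 1.0
    strengths = {p: 1.0 / len(players) for p in players}
    for _ in range(n_iters):
        updated = {}
        for i in players:
            num, den = 0.0, 0.0
            for j in players:
                if j == i:
                    continue
                n_ij = wins[(i, j)] + wins[(j, i)]
                if n_ij == 0:
                    continue                   # this pair never played
                num += wins[(i, j)]
                den += n_ij / (strengths[i] + strengths[j])
            updated[i] = num / den if den > 0 else strengths[i]
        total = sum(updated.values())
        strengths = {p: s / total for p, s in updated.items()}
    return strengths

def bootstrap_intervals(matches, n_boot=1000, seed=0):
    """Resample matches with replacement, refit, and report 5th–95th percentiles."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = [rng.choice(matches) for _ in matches]
        for player, s in fit_bradley_terry(resample).items():
            samples[player].append(s)
    intervals = {}
    for player, vals in samples.items():
        vals.sort()
        lo, hi = vals[int(0.05 * len(vals))], vals[int(0.95 * len(vals)) - 1]
        intervals[player] = (lo, hi)
    return intervals
```

Because Bradley–Terry models the probability that one player beats another as a ratio of their strengths, taking the logarithm of these strengths puts them on an additive scale comparable to Elo ratings.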
my contributions
- Implemented one of the benchmark tasks, the game “Santorini”. See the code on GitHub.
- Conducted a literature review on strategic reasoning in language models and relevant existing benchmarks, and wrote the related works section.
- Conducted a factor analysis on the results to see whether a single g-factor could explain a significant portion of the variance in models’ performance across games (a rough sketch of the idea appears after this list). This analysis did not make it into the final paper.
- Helped figure out a principled way to process match results into single ratings representing models’ overall performance. We ended up using the Bradley–Terry model to convert pairwise comparisons into performance ratings and a bootstrapping process to aggregate them across different games.
- Created the project page. I built it from scratch using the Astro web framework, basing the design on Eliahu Horwitz’s widely used template, and then turned it into a generic template that anyone can use for their own project pages.
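As a rough illustration of the g-factor check mentioned above, the snippet below uses the leading eigenvalue of the game–game correlation matrix as a quick proxy for how much variance a single general factor could capture. The ratings array is random toy data, and the real analysis was a proper factor analysis on the actual per-game ratings.

```python
# Quick g-factor proxy: share of variance on the leading factor of the
# game–game correlation matrix. Toy data only — not the real GameBench ratings.
import numpy as np

def g_factor_share(ratings: np.ndarray) -> float:
    """ratings: (n_models, n_games) array of per-game scores."""
    corr = np.corrcoef(ratings, rowvar=False)   # correlations between games
    eigvals = np.linalg.eigvalsh(corr)          # ascending eigenvalues
    return float(eigvals[-1] / eigvals.sum())

rng = np.random.default_rng(0)
toy_ratings = rng.normal(size=(8, 9))           # 8 toy configurations, 9 games
print(f"variance on first factor: {g_factor_share(toy_ratings):.2f}")
```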