Transformers can learn highly complex behaviors, but as a community, our understanding of how their internal computations give rise to these behaviors remains poor. Mechanistic interpretability focuses on building and testing explanatory models of how these internals operate. This project aims to contribute to this growing body of work, with a particular emphasis on how internal search and goal representations are processed within transformer models (and whether they exist at all!).
In particular, we take inspiration from existing mechanistic interpretability agendas and work with toy transformer models trained to solve mazes. Robustly solving mazes is a task that we believe requires some kind of internal search process, and it gives us a lot of flexibility for exploring how distributional shifts affect performance; we believe that both understanding search and learning to control mesa-optimizers are important for the safety of AI systems. By focusing on mazes, we can have either explicit or implied goals, which is useful if we wish to find internal goal representations. Because the task is so simple, it is also much easier to write evals for maze solving than for, say, playing chess. We believe working with smaller models simplifies the research process and makes it easier to find structures in the model, but we acknowledge that applying methods created for toy models to LLMs is challenging.
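To illustrate why maze evals are straightforward, here is a minimal sketch of a validity check for a predicted path. The maze representation (walls as a set of adjacent-cell pairs) and the function name are hypothetical, chosen only for illustration, and are not the representation used by our codebase:

```python
from typing import List, Set, Tuple, FrozenSet

Coord = Tuple[int, int]


def is_valid_solution(
    path: List[Coord],
    walls: Set[FrozenSet[Coord]],  # each wall blocks a pair of adjacent cells
    start: Coord,
    goal: Coord,
) -> bool:
    """Check that a predicted path solves the maze.

    A path is valid if it starts at `start`, ends at `goal`, and every step
    moves to an orthogonally adjacent cell without crossing a wall.
    """
    if not path or path[0] != start or path[-1] != goal:
        return False
    for a, b in zip(path, path[1:]):
        # each step must move to an orthogonally adjacent cell
        if abs(a[0] - b[0]) + abs(a[1] - b[1]) != 1:
            return False
        # and must not pass through a wall
        if frozenset((a, b)) in walls:
            return False
    return True
```

An eval can then simply report the fraction of held-out mazes for which the model's decoded path passes this check, which is far easier to specify than, say, scoring chess play.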
You can check out our website, and the details below, to learn more about our research efforts to date.
We highly recommend reading our previous work to better understand the nature of the proposed experiments:
maze-dataset: the library that we will use extensively (a brief usage sketch is included at the end of this section).

None of these projects are set in stone; if you have ideas for other projects or better experiments, please share them! If a particular project is of interest, please say so, but if you are admitted you will have the choice of which project to work on.
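For reference, here is a minimal sketch of generating a small dataset with maze-dataset, assuming the configuration API shown in its documentation (MazeDatasetConfig, MazeDataset.from_config, and the gen_dfs generator); parameter names and defaults may differ between versions:

```python
# Minimal sketch, assuming the maze-dataset API as shown in its documentation;
# parameter names and defaults may differ between versions.
from maze_dataset import MazeDataset, MazeDatasetConfig
from maze_dataset.generation import LatticeMazeGenerators

cfg = MazeDatasetConfig(
    name="demo",                              # identifier for this dataset
    grid_n=5,                                 # mazes on a 5x5 lattice
    n_mazes=100,                              # number of mazes to generate
    maze_ctor=LatticeMazeGenerators.gen_dfs,  # randomized depth-first-search generation
)

# Generate (or load from cache) the dataset of solved mazes.
dataset = MazeDataset.from_config(cfg)
```

The generated mazes can then be tokenized into sequences for training the toy transformer models described above.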