Understanding Search and Goal Representations in Transformers

Summary

Transformers can learn highly complex behaviors, but as a community, our understanding of how their internal computations give rise to these remains poor. Mechanistic interpretability focuses on building and testing explanatory models of how these internals operate. This project aims to contribute to this growing body of work, with a particular emphasis on how internal search and goal representations are processed within transformer models (and whether they exist at all!).

In particular, we take inspiration from existing mechanistic interpretability agendas and work with toy transformer models trained to solve mazes. Robustly solving mazes is a task we believe requires some kind of internal search process and gives us a lot of flexibility when it comes to exploring how distributional shifts affect performance — we believe that both understanding search and learning to control mesa-optimizers are important for the safety of AI systems. By focusing on mazes, we can have either explicit or implied goals, which is useful if we wish to find internal goal representations. Given how simple this task is, it’s also much easier to write evals for solving mazes (as opposed to, say, playing chess). We believe working with smaller models simplifies the research process and makes it easier to find structures in the model, but we acknowledge that applying methods created for toy models to LLMs is challenging.

You can checkout our website, and the details below, to learn more about our research efforts to date.

Project Proposal

Theory of Change

Our primary objective is to understand how transformers make decisions when trying to accomplish specific objectives.
- A particular process that we think is relevant to decision processes is how transformers use search to determine the tokens they predict. In particular, do transformers evaluate multiple future states when computing their distribution over the next token, and if so, how?
- We hope that this tells us something about the implicit goals transformers hold.
With knowledge of how transformers make various decisions (or represent their goals in a certain way), we aim to tackle various sub-questions, like whether we can retarget these decisions.
Ultimately, we hope that our research will help create systems that can detect when Transformer-based AIs have undesirable goals, and which can modify the goals of such AIs.

Previous work

We highly recommend reading our previous work to better understand the nature of the proposed experiments:

Post: Understanding mesa-optimization using toy models
- Research intuitions and a few proposed experiments
Preprint: **A Configurable Library for Generating and Manipulating Maze Datasets** (to be submitted)
- On our maze-dataset library that we will extensively use.
Paper: Linearly Structured World Representations in Maze-Solving Transformers (Accepted at UniReps workshop NeurIPS 2023, published soon)
- First results on maze-transformer performance, emerging world models, embedding-space analysis, etc.

Project Directions