Tom Everitt, Staff Research Scientist, Google DeepMind. Email: tomeveritt at google.com
I'm a Staff Research Scientist at Google DeepMind, leading the Causal Incentives Working Group.
I'm working on AGI Safety, i.e. how we can safely build and use highly intelligent AI. My PhD thesis Towards Safe Artificial General Intelligence is the first PhD thesis specifically devoted to this topic. Since then, I've been building towards a theory of alignment based on Pearlian causality, summarised in this blog post sequence.
Recent papers:
Robust agents learn causal world models.
Jonathan Richens, Tom Everitt.
ICLR (oral), 2024. Honorable mention for the Outstanding Paper Award.
The Reasons that Agents Act: Intention and Instrumental Goals:
Formalises intent in causal models and connects it with a behavioural characterisation that can be applied to LLMs.
Francis Rhys Ward, Matt MacDermott, Francesco Belardinelli, Francesca Toni, Tom Everitt.
AAMAS, 2024
Characterising Decision Theories with Mechanised Causal Graphs:
Shows that mechanised causal graphs can be used to cleanly define different decision theories.
Matt MacDermott, Tom Everitt, Francesco Belardinelli
arXiv, 2023
Honesty Is the Best Policy: Defining and Mitigating AI Deception:
Formal definitions of intent and deception, with graphical criteria; illustrated with RL and language-model experiments.
Francis Rhys Ward, Tom Everitt, Francesco Belardinelli, Francesca Toni.
NeurIPS, 2023.
Human Control: Definitions and Algorithms:
We study definitions of human control, including variants of corrigibility and alignment, the assurances they offer for human autonomy, and the algorithms that can be used to obtain them.
Ryan Carey, Tom Everitt
UAI, 2023
A full list of publications is available here and at my dblp and Google Scholar.
Below I list my papers together with some context.
An overview of how Pearlian causality can serve as a foundation for key AGI safety problems:
An accessible and comprehensive overview of the emerging research field of AGI safety:
AGI Safety Literature Review. Tom Everitt, Gary Lea, and Marcus Hutter. In International Joint Conference on AI (IJCAI) and arXiv, 2018.
A machine learning research agenda for how to build safe AGI:
Scalable agent alignment via reward modeling: a research direction. Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg. In arXiv and blog post, 2018. Two Minute Papers video.
The UAI/AIXI framework is a formal model of reinforcement learning in general environments. Many of my other works are based on variations of this framework:
Gridworlds make AGI safety problems very concrete:
AI Safety Gridworlds. Jan Leike, Miljan Martic, Victoria Krakovna, Pedro Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg. In arXiv and GitHub, 2017. Computerphile video.
Modeling AGI Safety Frameworks with Causal Influence Diagrams. Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg. In IJCAI AI Safety Workshop and arXiv, 2019.
The focus of most of my work has been to understand the incentives of powerful AI systems.
General method. There is a general method for inferring agent incentives directly from a graphical model (an illustrative sketch follows the papers below).
Agent Incentives: A Causal Perspective. Tom Everitt, Ryan Carey, Eric Langlois, Pedro Ortega, Shane Legg. In AAAI and arXiv, 2021.
Understanding Agent Incentives using Causal Influence Diagrams (mostly superseded by AI:ACP). Tom Everitt, Pedro A. Ortega, Elizabeth Barnes, Shane Legg. In arXiv and blog post, 2019. Independent Chinese translation.
The Incentives that Shape Behavior (mostly superseded by AI:ACP). Ryan Carey, Eric Langlois, Tom Everitt, Shane Legg. In arXiv and blog post, and to be presented at the SafeAI AAAI workshop, 2020. Independent Chinese translation.
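As an illustrative sketch (not the formal criteria of the papers above), the flavour of these graphical checks can be conveyed with a toy single-decision influence diagram: an observation can only carry value of information for the decision if it is d-connected to the utility node given the decision and its other observations (the requisite-observation idea). The node names X1, X2, D, U below are made up for the example.

```python
from itertools import combinations
from collections import deque

def d_separated(parents, xs, ys, zs):
    """Test xs _||_ ys | zs in the DAG given by `parents` (node -> set of parents),
    via the moralised ancestral graph."""
    relevant = set(xs) | set(ys) | set(zs)
    ancestors, stack = set(), list(relevant)
    while stack:                      # ancestral closure of the relevant nodes
        n = stack.pop()
        if n not in ancestors:
            ancestors.add(n)
            stack.extend(parents.get(n, ()))
    adj = {n: set() for n in ancestors}
    for child in ancestors:           # moralise: link parents to child and to each other
        ps = [p for p in parents.get(child, ()) if p in ancestors]
        for p in ps:
            adj[child].add(p); adj[p].add(child)
        for a, b in combinations(ps, 2):
            adj[a].add(b); adj[b].add(a)
    blocked = set(zs)                 # drop conditioned nodes, then check connectivity
    frontier = deque(x for x in xs if x not in blocked)
    seen = set(frontier)
    while frontier:
        n = frontier.popleft()
        if n in ys:
            return False              # reachable => d-connected
        for m in adj[n] - blocked:
            if m not in seen:
                seen.add(m); frontier.append(m)
    return True

# Toy CID: decision D observes X1 and X2; utility U depends on D and X1 only.
parents = {"X1": set(), "X2": set(), "D": {"X1", "X2"}, "U": {"D", "X1"}}
for obs in ("X1", "X2"):
    others = ({"X1", "X2"} - {obs}) | {"D"}
    print(obs, "can have value of information:",
          not d_separated(parents, {obs}, {"U"}, others))
# X1 -> True, X2 -> False: the agent has no incentive to respond to X2.
```

The papers above develop sound and complete graphical criteria of this kind for value of information, response incentives, instrumental control incentives, and related notions.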
Causal Reasoning in Games. Lewis Hammond, James Fox, Tom Everitt, Alessandro Abate, Michael Wooldridge. Working paper, 2022.
A Complete Criterion for Value of Information in Soluble Influence Diagrams. Chris van Merwijk, Ryan Carey, Tom Everitt. In AAAI and arXiv, 2022.
Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice. Lewis Hammond, James Fox, Tom Everitt, Alessandro Abate, Michael Wooldridge. In AAMAS and arXiv, 2021.
Discovering Agents. Zachary Kenton, Ramana Kumar, Sebastian Farquhar, Jonathan Richens, Matt MacDermott, Tom Everitt. In arXiv, DeepMind blog, and alignmentforum, 2022.
Fairness. When is unfairness incentivised? Perhaps surprisingly, it can be incentivised even when labels are completely fair:
Why Fair Labels Can Yield Unfair Predictions: Graphical Conditions for Introduced Unfairness. Carolyn Ashurst, Ryan Carey, Silvia Chiappa, Tom Everitt. In AAAI and arXiv, 2022.
Reward tampering. Various ideas in the AGI safety literature can be combined to form RL-like agents without significant incentives to interfere with any aspect of their reward process, be it the reward signal, the utility function, or the online training of the reward function.
Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective. Tom Everitt and Marcus Hutter. In Synthese, arXiv and blog post, 2021. Independent Chinese translation.
The Alignment Problem for Bayesian History-Based Reinforcement Learners. Tom Everitt and Marcus Hutter. Technical report, 2018. Winner of the AI Alignment Prize.
Path-Specific Objectives for Safer Agent Incentives. Sebastian Farquhar, Ryan Carey, Tom Everitt. In AAAI and arXiv, 2022.
If the reward signal can be (accidentally) corrupted, the following paper explains why both richer feedback and randomized algorithms (quantilization) improve robustness to reward corruption; a toy sketch of the randomization idea follows the citation.
Reinforcement Learning with a Corrupted Reward Channel. Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg. In IJCAI-17 and arXiv, 2017. Blog post, Slides, Victoria's talk.
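A toy sketch of the randomization idea (a quantilizer-style selector, not the exact construction analysed in the paper): instead of maximising a possibly corrupted reward estimate, sample uniformly from its top quantile, so a single wildly overrated action cannot fully dominate behaviour.

```python
import random

def quantilize(actions, estimated_reward, q=0.1):
    """Sample uniformly from the top q-fraction of actions ranked by a
    (possibly corrupted) reward estimate, instead of taking the argmax."""
    ranked = sorted(actions, key=estimated_reward, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return random.choice(top)
```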
Following up on this work, we generalize the CRMDP framework of the previous paper to arbitrary forms of feedback, and apply the idea of decoupled feedback to approval-directed agents in REALab, a 3D environment with integrated tampering:
REALab: An Embedded Perspective on Tampering. Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, Shane Legg. In arXiv and DMSR blog post, 2020. Independent Chinese translation.
Avoiding Tampering Incentives in Deep RL via Decoupled Approval. Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg. In arXiv and DMSR blog post, 2020. Independent Chinese translation.
Corrigibility. Different RL algorithms react differently to user intervention. The differences can be analyzed with causal influence diagrams:
How RL Agents Behave when their Actions are Modified. Eric Langlois, Tom Everitt. In AAAI and arXiv, 2021.
Self-modification. Subtly different design choices lead to systems with or without incentives to replace their goal or utility functions:
Self-Modification of Policy and Utility Function in Rational Agents Tom Everitt, Daniel Filan, Mayank Daswani, and Marcus Hutter. In AGI-16 and arXiv, 2016. Slides, video. Winner of the Kurzweil prize for best AGI paper.
Self-preservation and death. AIs may have an incentive not to be turned off.
There is a natural mathematical definition of death in the UAI/AIXI framework. RL agents can be suicidal:
Death and Suicide in Universal Artificial Intelligence. Jarryd Martin, Tom Everitt, and Marcus Hutter. In AGI-16 and arXiv, 2016. Slides.
Extending the analysis of a previous paper, we determine the exact conditions under which CIRL agents ignore a shutdown signal:
A Game-Theoretic Analysis of the Off-Switch Game. Tobias Wängberg, Mikael Böörs, Elliot Catt, Tom Everitt, and Marcus Hutter. In AGI-17 and arXiv, 2017.
Decision theory. Strangely, robots and other agents that are part of their environment may be able to infer properties of themselves from their own actions. For example, my having petted a lot of cats in the past may be evidence that I have toxoplasmosis, a disease that makes you fond of cats. Now, if I see a cat, should I avoid petting it to reduce the risk that I have the disease? (Note that petting cats never causes toxoplasmosis.) The two standard answers for how to reason in this situation are causal decision theory (CDT) and evidential decision theory (EDT); their one-shot versions are written out after the citation below. We show that CDT and EDT turn into three possibilities for how to reason in sequential settings where multiple actions are interleaved with observations:
Sequential Extensions of Causal and Evidential Decision Theory. Tom Everitt, Jan Leike, and Marcus Hutter. In Algorithmic Decision Theory (ADT) and arXiv, 2015. Slides, source code.
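In the one-shot case the two decision rules can be written side by side: EDT conditions on the action, while CDT intervenes on it.

\[
a_{\text{EDT}} \in \arg\max_a \; \mathbb{E}[U \mid A = a],
\qquad
a_{\text{CDT}} \in \arg\max_a \; \mathbb{E}[U \mid \mathrm{do}(A = a)].
\]

In the toxoplasmosis example, conditioning on petting raises the probability of having the disease, so EDT may refrain from petting, whereas intervening on the action leaves that probability unchanged, so CDT happily pets the cat.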
Other AI safety papers. An approach to solving the wireheading problem. I now believe this approach has no benefit over TI-unaware reward modeling, described in my reward tampering paper.
Avoiding Wireheading with Value Reinforcement Learning Tom Everitt and Marcus Hutter. In AGI-16 and arXiv, 2016. Slides, video. Source code: download, view online.
Exploration. A fundamental problem in reinforcement learning is how to explore an unknown environment effectively. Ideally, an exploration strategy should direct us to regions with potentially high reward, while not being too expensive to compute. In the following paper, we find a way to employ standard function approximation techniques to estimate the novelty of different actions, which gives state-of-the-art performance in the popular Arcade Learning Environment (Atari games) while being much cheaper to compute than most alternative strategies (a simplified sketch of the count-based idea follows the citation):
Count-Based Exploration in Feature Space for Reinforcement Learning. Jarryd Martin, Suraj Narayanan S, Tom Everitt, and Marcus Hutter. In IJCAI-17 and arXiv, 2017.
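A much-simplified sketch of the count-based idea (hashing the state stands in for the paper's feature-space density estimator; the class name and the bonus constant are illustrative):

```python
import math
from collections import defaultdict

class CountBonus:
    """Exploration bonus beta / sqrt(N(phi(s))): novel states get a large
    optimism bonus that decays as their (abstract) visit count grows."""
    def __init__(self, phi, beta=0.1):
        self.phi = phi                 # state -> hashable abstraction (assumed given)
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state):
        key = self.phi(state)
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])

# Usage: train the RL agent on env_reward + bonus(state).
```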
Background. Search and optimisation are fundamental aspects of AI and of intelligence in general. Intelligence can actually be defined as optimisation ability (Legg and Hutter, Universal Intelligence: A Definition of Machine Intelligence, 2007).
(No) Free Lunch. The No Free Lunch theorems state that intelligent optimisation is impossible without knowledge about what you're trying to optimise (the classical statement is summarised after the papers below). I argue against these theorems, and show that under a natural definition of complete uncertainty, intelligent (better-than-random) optimisation is possible. Unfortunately, I was also able to show that there are pretty strong limits on how much better intelligent search can be compared to random search.
Free Lunch for Optimisation under the Universal Distribution. Tom Everitt, Tor Lattimore, and Marcus Hutter. In IEEE Congress on Evolutionary Computation (CEC) and arXiv, 2014. Slides.
Universal Induction and Optimisation: No Free Lunch? Tom Everitt, supervised by Tor Lattimore, Peter Sunehag, and Marcus Hutter at ANU. Master's thesis, Department of Mathematics, Stockholm University, 2013.
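For reference, the classical result being argued against says that, averaged uniformly over all functions f: X → Y on finite sets, every (non-repeating) black-box search algorithm performs the same:

\[
\frac{1}{|Y|^{|X|}} \sum_{f \colon X \to Y} \mathrm{perf}(A, f)
\;=\;
\frac{1}{|Y|^{|X|}} \sum_{f \colon X \to Y} \mathrm{perf}(B, f)
\]

for any two such algorithms A and B and any performance measure that depends only on the sequence of observed function values. The papers above instead weight functions by the universal distribution, under which simple functions are more likely, so better-than-random search becomes possible, though only to a bounded degree.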
Optimisation difficulty. In a related paper, we give a formal definition of how hard a function is to optimise:
Can we measure the difficulty of an optimization problem? Tansu Alpcan, Tom Everitt, and Marcus Hutter. In IEEE Information Theory Workshop (ITW), 2014. PDF (©IEEE).
How to search. Two of the most fundamental strategies for search are depth-first search (DFS) and breadth-first search (BFS). In DFS, you follow one path until its very end before trying something else. In BFS, you instead search as broadly as possible, focusing on breadth rather than depth. I calculate the expected search times for both methods, and derive some results on which method is preferable in which situations (a textbook sketch of the two strategies follows the papers below):
Analytical Results on the BFS vs. DFS Algorithm Selection Problem. Part I, Tree Search. Tom Everitt and Marcus Hutter. In 28th Australasian Joint Conference on AI and arXiv, 2015. Slides, Source Code.
Analytical Results on the BFS vs. DFS Algorithm Selection Problem. Part II, Graph Search. Tom Everitt and Marcus Hutter. In 28th Australasian Joint Conference on AI and arXiv, 2015. Slides, Source Code.
Analytical Algorithm Selection for AI Search: BFS vs. DFS. Tom Everitt and Marcus Hutter. In preparation, 2017. Source Code.
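For concreteness, here is the textbook form of the two strategies on an explicit tree (a generic sketch; the papers above analyse expected search times analytically rather than running code like this):

```python
from collections import deque

def bfs(root, children, is_goal):
    """Breadth-first search: expand shallow nodes first; finds a shallowest goal,
    but may hold a whole level of the tree in memory."""
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        if is_goal(node):
            return node
        frontier.extend(children(node))
    return None

def dfs(root, children, is_goal):
    """Depth-first search: follow one branch to the bottom before backtracking;
    memory-light, and fast when goals are plentiful at large depths."""
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if is_goal(node):
            return node
        frontier.extend(children(node))
    return None
```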
Classification by decomposition: a novel approach to classification of symmetric 2×2 games. Mikael Böörs, Tobias Wängberg, Tom Everitt & Marcus Hutter. In Theory and Decision, 2021.
Automated Theorem Proving. Tom Everitt, supervised by Rikard Bøgvad. Bachelor's thesis, Department of Mathematics, Stockholm University, 2010.
Find me on Bluesky, Twitter, Facebook, LinkedIn, Google Scholar, dblp, ORCID.