Emma Brunskill

Recent talk that overviews my research.

My goal is to create AI systems that learn from few samples to robustly make good decisions. This is a crucial part of human intelligence that enables people to adapt, and even thrive, in changing environments. While scientific knowledge can help guide interventions, there remains a pressing need to quickly cut through the space of decision policies to find effective strategies to support people. Reinforcement learning (RL), the subfield of artificial intelligence focused on agents that learn through experience to make high-utility choices, is a powerful framework for addressing these challenges. However, most recent efforts in RL have focused on settings where good or perfect simulators exist and it is cheap and feasible to collect millions of trials. In contrast, my work focuses on the many societal applications – treating patients with advanced liver cancer, or training people to be computer technicians – that both lack good simulators and are inherently data-limited.

I create new reinforcement learning algorithms and conduct theoretical analysis to understand what is possible, all inspired and motivated by my applications in education and healthcare.

Reinforcement Learning with Few Samples

At the heart of RL is how an agent should actively explore, through trial and error, to make sequences of decisions that maximize expected outcomes even in noisy, stochastic settings. Yet for 20 years the limits on how quickly any agent can learn to make good decisions remained open, even in the simplest RL settings. Our work has closed this question for tabular episodic settings, providing tight upper and lower bounds on the dominant term for the two most common performance criteria, regret and probably approximately correct (PAC) learning. This work, like others', focuses on worst-case guarantees on how well an algorithm will perform in any Markov decision process.
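To make the tabular episodic setting concrete, here is a minimal sketch of count-based optimistic planning, the generic mechanism behind many strategic-exploration algorithms with regret guarantees. The function name, bonus constant, and exact bonus form are illustrative, not those of any specific algorithm from this line of work:

```python
import numpy as np

def optimistic_q_values(counts, trans_counts, rew_sums, H, c=1.0):
    """Backward induction with a count-based exploration bonus (a generic
    UCB-style sketch for tabular episodic RL; constants are illustrative).

    counts:       (nS, nA) visit counts per state-action pair
    trans_counts: (nS, nA, nS) observed next-state counts
    rew_sums:     (nS, nA) summed observed rewards
    Returns optimistic Q-values of shape (H, nS, nA).
    """
    nS, nA = counts.shape
    Q = np.zeros((H, nS, nA))
    V = np.zeros(nS)  # optimistic value at step h+1; zero after the final step
    for h in range(H - 1, -1, -1):
        for s in range(nS):
            for a in range(nA):
                n = max(counts[s, a], 1)
                r_hat = rew_sums[s, a] / n          # empirical mean reward
                p_hat = trans_counts[s, a] / n      # empirical transition model
                bonus = c * np.sqrt(np.log(2 * n) / n)  # shrinks with visits
                Q[h, s, a] = min(H, r_hat + bonus + p_hat @ V)
        V = Q[h].max(axis=1)  # greedy optimistic value at step h
    return Q
```

An agent would replan with these optimistic Q-values at the start of each episode and act greedily; the bonus steers it toward under-visited state–action pairs until uncertainty shrinks.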

Fortunately, RL often works far better in practice. My work has provided the first generic strategic-exploration RL algorithm with instance-dependent bounds. In contrast to worst-case bounds, or to approaches that require a domain expert to provide explicit information about the task structure (such as deterministic dynamics or the process being a bandit), our algorithm provides regret bounds that match worst-case bounds while also being near optimal relative to the bounds achievable in structured settings where the structure is given as input. A key insight was that the variance of the value of the next state under an optimal policy is not only core to the hardness of learning, as had been observed empirically, but is also a non-random quantity during learning, since the optimal value function is unknown but fixed; this enabled us to prove strong bounds as a function of this quantity.

This result is for the tabular setting. With function approximation, strong worst-case results remain open even for linear value functions, but we and others are actively pushing forward on this front.

Efficient RL is equally important in lifelong and transfer learning settings. Despite many empirically promising results, transfer across RL tasks can also perform poorly, and there were few formal assurances of success. We proved the first results showing that lifelong learning across a series of related RL tasks can speed learning. We considered an important subclass of problems in which tasks are drawn from a finite but unknown set. By adapting finite-sample results for latent variable learning, we proved PAC and regret bounds for transfer learning in bandits and tabular MDPs. Later learning tasks can essentially reduce to classification, and similar results are possible even under adversarial task selection. We used partial personalization across tasks to obtain significantly better news article recommendations on a dataset of 500,000 users.

Robust Decisions from Historical Data

Many educational and health settings do not yet, or may never, allow online experimentation, but vast troves of data about past decisions and their outcomes offer an enormous opportunity to improve future decision policies. Popular algorithms like Q-learning can operate off-policy, using data collected from another behavior policy, but offer little assurance about performance in the batch-data setting. Importance sampling is often overlooked due to its high variance, but I used it to obtain accurate off-policy performance estimates of new adaptive policies for Refraction, an educational game that teaches fractions. Using data from 11,000 learners we identified a new decision policy estimated to increase engagement by 30%, which we later confirmed with 2,000 learners.
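As a sketch of how importance sampling produces such estimates, the following weights each logged trajectory's return by the likelihood ratio between the new policy and the behavior policy. The data format and function name are illustrative, not from the evaluation described above:

```python
import numpy as np

def is_estimate(trajectories, eval_policy, weighted=True):
    """(Weighted) importance sampling estimate of a new policy's value.

    trajectories: list of trajectories, each [(state, action, behavior_prob, reward), ...]
    eval_policy:  dict mapping (state, action) -> probability under the new policy
    """
    weights, returns = [], []
    for traj in trajectories:
        rho, G = 1.0, 0.0
        for s, a, b_prob, r in traj:
            rho *= eval_policy.get((s, a), 0.0) / b_prob  # cumulative likelihood ratio
            G += r                                        # undiscounted return
        weights.append(rho)
        returns.append(G)
    weights, returns = np.array(weights), np.array(returns)
    if weighted:
        # self-normalized: small bias, much lower variance
        return (weights * returns).sum() / max(weights.sum(), 1e-12)
    return (weights * returns).mean()  # unbiased but high variance
```

The weighted variant normalizes by the sum of the ratios, trading a small bias for substantially lower variance, which is often the practical choice.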

To develop broadly accurate estimators of decision-policy performance from past data, a key challenge is balancing bias and variance. There have been many algorithmic approaches, often with good empirical success but with known shortcomings. Doubly robust estimators from statistics provide a principled way to combine parametric models with non-parametric estimators. However, to support good decision making in practice, we need approaches with high accuracy given the available data. I introduced two new doubly robust RL estimators that more explicitly balance bias and variance. We proved both are strongly consistent, and they yielded estimates that could be an order of magnitude more data-efficient. Indeed, bias–variance tradeoffs can be crucial: we showed that policy selection based on unbiased estimators can be systematically suboptimal. Our doubly robust estimators have been built on extensively, yet our MAGIC estimator still dominated in some settings in a recent 2020 bakeoff.
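For readers unfamiliar with the basic form, here is a sketch of the standard per-decision doubly robust recursion for sequential settings; the interface is illustrative, and the estimators described above add explicit bias–variance balancing on top of this form:

```python
def doubly_robust(traj, eval_policy, q_hat, v_hat, gamma=1.0):
    """Per-trajectory doubly robust off-policy value estimate (generic sketch).

    traj:        [(state, action, behavior_prob, reward), ...]
    eval_policy: dict mapping (state, action) -> probability under the new policy
    q_hat, v_hat: approximate model-based value estimates, as dicts

    Backward recursion:
      DR_t = v_hat(s_t) + rho_t * (r_t + gamma * DR_{t+1} - q_hat(s_t, a_t))
    """
    dr = 0.0
    for s, a, b_prob, r in reversed(traj):
        rho = eval_policy.get((s, a), 0.0) / b_prob  # per-step likelihood ratio
        dr = v_hat.get(s, 0.0) + rho * (r + gamma * dr - q_hat.get((s, a), 0.0))
    return dr
```

If the model (q_hat, v_hat) is accurate, the importance-weighted correction terms are near zero and variance is low; if the model is wrong, those corrections keep the estimate consistent — hence "doubly robust."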

Policy performance estimation is generally part of the broader goal of robustly selecting a good policy for future use. I have introduced the first (to my knowledge) structural-risk-minimization-style bound, providing finite-sample guarantees on the performance of the selected policy relative to the best-in-class policy for optimal stopping and treatment selection problems. We are now collaborating with Miguel Hernan at Harvard to apply our methods to HIV treatment selection using historical data. More broadly, I seek to answer the critical question of when batch RL can robustly go beyond imitation learning, and to investigate the preliminary evidence of substantial benefits enabled by combining insights from batch and online RL.

Achieving Good Outcomes

Some of the most influential potential applications of RL involve domains with complicated objectives, including fairness. Existing approaches tend to optimize for the average outcome, but doing so can be inequitable. For example, despite the benefits of adaptive math tutoring software, we showed that approaches relying on popular statistical models of student learning, with parameters fit to population data, may leave some students with inadequate support. In a recent Science article we suggest a framework for learning decision policies from historical data that are guaranteed to satisfy safety or fairness constraints when deployed in the future, and we introduce an algorithm that can handle a broad class of constraints. Using this algorithm, we could identify personalized insulin dosage policies satisfying safety constraints on low blood sugar events with only 3–5 months of patient data, using an FDA-approved simulator.
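One simple instance of the kind of high-confidence safety test such an algorithm can apply: approve a candidate policy only if a lower confidence bound on its estimated value clears a required baseline. This is a minimal sketch using a Hoeffding bound, with an illustrative interface; the actual framework supports a much broader class of constraints:

```python
import numpy as np

def passes_safety_test(returns, baseline, delta=0.05):
    """Approve a candidate policy only if a (1 - delta) lower confidence
    bound on its estimated value beats a required baseline.

    Assumes per-trajectory return estimates lie in [0, 1] so a Hoeffding
    bound applies; name and interface are illustrative.
    """
    n = len(returns)
    lcb = np.mean(returns) - np.sqrt(np.log(1.0 / delta) / (2 * n))
    return lcb > baseline
```

Because the test conditions deployment on a high-confidence bound rather than a point estimate, a policy that merely *looks* better on noisy data is rejected until enough evidence accumulates.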

I have also created new online algorithms that optimize for both online performance and arm-mean estimation; the latter can yield generalizable scientific knowledge. This work has led to new insights into how mixing the display format of fractions (pie chart or symbol) can increase the difficulty of math number-line learning, as later verified in a dataset of 9,000+ learners. For a particular bandit setting, optimizing for active arm estimation and cumulative regret does not actually require sacrificing performance on either: our approach asymptotically matches the minimax rates for both.
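For intuition, a standard UCB-style bandit already produces both quantities of interest, cumulative reward and per-arm mean estimates; the research question above is how well both can be achieved simultaneously. This sketch is generic UCB1, not the algorithm discussed here, and the names and constants are illustrative:

```python
import numpy as np

def ucb_bandit(pull, n_arms, horizon, c=2.0, seed=0):
    """UCB1-style bandit returning cumulative reward and per-arm mean estimates.

    pull: callable (arm_index, rng) -> observed reward
    """
    rng = np.random.default_rng(seed)
    counts, sums = np.zeros(n_arms), np.zeros(n_arms)
    total = 0.0
    for t in range(horizon):
        if t < n_arms:
            a = t  # warm start: pull each arm once
        else:
            ucb = sums / counts + np.sqrt(c * np.log(t + 1) / counts)
            a = int(np.argmax(ucb))  # optimism in the face of uncertainty
        r = pull(a, rng)
        counts[a] += 1
        sums[a] += r
        total += r
    return total, sums / counts  # cumulative reward, per-arm mean estimates
```

The tension the text describes is visible here: maximizing cumulative reward concentrates pulls on the best arm, which starves the other arms' mean estimates of samples.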

Performance is a function of the human-designed sets of actions and sensors, but sometimes these initial sets fall short of supporting great outcomes. I have created AI algorithms that decide when and where to ask a human to add actions or to refine the state space. When the best possible performance of the current decision process can be accurately predicted, a human expert can decide whether to keep optimizing or to alter the system. Surprisingly, for some linear bandits it is possible to estimate the optimal performance of any threshold policy before such a policy can even be computed.

I have also studied new decision spaces that may be critical to enable adaptive RL systems to provide outstanding support for human learning and performance. As one example, though flashcards are a very popular and well-studied method for fact acquisition, using a chatbot led to a substantial increase in subjects' item recall compared to using flashcards (d=0.5) in an optional usage study. Both used the same state-of-the-art algorithm for ordering items, highlighting the additive benefits of good decision spaces and smart algorithms.


In summary, I aim to advance the technical understanding of reinforcement learning, motivated by key societal applications. It is by weaving between foundations and applications that we will create continually improving human-AI systems worthy of the people they serve.