Kai Wang*, Lily Xu*, Aparna Taneja, and Milind Tambe. 2/14/2023. “Optimistic Whittle Index Policy: Online Learning for Restless Bandits.” AAAI Conference on Artificial Intelligence (AAAI).Abstract
Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. Solving RMABs requires information on transition dynamics, which are often unknown upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we estimate confidence bounds of the transition probabilities and formulate a bilinear program to compute optimistic Whittle indices using these estimates. Our algorithm, UCWhittle, achieves sublinear $O(H \sqrt{T \log T})$ frequentist regret to solve RMABs with unknown transitions in $T$ episodes with a constant horizon $H$. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including one constructed via sampling from a real-world maternal and childcare dataset.
Kai Wang*, Shresth Verma*, Aditya Mate, Sanket Shah, Aparna Taneja, Neha Madhiwalla, Aparna Hegde, and Milind Tambe. 2/14/2023. “Scalable Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Health.” AAAI Conference on Artificial Intelligence (AAAI).Abstract
This paper studies restless multi-armed bandit (RMAB) problems with unknown arm transition dynamics but with known correlated arm features. The goal is to learn a model to predict transition dynamics given features, where the Whittle index policy solves the RMAB problems using predicted transitions. However, prior works often learn the model by maximizing the predictive accuracy instead of final RMAB solution quality, causing a mismatch between training and evaluation objectives. To address this shortcoming, we propose a novel approach for decision-focused learning in RMAB that directly trains the predictive model to maximize the Whittle index solution quality. We present three key contributions: (i) we establish differentiability of the Whittle index policy to support decision-focused learning; (ii) we significantly improve the scalability of decision-focused learning approaches in sequential problems, specifically RMAB problems; (iii) we apply our algorithm to a previously collected dataset of maternal and child health to demonstrate its performance. Indeed, our algorithm is the first for decision-focused learning in RMAB that scales to real-world problem sizes.
Shresth Verma, Gargi Singh, Aditya Mate, Paritosh Verma, Sruthi Gorantala, Neha Madhiwalla, Aparna Hegde, Divy Thakkar, Manish Jain, Milind Tambe, and Aparna Taneja. 2/10/2023. “Increasing Impact of Mobile Health Programs: SAHELI for Maternal and ChildCare.” In Innovative Applications of Artificial Intelligence (IAAI). iaai_2023_armman_rmab_deployment_5.pdf
Paula Rodriguez Diaz, Jackson A Killian, Lily Xu, Arun Sai Suggala, Aparna Taneja, and Milind Tambe. 2/7/2023. “Flexible Budgets in Restless Bandits:A Primal-Dual Algorithm for Efficient Budget Allocation.” In AAAI Conference on Artificial Intelligence (AAAI). frmab_aaai23_preprint.pdf
Jackson A. Killian*, Lily Xu*, Arpita Biswas*, Shresth Verma*, Vineet Nair, Aparna Taneja, Aparna Hegde, Neha Madhiwalla, Paula Rodriguez Diaz, Sonja Johnson-Yu, and Milind Tambe. 2/2023. “Robust Planning over Restless Groups: Engagement Interventions for a Large-Scale Maternal Telehealth Program.” In AAAI Conference on Artificial Intelligence. group_robust_rmab_aaai23.pdf
Shresth Verma, Aditya Mate, Kai Wang, Aparna Taneja, and Milind Tambe. 12/10/2022. “Case Study: Applying Decision Focused Learning inthe RealWorld.” In NeurIPS 2022 workshop on Trustworthy and Socially Responsible Machine Learning. neurips_2022_trustworthy_ai_workshop_3.pdf
Sanket Shah, Kai Wang, Bryan Wilder, Andrew Perrault, and Milind Tambe. 12/2022. “Decision-Focused Learning without Differentiable Optimization: Learning Locally Optimized Decision Losses.” In Conference on Neural Information Processing Systems (NeurIPS). Vol. 36. New Orleans.Abstract
Decision-Focused Learning (DFL) is a paradigm for tailoring a predictive model to a downstream optimization task that uses its predictions in order to perform better on that specific task. The main technical challenge associated with DFL is that it requires being able to differentiate through the optimization problem, which is difficult due to discontinuous solutions and other challenges. Past work has largely gotten around this this issue by handcrafting task-specific surrogates to the original optimization problem that provide informative gradients when differentiated through. However, the need to handcraft surrogates for each new task limits the usability of DFL. In addition, there are often no guarantees about the convexity of the resulting surrogates and, as a result, training a predictive model using them can lead to inferior local optima. In this paper, we do away with surrogates altogether and instead learn loss functions that capture task-specific information. To the best of our knowledge, ours is the first approach that entirely replaces the optimization component of decision-focused learning with a loss that is automatically learned. Our approach (a) only requires access to a black-box oracle that can solve the optimization problem and is thus generalizable, and (b) can be convex by construction and so can be easily optimized over. We evaluate our approach on three resource allocation problems from the literature and find that our approach outperforms learning without taking into account task-structure in all three domains, and even hand-crafted surrogates from the literature.
Aditya Mate. 10/16/2022. “Optimization and Planning of Limited Resources for Assisting Non-Profits in Improving Maternal and Child Health.” INFORMS Doing Good with Good OR.Abstract

The maternal mortality rate in India is appalling, largely fueled by lack of access to preventive care information, especially in low resource households. We partner with non-profit, ARMMAN, that aims to use mobile health technologies to improve the maternal and child health outcomes.


To assisst ARMMAN and such non-profits, we develop a Restless Multi-Armed Bandit (RMAB) based solution to help improve accessibility of critical health information, via increased engagement of beneficiaries with their program. We address fundamental research challenges that crop up along the way and present technical advances in RMABs and Planning Algorithms for Limited-Resource Allocation. Transcending the boundaries of typical laboratory research, we also deploy our models in the field, and present results from a first-of-its-kind pilot test employing and evaluating RMABs in a real-world public health application.

Zun Li, Feiran Jia, Aditya Mate, Shahin Jabbari, Mithun Chakraborty, Milind Tambe, and Yevgeniy Vorobeychik. 8/2022. “Solving Structured Hierarchical Games Using Differential Backward Induction.” In Conference on Uncertainty in Artificial Intelligence (UAI). Eindhoven, Netherlands.Abstract
From large-scale organizations to decentralized political systems, hierarchical strategic decision making is commonplace. We introduce a novel class of structured hierarchical games (SHGs) that formally capture such hierarchical strategic interactions. In an SHG, each player is a node in a tree, and strategic choices of players are sequenced from root to leaves, with root moving first, followed by its children, then followed by their children, and so on until the leaves. A player’s utility in an SHG depends on its own decision, and on the choices of its parent and all the tree leaves. SHGs thus generalize simultaneous-move games, as well as Stackelberg games with many followers. We leverage the structure of both the sequence of player moves as well as payoff dependence to develop a gradientbased back propagation-style algorithm, which we call Differential Backward Induction (DBI), for approximating equilibria of SHGs. We provide a sufficient condition for convergence of DBI and demonstrate its efficacy in finding approximate equilibrium solutions to several SHG models of hierarchical policy-making problems.
Jackson A. Killian, Lily Xu, Arpita Biswas, and Milind Tambe. 8/2022. “Restless and Uncertain: Robust Policies for Restless Bandits via Deep Multi-Agent Reinforcement Learning.” In Uncertainty in Artificial Intelligence (UAI).Abstract
We introduce robustness in restless multi-armed bandits (RMABs), a popular model for constrained resource allocation among independent stochastic processes (arms). Nearly all RMAB techniques assume stochastic dynamics are precisely known. However, in many real-world settings, dynamics are estimated with significant uncertainty, e.g., via historical data, which can lead to bad outcomes if ignored. To address this, we develop an algorithm to compute minimax regret--robust policies for RMABs. Our approach uses a double oracle framework (oracles for agent and nature), which is often used for single-process robust planning but requires significant new techniques to accommodate the combinatorial nature of RMABs. Specifically, we design a deep reinforcement learning (RL) algorithm, DDLPO, which tackles the combinatorial challenge by learning an auxiliary "λ-network" in tandem with policy networks per arm, greatly reducing sample complexity, with guarantees on convergence. DDLPO, of general interest, implements our reward-maximizing agent oracle. We then tackle the challenging regret-maximizing nature oracle, a non-stationary RL challenge, by formulating it as a multi-agent RL problem between a policy optimizer and adversarial nature. This formulation is of general interest---we solve it for RMABs by creating a multi-agent extension of DDLPO with a shared critic. We show our approaches work well in three experimental domains.
killian_uai_2022_restless_uncertain.pdf killian_uai_2022_restless_uncertain-supp.pdf
Lily Xu, Arpita Biswas, Fei Fang, and Milind Tambe. 7/23/2022. “Ranked Prioritization of Groups in Combinatorial Bandit Allocation.” International Joint Conference on Artificial Intelligence (IJCAI) 31. Vienna, Austria. arXiv linkAbstract
Preventing poaching through ranger patrols is critical for protecting endangered wildlife. Combinatorial bandits have been used to allocate limited patrol resources, but existing approaches overlook the fact that each location is home to multiple species in varying proportions, so a patrol benefits each species to differing degrees. When some species are more vulnerable, we ought to offer more protection to these animals; unfortunately, existing combinatorial bandit approaches do not offer a way to prioritize important species. To bridge this gap, (1) We propose a novel combinatorial bandit objective that trades off between reward maximization and also accounts for prioritization over species, which we call ranked prioritization. We show this objective can be expressed as a weighted linear sum of Lipschitz-continuous reward functions. (2) We provide RankedCUCB, an algorithm to select combinatorial actions that optimize our prioritization-based objective, and prove that it achieves asymptotic no-regret. (3) We demonstrate empirically that RankedCUCB leads to up to 38% improvement in outcomes for endangered species using real-world wildlife conservation data. Along with adapting to other challenges such as preventing illegal logging and overfishing, our no-regret algorithm addresses the general combinatorial bandit problem with a weighted linear objective.
E. Cranford, S. Jabbari, H-C. Ou, M. Tambe, C. Gonzalez, and C. Lebiere. 7/2022. “Combining Machine Learning and Cognitive Models for Adaptive Phishing Training.” In Internation Conference on Cognitive Modeling (ICCM).Abstract
Organizations typically use simulation campaigns to train employees to detect phishing emails but are non-personalized and fail to account for human experiential learning and adaptivity. We propose a method to improve the effectiveness of training by combining cognitive modeling with machine learning methods. We frame the problem as one of scheduling and use the restless multi-armed bandit (RMAB) framework to select which users to target for intervention at each trial, while using a cognitive model of phishing susceptibility to inform the parameters of the RMAB. We compare the effectiveness of the RMAB solution to two purely cognitive approaches in a series of simulation studies using the cognitive model as simulated participants. Both approaches show improvement compared to random selection and we highlight the pros and cons of each approach. We discuss the implications of these findings and future research that aims to combine the benefits of both methods for a more effective solution.
Combining Machine Learning and Cognitive Models for Adaptive Phishing Training
Vineet Nair, Kritika Prakash, Michael Wilbur, Aparna Taneja, Corrine Namblard, Oyindamola Adeyemo, Abhishek Dubey, Abiodun Adereni, Milind Tambe, and Ayan Mukhopadhyay. 7/2022. “ADVISER: AI-Driven Vaccination Intervention Optimiser for Increasing Vaccine Uptake in Nigeria.” In International Joint Conference on AI (IJCAI) 7/2022. Abstract
More than 5 million children under five years die from largely preventable or treatable medical conditions every year, with an overwhelmingly large proportion of deaths occurring in under-developed countries with low vaccination uptake. One of the United Nations’ sustainable development goals (SDG 3) aims to end preventable deaths of new-borns and children under five years of age. We focus on Nigeria, where the rate of infant mortal-ity is appalling. We collaborate with HelpMum, a large non-profit organization in Nigeria to design and optimize the allocation of heterogeneous health interventions under uncertainty to increase vaccination uptake, the first such collaboration in Nigeria. Our framework, ADVISER: AI-Driven Vaccination Intervention Optimiser, is based on an integer linear program that seeks to maximize the cumulative probability of successful vaccination. Our optimization formulation is intractable in practice. We present a heuristic approach that enables us to solve the problem for real-world use-cases. We also present theoretical bounds for the heuristic method. Finally, we show that the proposed approach out-performs baseline methods in terms of vaccination uptake through experimental evaluation. HelpMum is currently planning a pilot program based on our approach to be deployed in the largest city of Nigeria, which would be the first deployment of an AI-driven vaccination uptake program in the country and hopefully, pave the way for other data-driven programs to improve health outcomes in Nigeria.
Adam Żychowski, Jacek Mańdziuk, Elizabeth Bondi, Aravind Venugopal, Milind Tambe, and Balaraman Ravindran. 7/2022. “Evolutionary Approach to Security Games with Signaling.” International Joint Conference on AI (IJCAI) 7/2022. Publisher's Version
Aditya Mate, Arpita Biswas, Christoph Siebenbrunner, Susobhan Ghosh, and Milind Tambe. 5/2022. “Efficient Algorithms for Finite Horizon and Streaming RestlessMulti-Armed Bandit Problems.” In International Conference on Autonomous Agents and Multiagent Systems (AAMAS). streamingbandits-camready-full.pdf
Han-Ching Ou*, Christoph Siebenbrunner*, Jackson Killian, Meredith B Brooks, David Kempe, Yevgeniy Vorobeychik, and Milind Tambe. 5/2022. “Networked Restless Multi-Armed Bandits for Mobile Interventions.” In 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2022). Online. aamas_2022_network_bandit.pdf
Aida Rahmattalabi. 4/26/2022. “Towards Trustworthy and Data-Driven Social Interventions.” PhD Thesis, Computer Science, University of Southern California.Abstract
This thesis examines social interventions conducted to address societal challenges such as homelessness,
substance abuse or suicide. In most of these applications, it is challenging to purposefully
collect data. Hence, we need to rely on social (e.g., social network data) or observational data (e.g.,
administrative data) to guide our decisions. Problematically, these datasets are prone to different
statistical or societal biases. When optimized and evaluated on these data, ostensibly impartial
algorithms may result in disparate impacts across different groups. In addition, these domains
are plagued by limited resources and/or limited data which create a computational challenge with
respect to improving the delivery of these interventions. In this thesis, I investigate the interplay
of fairness and these computational challenges which I present in two parts. In the first part, I
introduce the problem of fairness in social network-based interventions where I propose to use
social network data to enhance interventions that rely on individual’s social connectedness such
as HIV/suicide prevention or community preparedness against natural disasters. I demonstrate
how biases in the social network can manifest as disparate outcomes across groups and describe
my approach to mitigate such unfairness. In the second part, I focus on fairness challenges when
data is observational. Motivated by the homelessness crisis in the U.S., I study the problem of
learning fair resource allocation policies using observational data where I develop a methodology
that handles selection bias in the data. I conclude with a critique on the fairness metrics proposed
in the literature, both causal and observational (statistical), and I present a novel causal view
that addresses the shortcomings of existing approaches. In particular, my findings shed new light
on well-known impossibility results from the fair machine learning literature.
Han-Ching Ou. 3/31/2022. “Sequential Network Planning Problems for Public Health Applications.” PhD Thesis, Computer Science, Harvard University.Abstract

In the past decade, breakthroughs of Artificial Intelligence (AI) in its multiple sub-area have made new applications in various domains possible. One typical yet essential example is the public health domain. There are many challenges for humans in our never-ending battle with diseases. Among them, problems involving harnessing data with network structures and future planning, such as disease control or resource allocation, demand effective solutions significantly. However, unfortunately, some of them are too complicated or unscalable for humans to solve optimally. This thesis tackles these challenging sequential network planning problems for the public health domain by advancing the state-of-the-art to a new level of effectiveness.

In particular, My thesis provides three main contributions to overcome the emerging challenges when applying sequential network planning problems in the public health domain, namely (1) a novel sequential network-based screening/contact tracing framework under uncertainty, (2) a novel sequential network-based mobile interventions framework, (3) theoretical analysis, algorithmic solutions and empirical experiments that shows superior performance compared to previous approaches both theoretically and empirically.

More concretely, the first part of this thesis studies the active screening problem as an emerging application for disease prevention. I introduce a new approach to modeling multi-round network-based screening/contact tracing under uncertainty. Based on the well-known network SIS model in computational epidemiology, which is applicable for many diseases, I propose a model of the multi-agent active screening problem (ACTS) and prove its NP-hardness. I further proposed the REMEDY (REcurrent screening Multi-round Efficient DYnamic agent) algorithm for solving this problem. With a time and solution quality trade-off, REMEDY has two variants, Full- and Fast-REMEDY. It is a Frank-Wolfe-style gradient descent algorithm realized by compacting the representation of belief states to represent uncertainty. As shown in the experiment conducted, Full- and Fast-REMEDY are not only being superior in controlling diseases to all the previous approaches; they are also robust to varying levels of missing
information in the social graph and budget change, thus enabling
the use of our agent to improve the current practice of real-world
screening contexts.

The second part of this thesis focuses on the scalability issue for the time horizon for the ACTS problem. Although Full-REMEDY provides excellent solution qualities, it fails to scale to large time horizons while fully considering the future effect of current interventions. Thus, I proposed a novel reinforcement learning (RL) approach based on Deep Q-Networks (DQN). Due to the nature of the ACTS problem, several challenges that the traditional RL can not handle have emerged, including (1) the combinatorial nature of the problem, (2) the need for sequential planning, and (3) the uncertainties in the infectiousness states of the population. I design several innovative adaptations in my RL approach to address the above challenges. I will introduce why and how these adaptations are made in this part.

For the third part, I introduce a novel sequential network-based mobile interventions framework. It is a restless multi-armed bandits (RMABs) with network pulling effects. In the proposed model, arms are partially recharging and connected through a graph. Pulling one arm also improves the state of neighboring arms, significantly extending the previously studied setting of fully recharging bandits with no network effects. Such network effect may arise due to regular population movements (such as commuting between home and work) for mobile intervention applications. In my thesis, I show that network effects in RMABs induce strong reward coupling that is not accounted for by existing solution methods. I also propose a new solution approach for the networked RMABs by exploiting concavity properties that arise under natural assumptions on the structure of intervention effects. In addition, I show the optimality of such a method in idealized settings and demonstrate that it empirically outperforms state-of-the-art baselines.

Kai Wang, Lily Xu, Andrew Perrault, Michael K. Reiter, and Milind Tambe. 2/22/2022. “Coordinating Followers to Reach Better Equilibria: End-to-End Gradient Descent for Stackelberg Games.” In AAAI Conference on Artificial Intelligence.Abstract
A growing body of work in game theory extends the traditional Stackelberg game to settings with one leader and multiple followers who play a Nash equilibrium. Standard approaches for computing equilibria in these games reformulate the followers' best response as constraints in the leader's optimization problem. These reformulation approaches can sometimes be effective, but make limiting assumptions on the followers' objectives and the equilibrium reached by followers, e.g., uniqueness, optimism, or pessimism. To overcome these limitations, we run gradient descent to update the leader's strategy by differentiating through the equilibrium reached by followers. Our approach generalizes to any stochastic equilibrium selection procedure that chooses from multiple equilibria, where we compute the stochastic gradient by back-propagating through a sampled Nash equilibrium using the solution to a partial differential equation to establish the unbiasedness of the stochastic gradient. Using the unbiased gradient estimate, we implement the gradient-based approach to solve three Stackelberg problems with multiple followers. Our approach consistently outperforms existing baselines to achieve higher utility for the leader.
Elizabeth Bondi, Haipeng Chen, Christopher Golden, Nikhil Behari, and Milind Tambe. 2/20/2022. “Micronutrient Deficiency Prediction via Publicly Available Satellite Data.” In Innovative Applications of Artificial Intelligence (IAAI). mnd_conference_paper_iaai_2.pdf