Kai Wang, Sanket Shah, Haipeng Chen, Andrew Perrault, Finale Doshi-Velez, and Milind Tambe. 12/2021. “Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning.” In NeurIPS 2021 (spotlight).
In the predict-then-optimize framework, the objective is to train a predictive model, mapping from environment features to parameters of an optimization problem, which maximizes decision quality when the optimization is subsequently solved. Recent work on decision-focused learning shows that embedding the optimization problem in the training pipeline can improve decision quality and help generalize better to unseen tasks compared to relying on an intermediate loss function for evaluating prediction quality. We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) that are solved via reinforcement learning. In particular, we are given environment features and a set of trajectories from training MDPs, which we use to train a predictive model that generalizes to unseen test MDPs without trajectories. Two significant computational challenges arise in applying decision-focused learning to MDPs: (i) large state and action spaces make it infeasible for existing techniques to differentiate through MDP problems, and (ii) the high-dimensional policy space, as parameterized by a neural network, makes differentiating through a policy expensive. We resolve the first challenge by sampling provably unbiased derivatives to approximate and differentiate through optimality conditions, and the second challenge by using a low-rank approximation to the high-dimensional sample-based derivatives. We implement both Bellman-based and policy-gradient-based decision-focused learning on three different MDP problems with missing parameters, and show that decision-focused learning performs better in generalization to unseen tasks.
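The contrast between predicting parameters for their own sake and training through the downstream decision can be illustrated on a one-step analogue of the paper's sequential setting. The sketch below is purely illustrative (a linear model, a softmax-smoothed "optimize" step, and a hand-derived gradient); the paper's actual method handles full MDPs via sampled unbiased derivatives and low-rank approximations, none of which appear here.

```python
import numpy as np

# Decision-focused learning on a one-step toy problem: a model predicts item
# rewards from features, and we ascend the TRUE reward of the (softmax-smoothed)
# chosen item, differentiating through the decision. Illustrative only.
rng = np.random.default_rng(0)
n_items, n_feats = 5, 3
W_true = rng.normal(size=n_feats)              # hidden mapping: features -> rewards
X = rng.normal(size=(n_items, n_feats))        # environment features
true_r = X @ W_true                            # true rewards (unseen by the model)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

w = np.zeros(n_feats)                          # parameters of the predictive model
lr, temp = 0.2, 1.0
for _ in range(300):
    pred_r = X @ w                             # "predict" step
    pi = softmax(pred_r / temp)                # smoothed "optimize" step
    jac = (np.diag(pi) - np.outer(pi, pi)) / temp   # d pi / d pred_r
    grad_w = X.T @ (jac @ true_r)              # d (decision quality) / d w
    w += lr * grad_w                           # train on decision quality directly

pi = softmax((X @ w) / temp)
dq = pi @ true_r                               # decision quality achieved
print(f"decision quality {dq:.3f} vs best possible {true_r.max():.3f}")
```

Training on a prediction loss instead would fit all rewards equally well; the decision-focused gradient spends model capacity only on distinctions that change the chosen item.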
Aditya Mate*, Lovish Madaan*, Aparna Taneja, Neha Madhiwalla, Shresth Verma, Gargi Singh, Aparna Hegde, Pradeep Varakantham, and Milind Tambe. 12/2021. “Restless Bandits in the Field: Real-World Study for Improving Maternal and Child Health Outcomes.” In MLPH: Machine Learning in Public Health NeurIPS 2021 Workshop.

The widespread availability of cell phones has enabled non-profits to deliver critical health information to their beneficiaries in a timely manner. This paper describes our work in assisting non-profits that employ automated messaging programs to deliver timely preventive care information to new and expecting mothers during pregnancy and after delivery. Unfortunately, a key challenge in such information delivery programs is that a significant fraction of beneficiaries tend to drop out. Yet, non-profits often have limited health-worker resources (time) to place crucial service calls for live interaction with beneficiaries to prevent such engagement drops. To assist non-profits in optimizing this limited resource, we developed a Restless Multi-Armed Bandit (RMAB) system. One key technical contribution of this system is a novel clustering method for offline historical data to infer unknown RMAB parameters. Our second major contribution is the evaluation of our RMAB system in collaboration with an NGO, via a real-world service quality improvement study. The study compared strategies for optimizing service calls to 23,003 participants over a period of 7 weeks to reduce engagement drops. We show that the RMAB group provides statistically significant improvement over other comparison groups, reducing engagement drops by 30%. To the best of our knowledge, this is the first study demonstrating the utility of RMABs in real-world public health settings. We are transitioning our system to the NGO for real-world use.
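The core estimation idea, pooling short offline trajectories across similar beneficiaries to infer shared RMAB transition probabilities, can be sketched as below. The clustering rule (a simple bucketing by empirical engagement rate), the two-state engagement chain, and all numbers are illustrative stand-ins for the paper's actual method and data.

```python
import numpy as np

# Pool trajectories from similar beneficiaries to estimate a shared
# P(engaged -> engaged) per cluster; one beneficiary's 8-week history alone
# is far too short to estimate this reliably. Illustrative sketch only.
rng = np.random.default_rng(7)
n, horizon, k = 600, 8, 3
true_stay = rng.choice([0.25, 0.55, 0.85], size=n)  # hidden per-person parameter

traj = np.zeros((n, horizon), dtype=int)            # binary engagement chains
traj[:, 0] = 1                                      # everyone starts engaged
for t in range(1, horizon):
    p = np.where(traj[:, t - 1] == 1, true_stay, 0.1)  # P(lapsed -> engaged) = 0.1
    traj[:, t] = (rng.random(n) < p).astype(int)

# "cluster" by empirical engagement rate (tercile buckets)
rate = traj.mean(axis=1)
labels = np.digitize(rate, np.quantile(rate, [1 / 3, 2 / 3]))

est = []
for c in range(k):
    rows = traj[labels == c]
    prev, nxt = rows[:, :-1].ravel(), rows[:, 1:].ravel()
    est.append(nxt[prev == 1].mean())               # pooled P(stay engaged)
print("estimated P(stay engaged) per cluster:", np.round(sorted(est), 2))
```

Even this crude bucketing recovers clearly separated cluster-level parameters, which is the quantity an RMAB planner needs to prioritize service calls.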

Lily Xu. 10/24/2021. “Learning, Optimization, and Planning Under Uncertainty for Wildlife Conservation.” INFORMS Doing Good with Good OR.

Wildlife poaching fuels the multi-billion dollar illegal wildlife trade and pushes countless species to the brink of extinction. To aid rangers in preventing poaching in protected areas around the world, we have developed PAWS, the Protection Assistant for Wildlife Security. We present technical advances in multi-armed bandits and robust sequential decision-making using reinforcement learning, with research questions that emerged from on-the-ground challenges. We also discuss bridging the gap between research and practice, presenting results from field deployment in Cambodia and large-scale deployment through integration with SMART, the leading software system for protected area management used by over 1,000 wildlife parks worldwide.

Ramesha Karunasena, Mohammad Sarparajul Ambiya, Arunesh Sinha, Ruchit Nagar, Saachi Dalal, Divy Thakkar, Dhyanesh Narayanan, and Milind Tambe. 10/5/2021. “Measuring Data Collection Diligence for Community Healthcare.” In ACM conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO '21).
Lily Xu. 8/21/2021. “Learning and Planning Under Uncertainty for Green Security.” 30th International Joint Conference on Artificial Intelligence (IJCAI).
Arpita Biswas, Gaurav Aggarwal, Pradeep Varakantham, and Milind Tambe. 8/2021. “Learn to Intervene: An Adaptive Learning Policy for Restless Bandits in Application to Preventive Healthcare.” In International Joint Conference on Artificial Intelligence (IJCAI).
In many public health settings, it is important for patients to adhere to health programs, such as taking medications and attending periodic health checks. Unfortunately, beneficiaries may gradually disengage from such programs, which is detrimental to their health. A concrete example of gradual disengagement has been observed by an organization that carries out a free automated call-based program for spreading preventive care information among pregnant women. Many women stop picking up calls after being enrolled for a few months. To avoid such disengagements, it is important to provide timely interventions. Such interventions are often expensive, and can be provided to only a small fraction of the beneficiaries. We model this scenario as a restless multi-armed bandit (RMAB) problem, where each beneficiary is assumed to transition from one state to another depending on the intervention. Moreover, since the transition probabilities are unknown a priori, we propose a Whittle index based Q-Learning mechanism and show that it converges to the optimal solution. Our method improves over existing learning-based methods for RMABs on multiple benchmarks from the literature and also on the maternal healthcare dataset.
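The general recipe behind Whittle-index-style learning can be sketched for a single two-state arm: learn Q-values under a subsidy for staying passive, and find the subsidy at which passive and active actions become equally attractive, which is the Whittle index. The transition matrices, the bisection-over-subsidy scheme, and all constants below are illustrative, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
# one restless arm: rows = current state (0 = disengaged, 1 = engaged)
P = {0: np.array([[0.9, 0.1], [0.4, 0.6]]),   # passive transition matrix
     1: np.array([[0.6, 0.4], [0.1, 0.9]])}   # active: intervention helps
R = np.array([0.0, 1.0])                      # reward of occupying each state

def q_learn(subsidy, episodes=4000, gamma=0.95, alpha=0.1, eps=0.2):
    """Tabular Q-learning for one arm, with a subsidy for staying passive."""
    Q = np.zeros((2, 2))
    s = 0
    for _ in range(episodes):
        a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
        s2 = int(rng.random() < P[a][s, 1])   # sample next state
        r = R[s] + (subsidy if a == 0 else 0.0)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
    return Q

def whittle(s, lo=-2.0, hi=2.0, iters=10):
    """Bisect on the subsidy until passive and active are equally attractive."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        Q = q_learn(mid)
        lo, hi = (mid, hi) if Q[s, 1] > Q[s, 0] else (lo, mid)
    return (lo + hi) / 2

Q0 = q_learn(0.0)     # with no subsidy, intervening should win in state 0
idx = whittle(0)
print("estimated Whittle index at state 0:", round(idx, 2))
```

In a full RMAB, each arm's learned index is computed this way and the limited interventions go to the arms with the highest indices.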
Jackson A Killian, Arpita Biswas, Sanket Shah, and Milind Tambe. 8/2021. “Q-Learning Lagrange Policies for Multi-Action Restless Bandits.” Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Lily Xu, Andrew Perrault, Fei Fang, Haipeng Chen, and Milind Tambe. 7/27/2021. “Robust Reinforcement Learning Under Minimax Regret for Green Security.” Conference on Uncertainty in Artificial Intelligence (UAI).
Green security domains feature defenders who plan patrols in the face of uncertainty about the adversarial behavior of poachers, illegal loggers, and illegal fishers. Importantly, the deterrence effect of patrols on adversaries' future behavior makes patrol planning a sequential decision-making problem. Therefore, we focus on robust sequential patrol planning for green security following the minimax regret criterion, which has not been considered in the literature. We formulate the problem as a game between the defender and nature, who controls the parameter values of the adversarial behavior, and design an algorithm, MIRROR, to find a robust policy. MIRROR uses two reinforcement learning-based oracles and solves a restricted game considering limited defender strategies and parameter values. We evaluate MIRROR on real-world poaching data.
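The minimax regret criterion itself can be illustrated on a tiny matrix game, with rows as defender policies and columns as nature's parameter settings: the regret of a policy under given parameters is its gap to the best policy for those parameters, and the defender seeks a mixed strategy minimizing worst-case expected regret. The payoffs below are made up, and the brute-force grid search stands in for MIRROR's RL-based double oracle.

```python
import numpy as np
from itertools import product

# rows = defender policies (A, B, C); cols = nature's parameter settings
U = np.array([[5.0, 1.0],      # A: strong under theta1, weak under theta2
              [2.0, 4.0],      # B: the reverse
              [3.0, 3.0]])     # C: hedged pure policy
regret = U.max(axis=0, keepdims=True) - U     # column-wise regret

grid = np.linspace(0, 1, 101)
best_p, best_val = None, np.inf
for a, b in product(grid, grid):              # search the 2-simplex
    if a + b > 1 + 1e-9:
        continue
    p = np.array([a, b, 1 - a - b])           # mixed strategy over the 3 policies
    val = (p @ regret).max()                  # nature picks the worst column
    if val < best_val:
        best_val, best_p = val, p
print("minimax regret:", round(best_val, 3), "mixture:", best_p.round(2))
```

Here the best pure policy (C) has worst-case regret 2, while mixing A and B equally achieves 1.5, showing why robust planning needs mixed (or randomized) policies, not just a single hedged one.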
Haipeng Chen, Wei Qiu, Han-Ching Ou, Bo An, and Milind Tambe. 7/25/2021. “Contingency-Aware Influence Maximization: A Reinforcement Learning Approach.” In Conference on Uncertainty in Artificial Intelligence.
Bryan Wilder. 7/15/2021. “AI for Population Health: Melding Data and Algorithms on Networks.” PhD Thesis, Computer Science, Harvard University.
Edward Cranford, Cleotilde Gonzalez, Palvi Aggarwal, Milind Tambe, Sarah Cooney, and Christian Lebiere. 7/7/2021. “Towards a Cognitive Theory of Cyber Deception.” Cognitive Science.
Sushant Agarwal, Shahin Jabbari, Chirag Agarwal, Sohini Upadhyay, Zhiwei Steven Wu, and Himabindu Lakkaraju. 7/1/2021. “Towards the Unification and Robustness of Perturbation and Gradient Based Explanations.” Proceedings of the 38th International Conference on Machine Learning. Virtual Only.
As machine learning black boxes are increasingly being deployed in critical domains such as healthcare and criminal justice, there has been a growing emphasis on developing techniques for explaining these black boxes in a post hoc manner. In this work, we analyze two popular post hoc interpretation techniques: SmoothGrad, which is a gradient based method, and a variant of LIME, which is a perturbation based method. More specifically, we derive explicit closed form expressions for the explanations output by these two methods and show that they both converge to the same explanation in expectation, i.e., when the number of perturbed samples used by these methods is large. We then leverage this connection to establish other desirable properties, such as robustness, for these techniques. We also derive finite sample complexity bounds for the number of perturbations required for these methods to converge to their expected explanation. Finally, we empirically validate our theory using extensive experimentation on both synthetic and real world datasets.
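SmoothGrad, one of the two methods analyzed, simply averages the model's gradient over Gaussian perturbations of the input. The sketch below checks a special case of the convergence-in-expectation story on a quadratic model, where the gradient is linear in the input, so the sample average converges to the plain gradient; the model and numbers are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.5, 1.0]])        # model f(x) = 0.5 x^T A x + b^T x
b = np.array([1.0, -1.0])

def grad_f(x):
    return A @ x + b                          # exact gradient of the quadratic

def smoothgrad(x, sigma=0.5, n=20000):
    """Average the gradient over n Gaussian perturbations of x."""
    noise = sigma * rng.normal(size=(n, x.size))
    grads = (x + noise) @ A + b               # gradient is linear, so vectorize
    return grads.mean(axis=0)

x0 = np.array([1.0, 2.0])
sg = smoothgrad(x0)
print("SmoothGrad:", sg.round(3), "plain gradient:", grad_f(x0))
```

The Monte Carlo error shrinks like 1/sqrt(n), which is the shape of the finite-sample behavior the paper bounds; for non-quadratic models the two quantities differ, and the paper characterizes the expected explanation in closed form.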
Elizabeth Bondi, Catherine Ressijac, and Peter Boucher. 6/25/2021. “Preliminary Detection of Rhino Middens for Understanding Rhino Behavior.” CVPR 2021 Workshop on Computer Vision for Animal Behavior Tracking and Modeling.
Eric Rice, Bryan Wilder, Laura Onasch-Vera, Graham Diguiseppi, Robin Petering, Chyna Hill, Amulya Yadav, Sung-Jae Lee, and Milind Tambe. 6/21/2021. “A Peer-Led, Artificial Intelligence-Augmented Social Network Intervention to Prevent HIV among Youth Experiencing Homelessness.” To appear in the Journal of Acquired Immune Deficiency Syndromes (JAIDS).
Youth experiencing homelessness (YEH) are at elevated risk for HIV/AIDS and disproportionately identify as racial, ethnic, sexual, and gender minorities. We developed a new peer change agent (PCA) HIV prevention intervention with three arms: (1) an arm using an Artificial Intelligence (AI) planning algorithm to select PCAs; (2) a popularity arm, the standard PCA approach, operationalized as highest degree centrality (DC); and (3) an observation-only comparison group.
Arpita Biswas and Suvam Mukherjee. 5/19/2021. “Ensuring Fairness under Prior Probability Shifts.” In AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES).
Elizabeth Bondi*, Lily Xu*, Diana Acosta-Navas, and Jackson A. Killian. 5/19/2021. “Envisioning Communities: A Participatory Approach Towards AI for Social Good.” In AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES).
Research in artificial intelligence (AI) for social good presupposes some definition of social good, but potential definitions have seldom been suggested and never agreed upon. The normative question of what AI for social good research should be "for" is not thoughtfully elaborated, or is frequently addressed with a utilitarian outlook that prioritizes the needs of the majority over those who have been historically marginalized, brushing aside realities of injustice and inequity. We argue that AI for social good ought to be assessed by the communities that the AI system will impact, using as a guide the capabilities approach, a framework to measure the ability of different policies to improve human welfare equity. Furthermore, we lay out how AI research has the potential to catalyze social progress by expanding and equalizing capabilities. We show how the capabilities approach aligns with a participatory approach for the design and implementation of AI for social good research in a framework we introduce called PACT, in which community members affected should be brought in as partners and their input prioritized throughout the project. We conclude by providing an incomplete set of guiding questions for carrying out such participatory AI research in a way that elicits and respects a community's own definition of social good.
Arpita Biswas, Gaurav Aggarwal, Pradeep Varakantham, and Milind Tambe. 5/7/2021. “Learning Index Policies for Restless Bandits with Application to Maternal Healthcare (Extended abstract).” In International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
Aditya Mate, Andrew Perrault, and Milind Tambe. 5/7/2021. “Risk-Aware Interventions in Public Health: Planning with Restless Multi-Armed Bandits.” In 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). London, UK.
Community Health Workers (CHWs) form an important component of health-care systems globally, especially in low-resource settings. CHWs are often tasked with monitoring the health of and intervening on their patient cohort. Previous work has developed several classes of Restless Multi-Armed Bandits (RMABs) that are computationally tractable and indexable, a condition that guarantees asymptotic optimality, for solving such health monitoring and intervention problems (HMIPs).
However, existing solutions to HMIPs fail to account for risk-sensitivity considerations of CHWs in the planning stage and may run the danger of ignoring some patients completely because they are deemed less valuable to intervene on.
Additionally, these solutions rely on patients accurately reporting their state of adherence when intervened upon. Towards tackling these issues, our contributions in this paper are as follows:
(1) We develop an RMAB solution to HMIPs that allows for reward functions that are monotone increasing, rather than linear, in the belief state and also supports a wider class of observations.
(2) We prove theoretical guarantees on the asymptotic optimality of our algorithm for any arbitrary reward function. Additionally, we show that for the specific reward function considered in previous work, our theoretical conditions are stronger than the state-of-the-art guarantees.
(3) We show the applicability of these new results for addressing the three issues highlighted above: risk-sensitive planning, equitable allocation, and reliance on perfect observations. We evaluate these techniques on both simulated and real data from a prevalent CHW task, monitoring the adherence of tuberculosis patients to their prescribed medication in Mumbai, India, and show improved performance over the state-of-the-art. The simulation code is available at: https://github.com/AdityaMate/risk-aware-bandits.
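One way to see why a monotone increasing (here concave) reward on the belief state changes planning, as described in contribution (1), is to compare marginal gains from a stylized intervention under linear versus concave rewards. The two-state adherence chain, the fixed belief "boost", and all numbers below are illustrative, not the paper's model.

```python
import numpy as np

p11, p01 = 0.9, 0.3                  # P(adhere -> adhere), P(lapse -> adhere), passive

def belief_step(b):
    """One passive step of the belief b = P(patient is adhering)."""
    return b * p11 + (1 - b) * p01   # linear update; fixed point at b = 0.75

f = np.sqrt                          # a monotone increasing, concave reward on beliefs

beliefs = np.array([0.1, 0.5, 0.8])  # current beliefs for three patients
boost = 0.2                          # stylized effect of intervening on a belief
lin_gain = (beliefs + boost) - beliefs          # linear reward: all gains equal
con_gain = f(beliefs + boost) - f(beliefs)      # concave reward: favors low beliefs
print("linear gains:", lin_gain, "concave gains:", con_gain.round(3))
```

Under the linear reward every patient looks equally valuable to intervene on, so a planner can permanently ignore those deemed unpromising; the concave reward assigns the largest marginal gain to the lowest-belief (most at-risk) patient, one stylized way a richer reward class can encode risk sensitivity and more equitable allocation.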
Aravind Venugopal, Elizabeth Bondi, Harshavardhan Kamarthi, Keval Dholakia, Balaraman Ravindran, and Milind Tambe. 5/5/2021. “Reinforcement Learning for Unified Allocation and Patrolling in Signaling Games with Uncertainty.” In 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
Lily Xu, Andrew Perrault, Fei Fang, Haipeng Chen, and Milind Tambe. 5/5/2021. “Robustness in Green Security: Minimax Regret Optimality with Reinforcement Learning.” AAMAS Workshop on Autonomous Agents for Social Good.