Substantial work has investigated balancing exploration and exploitation, but relatively little has addressed this tradeoff in the context of coordinated
multi-agent interactions. This paper introduces a class of problems in which
agents must maximize their on-line reward, a decomposable function dependent
on pairs of agents' decisions. Unlike previous work, agents must both learn the
reward function and exploit it on-line, properties critical for a class of physically-motivated systems, such as mobile wireless networks. This paper introduces algorithms motivated by the Distributed Constraint Optimization Problem framework
and demonstrates when, and at what cost, increasing agents’ coordination can improve the global reward on such problems.
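
As a minimal sketch of the reward structure described above (the notation $E$, $r_{ij}$, $x_i^t$, and the horizon $T$ are assumptions introduced here, not defined in the abstract), the on-line reward decomposes over pairs of interacting agents and accumulates over the interaction horizon:
\[
R \;=\; \sum_{t=1}^{T} \sum_{(i,j)\in E} r_{ij}\!\left(x_i^t,\, x_j^t\right),
\]
where $x_i^t$ is agent $i$'s decision at time $t$, $E$ is the set of interacting agent pairs, and each pairwise reward $r_{ij}$ is initially unknown, so agents must estimate it from observations while simultaneously accumulating reward.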