The Statistical Decision Making (SDM) lab’s paper has been accepted to the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), one of the top three conferences in artificial intelligence and machine learning.
A challenging aspect of the bandit problem is that a stochastic reward is observed only for the chosen arm, while the rewards of the other arms remain missing. The dependence of the arm choice on past context–reward pairs further compounds the complexity of regret analysis. We propose a novel multi-armed contextual bandit algorithm called Doubly Robust (DR) Thompson Sampling, which applies the doubly robust estimator from the missing-data literature to Thompson Sampling with linear contextual payoffs (LinTS).
The proposed algorithm enjoys an improved regret bound compared to LinTS. It is also the first regret bound for LinTS expressed in terms of the minimum eigenvalue of the covariance matrix of contexts rather than the dimension.
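The core doubly robust idea can be illustrated with a minimal sketch. This is not the paper’s algorithm, only an illustration of how a doubly robust estimator imputes pseudo-rewards for the unchosen arms: every arm gets its model prediction, and the chosen arm’s imputation is corrected by an inverse-probability-weighted residual of the observed reward. All variable names and the simple linear reward model below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms = 3, 4                          # context dimension, number of arms
beta_hat = rng.normal(size=d)             # current estimate of the reward parameter
contexts = rng.normal(size=(n_arms, d))   # one context vector per arm

chosen = 2                                # arm actually pulled this round
p_chosen = 0.25                           # probability the policy chose this arm
# Observed (noisy) reward for the chosen arm only; other arms' rewards are missing.
reward = contexts[chosen] @ beta_hat + rng.normal(scale=0.1)

# Doubly robust pseudo-rewards: start from the model-based imputation for all
# arms, then add the inverse-probability-weighted residual for the chosen arm.
# The resulting estimate is unbiased if either the reward model or the
# choice probability is correct (hence "doubly" robust).
pseudo_rewards = contexts @ beta_hat
pseudo_rewards[chosen] += (reward - contexts[chosen] @ beta_hat) / p_chosen

print(pseudo_rewards)
```

With pseudo-rewards available for every arm, all context vectors (not just the chosen one) can contribute to the parameter update, which is what lets the regret analysis involve the minimum eigenvalue of the context covariance matrix.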
“Doubly Robust Thompson Sampling with Linear Payoffs” by Wonyoung Kim, Gi-Soo Kim, Myunghee Cho Paik.