References
- Agrawal, S., and Goyal, N. (2013), “Thompson Sampling for Contextual Bandits With Linear Payoffs,” in International Conference on Machine Learning, pp. 127–135.
- Audibert, J.-Y., and Tsybakov, A. B. (2007), “Fast Learning Rates for Plug-In Classifiers,” The Annals of Statistics, 35, 608–633. DOI: https://doi.org/10.1214/009053606000001217.
- Auer, P. (2002), “Using Confidence Bounds for Exploitation-Exploration Trade-Offs,” Journal of Machine Learning Research, 3, 397–422.
- Bastani, H., and Bayati, M. (2015), “Online Decision-Making With High-Dimensional Covariates,” available at SSRN: https://ssrn.com/abstract=2661896 or DOI: https://doi.org/10.2139/ssrn.2661896.
- Chambaz, A., Zheng, W., and van der Laan, M. J. (2017), “Targeted Sequential Design for Targeted Learning Inference of the Optimal Treatment Rule and Its Mean Reward,” The Annals of Statistics, 45, 2537. DOI: https://doi.org/10.1214/16-AOS1534.
- Chen, H., Lu, W., and Song, R. (2020), “Statistical Inference for Online Decision-Making: In a Contextual Bandit Setting,” Journal of the American Statistical Association (just-accepted).
- Chen, X., Lee, J. D., Tong, X. T., and Zhang, Y. (2016), “Statistical Inference for Model Parameters in Stochastic Gradient Descent,” arXiv no. 1610.08637.
- Dani, V., Hayes, T. P., and Kakade, S. M. (2008), “Stochastic Linear Optimization Under Bandit Feedback,” in Proceedings of the Workshop on Computational Learning Theory, pp. 355–366.
- Fang, Y., Xu, J., and Yang, L. (2018), “Online Bootstrap Confidence Intervals for the Stochastic Gradient Descent Estimator,” The Journal of Machine Learning Research, 19, 3053–3073.
- Goldenshluger, A., and Zeevi, A. (2013), “A Linear Response Bandit Problem,” Stochastic Systems, 3, 230–261. DOI: https://doi.org/10.1287/11-SSY032.
- Hall, P., and Heyde, C. C. (1980), Martingale Limit Theory and Its Application, New York: Academic Press.
- Kim, E. S., Herbst, R. S., Wistuba, I. I., Lee, J. J., Blumenschein, G. R., Tsao, A., Stewart, D. J., Hicks, M. E., Erasmus, J., Gupta, S., and Alden, C. M. (2011), “The BATTLE Trial: Personalizing Therapy for Lung Cancer,” Cancer Discovery, 1, 44–53. DOI: https://doi.org/10.1158/2159-8274.CD-10-0010.
- Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010), “A Contextual-Bandit Approach to Personalized News Article Recommendation,” in Proceedings of the 19th International Conference on World Wide Web, ACM, pp. 661–670. DOI: https://doi.org/10.1145/1772690.1772758.
- Luedtke, A. R., and van der Laan, M. J. (2016), “Statistical Inference for the Mean Outcome Under a Possibly Non-Unique Optimal Treatment Strategy,” The Annals of Statistics, 44, 713. DOI: https://doi.org/10.1214/15-AOS1384.
- Moulines, E., and Bach, F. R. (2011), “Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning,” in Advances in Neural Information Processing Systems, pp. 451–459.
- Polyak, B. T., and Juditsky, A. B. (1992), “Acceleration of Stochastic Approximation by Averaging,” SIAM Journal on Control and Optimization, 30, 838–855. DOI: https://doi.org/10.1137/0330046.
- Qian, W., and Yang, Y. (2016), “Kernel Estimation and Model Combination in a Bandit Problem With Covariates,” The Journal of Machine Learning Research, 17, 5181–5217.
- Qiang, S., and Bayati, M. (2016), “Dynamic Pricing With Demand Covariates,” available at SSRN: https://ssrn.com/abstract=2765257.
- Robbins, H. (1952), “Some Aspects of the Sequential Design of Experiments,” Bulletin of the American Mathematical Society, 58, 527–535. DOI: https://doi.org/10.1090/S0002-9904-1952-09620-8.
- Robbins, H., and Monro, S. (1951), “A Stochastic Approximation Method,” The Annals of Mathematical Statistics, 22, 400–407. DOI: https://doi.org/10.1214/aoms/1177729586.
- Ruppert, D. (1988), “Efficient Estimations From a Slowly Convergent Robbins–Monro Process,” Technical Report, Cornell University Operations Research and Industrial Engineering.
- Sutton, R. S., and Barto, A. G. (2018), Reinforcement Learning: An Introduction, Cambridge, MA: MIT Press.
- Sutton, R. S., Mahmood, A. R., and White, M. (2016), “An Emphatic Approach to the Problem of Off-Policy Temporal-Difference Learning,” The Journal of Machine Learning Research, 17, 2603–2631.
- Tewari, A., and Murphy, S. A. (2017), “From Ads to Interventions: Contextual Bandits in Mobile Health,” in Mobile Health, eds. J. Rehg, S. Murphy, and S. Kumar, Cham: Springer, pp. 495–517.
- Tsiatis, A. A., Davidian, M., Holloway, S. T., and Laber, E. B. (2019), Introduction to Dynamic Treatment Regimes: Statistical Methods for Precision Medicine, Boca Raton, FL: Chapman & Hall.
- Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013), “Finite-Time Analysis of Kernelised Contextual Bandits,” arXiv no. 1309.6869.
- Woodroofe, M. (1979), “A One-Armed Bandit Problem With a Concomitant Variable,” Journal of the American Statistical Association, 74, 799–806. DOI: https://doi.org/10.1080/01621459.1979.10481033.
- Yang, Y., and Zhu, D. (2002), “Randomized Allocation With Nonparametric Estimation for a Multi-Armed Bandit Problem With Covariates,” The Annals of Statistics, 30, 100–121. DOI: https://doi.org/10.1214/aos/1015362186.
- Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2012), “A Robust Method for Estimating Optimal Treatment Regimes,” Biometrics, 68, 1010–1018. DOI: https://doi.org/10.1111/j.1541-0420.2012.01763.x.