Theory and Methods

Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization

Received 09 Aug 2021, Accepted 11 Jul 2023, Published online: 20 Jul 2023

References

  • Abbeel, P., and Ng, A. Y. (2004), “Apprenticeship Learning via Inverse Reinforcement Learning,” in Proceedings of the Twenty-First International Conference on Machine Learning, p. 1. DOI: 10.1145/1015330.1015430.
  • Audibert, J.-Y., and Tsybakov, A. B. (2007), “Fast Learning Rates for Plug-in Classifiers,” The Annals of Statistics, 35, 608–633. DOI: 10.1214/009053606000001217.
  • Bertsekas, D. P., and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming (Vol. 5), Belmont, MA: Athena Scientific.
  • Bhandari, J., Russo, D., and Singal, R. (2018), “A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation,” arXiv preprint arXiv:1806.02450.
  • Bishop, C. (1994), “Mixture Density Networks,” Technical Report, pp. 1–26.
  • Bradley, R. C. (2005), “Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions,” Probability Surveys, 2, 107–144. DOI: 10.1214/154957805100000104.
  • Chakraborty, B., and Moodie, E. (2013), Statistical Methods for Dynamic Treatment Regimes, New York: Springer.
  • Chen, X., and Qi, Z. (2022), “On Well-posedness and Minimax Optimal Rates of Nonparametric q-function Estimation in Off-policy Evaluation,” in International Conference on Machine Learning, PMLR, pp. 3558–3582.
  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), “Double/Debiased Machine Learning for Treatment and Structural Parameters,” The Econometrics Journal, 21, C1–C68. DOI: 10.1111/ectj.12097.
  • Chernozhukov, V., Chetverikov, D., and Kato, K. (2014), “Gaussian Approximation of Suprema of Empirical Processes,” The Annals of Statistics, 42, 1564–1597. DOI: 10.1214/14-AOS1230.
  • Degris, T., White, M., and Sutton, R. S. (2012), "Off-Policy Actor-Critic," in Proceedings of the 29th International Conference on Machine Learning, pp. 179–186.
  • Ernst, D., Geurts, P., and Wehenkel, L. (2005), "Tree-based Batch Mode Reinforcement Learning," Journal of Machine Learning Research, 6, 503–556.
  • Ertefaie, A., and Strawderman, R. L. (2018), “Constructing Dynamic Treatment Regimes over Indefinite Time Horizons,” Biometrika, 105, 963–977. DOI: 10.1093/biomet/asy043.
  • Fan, J., Wang, Z., Xie, Y., and Yang, Z. (2020), "A Theoretical Analysis of Deep q-Learning," in Learning for Dynamics and Control, PMLR, pp. 486–489.
  • Farahmand, A.-m., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. (2016), “Regularized Policy Iteration with Nonparametric Function Spaces,” The Journal of Machine Learning Research, 17, 4809–4874.
  • Feng, Y., Ren, T., Tang, Z., and Liu, Q. (2020), “Accountable Off-Policy Evaluation with Kernel Bellman Statistics,” in International Conference on Machine Learning, PMLR, pp. 3102–3111.
  • Harvey, N., Liaw, C., and Mehrabian, A. (2017), "Nearly-Tight VC-Dimension Bounds for Piecewise Linear Neural Networks," in Conference on Learning Theory, PMLR, pp. 1064–1068.
  • Hu, X., Qian, M., Cheng, B., and Cheung, Y. K. (2021), “Personalized Policy Learning using Longitudinal Mobile Health Data,” Journal of the American Statistical Association, 116, 410–420. DOI: 10.1080/01621459.2020.1785476.
  • Hubbs, C. D., Perez, H. D., Sarwar, O., Sahinidis, N. V., Grossmann, I. E., and Wassick, J. M. (2020), "OR-Gym: A Reinforcement Learning Library for Operations Research Problems," arXiv preprint arXiv:2008.06319.
  • Hunter, D. R., and Lange, K. (2004), "A Tutorial on MM Algorithms," The American Statistician, 58, 30–37. DOI: 10.1198/0003130042836.
  • Jiang, Z., Yang, S., and Ding, P. (2020), “Multiply Robust Estimation of Causal Effects Under Principal Ignorability,” arXiv preprint arXiv:2012.01615.
  • Kakade, S., and Langford, J. (2002), “Approximately Optimal Approximate Reinforcement Learning,” in ICML (Vol. 2), pp. 267–274.
  • Kallus, N., and Uehara, M. (2019), “Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes,” arXiv preprint arXiv:1909.05850.
  • Kallus, N., and Uehara, M. (2020), "Statistically Efficient Off-Policy Policy Gradients," in International Conference on Machine Learning, PMLR, pp. 5089–5100.
  • Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., and Faisal, A. A. (2018), “The Artificial Intelligence Clinician Learns Optimal Treatment Strategies for Sepsis in Intensive Care,” Nature Medicine, 24, 1716–1720. DOI: 10.1038/s41591-018-0213-5.
  • Kosorok, M. R., and Laber, E. B. (2019), “Precision Medicine,” Annual Review of Statistics and its Application, 6, 263–286. DOI: 10.1146/annurev-statistics-030718-105251.
  • Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020), “Conservative q-learning for Offline Reinforcement Learning,” arXiv preprint arXiv:2006.04779.
  • Laber, E. B., Lizotte, D. J., Qian, M., Pelham, W. E., and Murphy, S. A. (2014), “Dynamic Treatment Regimes: Technical Challenges and Applications,” Electronic Journal of Statistics, 8, 1225–1272. DOI: 10.1214/14-ejs920.
  • Le, H., Voloshin, C., and Yue, Y. (2019), “Batch Policy Learning Under Constraints,” in International Conference on Machine Learning, pp. 3703–3712.
  • LeCun, Y., Bengio, Y., and Hinton, G. (2015), “Deep Learning,” Nature, 521, 436–444. DOI: 10.1038/nature14539.
  • Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020), “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems,” arXiv preprint arXiv:2005.01643.
  • Li, Y. (2017), “Deep Reinforcement Learning: An Overview,” arXiv preprint arXiv:1701.07274.
  • Liao, P., Klasnja, P., and Murphy, S. (2019), “Off-Policy Estimation of Long-Term Average Outcomes with Applications to Mobile Health,” arXiv preprint arXiv:1912.13088.
  • Liao, P., Qi, Z., and Murphy, S. (2020), “Batch Policy Learning in Average Reward Markov Decision Processes,” arXiv preprint arXiv:2007.11771.
  • Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018), “Breaking the Curse of Horizon: Infinite-Horizon Off-policy Estimation,” in Advances in Neural Information Processing Systems, pp. 5356–5366.
  • Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E., and Kosorok, M. R. (2020), “Estimating Dynamic Treatment Regimes in Mobile Health Using v-learning,” Journal of the American Statistical Association, 115, 692–706. DOI: 10.1080/01621459.2018.1537919.
  • Luedtke, A. R., and Van Der Laan, M. J. (2016), “Statistical Inference for the Mean Outcome Under a Possibly Non-unique Optimal Treatment Strategy,” Annals of Statistics, 44, 713–742.
  • Marcolino, M. S., Oliveira, J. A. Q., D’Agostino, M., Ribeiro, A. L., Alkmim, M. B. M., and Novillo-Ortiz, D. (2018), "The Impact of mHealth Interventions: Systematic Review of Systematic Reviews," JMIR mHealth and uHealth, 6, e23. DOI: 10.2196/mhealth.8873.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G. et al. (2015), “Human-Level Control through Deep Reinforcement Learning,” Nature, 518, 529–533. DOI: 10.1038/nature14236.
  • Murphy, S. A. (2003), “Optimal Dynamic Treatment Regimes,” Journal of the Royal Statistical Society, Series B, 65, 331–355. DOI: 10.1111/1467-9868.00389.
  • Puterman, M. L. (1994), Markov Decision Processes: Discrete Stochastic Dynamic Programming, Hoboken, NJ: Wiley.
  • Qian, M., and Murphy, S. A. (2011), “Performance Guarantees for Individualized Treatment Rules,” Annals of Statistics, 39, 1180–1210.
  • Rust, J. (1987), “Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher,” Econometrica: Journal of the Econometric Society, 55, 999–1033. DOI: 10.2307/1911259.
  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015), "Trust Region Policy Optimization," in International Conference on Machine Learning, pp. 1889–1897.
  • Shi, C., Fan, A., Song, R., and Lu, W. (2018), “High-Dimensional a-learning for Optimal Dynamic Treatment Regimes,” Annals of Statistics, 46, 925. DOI: 10.1214/17-AOS1570.
  • Shi, C., Lu, W., and Song, R. (2020), “Breaking the Curse of Nonregularity with Subagging: Inference of the Mean Outcome Under Optimal Treatment Regimes,” Journal of Machine Learning Research, accepted.
  • Shi, C., Wan, R., Chernozhukov, V., and Song, R. (2021), “Deeply-Debiased Off-Policy Interval Estimation,” in Proceedings of the 38th International Conference on Machine Learning, Vol. 139 of Proceedings of Machine Learning Research, PMLR, pp. 9580–9591.
  • Shi, C., Wan, R., Song, R., Lu, W., and Leng, L. (2020a), “Does the Markov Decision Process Fit the Data: Testing for the Markov Property in Sequential Decision Making,” in International Conference on Machine Learning, PMLR, pp. 8807–8817.
  • Shi, C., Zhang, S., Lu, W., and Song, R. (2020b), "Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings," arXiv preprint arXiv:2001.04515.
  • Shi, X., Miao, W., Nelson, J. C., and Tchetgen Tchetgen, E. J. (2020c), “Multiply Robust Causal Inference with Double-Negative Control Adjustment for Categorical Unmeasured Confounding,” Journal of the Royal Statistical Society, Series B, 82, 521–540. DOI: 10.1111/rssb.12361.
  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016), “Mastering the Game of Go with Deep Neural Networks and Tree Search,” Nature, 529, 484–489. DOI: 10.1038/nature16961.
  • Sutton, R. S., and Barto, A. G. (2018), Reinforcement Learning: An Introduction, Cambridge, MA: MIT Press.
  • Tchetgen Tchetgen, E. J., and Shpitser, I. (2012), “Semiparametric Theory for Causal Mediation Analysis: Efficiency Bounds, Multiple Robustness, and Sensitivity Analysis,” Annals of Statistics, 40, 1816.
  • Tsiatis, A. A., Davidian, M., Holloway, S. T., and Laber, E. B. (2019), Dynamic Treatment Regimes: Statistical Methods for Precision Medicine, Boca Raton, FL: CRC Press.
  • Tsybakov, A. B. (2004), “Optimal Aggregation of Classifiers in Statistical Learning,” The Annals of Statistics, 32, 135–166. DOI: 10.1214/aos/1079120131.
  • Wang, J., Qi, Z., and Wong, R. K. (2021), "Projected State-Action Balancing Weights for Offline Reinforcement Learning," arXiv preprint arXiv:2109.04640.
  • Wang, L., and Tchetgen Tchetgen, E. (2018), “Bounded, Efficient and Multiply Robust Estimation of Average Treatment Effects Using Instrumental Variables,” Journal of the Royal Statistical Society, Series B, 80, 531–550. DOI: 10.1111/rssb.12262.
  • Wang, L., Zhou, Y., Song, R., and Sherwood, B. (2018), “Quantile-Optimal Treatment Regimes,” Journal of the American Statistical Association, 113, 1243–1254. DOI: 10.1080/01621459.2017.1330204.
  • Watkins, C. J., and Dayan, P. (1992), “Q-learning,” Machine Learning, 8, 279–292. DOI: 10.1007/BF00992698.
  • Wu, Y., Tucker, G., and Nachum, O. (2019), “Behavior Regularized Offline Reinforcement Learning,” arXiv preprint arXiv:1911.11361.
  • Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. (2020), "MOPO: Model-based Offline Policy Optimization," Advances in Neural Information Processing Systems, 33, 14129–14142.
  • Zhao, Y.-Q., Zeng, D., Laber, E. B., and Kosorok, M. R. (2015), “New Statistical Learning Methods for Estimating Optimal Dynamic Treatment Regimes,” Journal of the American Statistical Association, 110, 583–598. DOI: 10.1080/01621459.2014.937488.
