Theory and Methods

Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization

Received 09 Aug 2021, Accepted 11 Jul 2023, Published online: 20 Jul 2023

References

  • Abbeel, P., and Ng, A. Y. (2004), “Apprenticeship Learning via Inverse Reinforcement Learning,” in Proceedings of the Twenty-First International Conference on Machine Learning, p. 1. DOI: 10.1145/1015330.1015430.
  • Audibert, J.-Y., and Tsybakov, A. B. (2007), “Fast Learning Rates for Plug-in Classifiers,” The Annals of Statistics, 35, 608–633. DOI: 10.1214/009053606000001217.
  • Bertsekas, D. P., and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming (Vol. 5), Belmont, MA: Athena Scientific.
  • Bhandari, J., Russo, D., and Singal, R. (2018), “A Finite Time Analysis of Temporal Difference Learning with Linear Function Approximation,” arXiv preprint arXiv:1806.02450.
  • Bishop, C. (1994), “Mixture Density Networks,” Technical Report, pp. 1–26.
  • Bradley, R. C. (2005), “Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions,” Probability Surveys, 2, 107–144. DOI: 10.1214/154957805100000104.
  • Chakraborty, B., and Moodie, E. (2013), Statistical Methods for Dynamic Treatment Regimes, New York: Springer.
  • Chen, X., and Qi, Z. (2022), “On Well-posedness and Minimax Optimal Rates of Nonparametric q-function Estimation in Off-policy Evaluation,” in International Conference on Machine Learning, PMLR, pp. 3558–3582.
  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), “Double/Debiased Machine Learning for Treatment and Structural Parameters,” The Econometrics Journal, 21, C1–C68. DOI: 10.1111/ectj.12097.
  • Chernozhukov, V., Chetverikov, D., and Kato, K. (2014), “Gaussian Approximation of Suprema of Empirical Processes,” The Annals of Statistics, 42, 1564–1597. DOI: 10.1214/14-AOS1230.
  • Degris, T., White, M., and Sutton, R. S. (2012), "Off-Policy Actor-Critic," in Proceedings of the 29th International Conference on Machine Learning, pp. 179–186.
  • Ernst, D., Geurts, P., and Wehenkel, L. (2005), "Tree-based Batch Mode Reinforcement Learning," Journal of Machine Learning Research, 6, 503–556.
  • Ertefaie, A., and Strawderman, R. L. (2018), “Constructing Dynamic Treatment Regimes over Indefinite Time Horizons,” Biometrika, 105, 963–977. DOI: 10.1093/biomet/asy043.
  • Fan, J., Wang, Z., Xie, Y., and Yang, Z. (2020), "A Theoretical Analysis of Deep q-Learning," in Learning for Dynamics and Control, PMLR, pp. 486–489.
  • Farahmand, A.-m., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. (2016), “Regularized Policy Iteration with Nonparametric Function Spaces,” The Journal of Machine Learning Research, 17, 4809–4874.
  • Feng, Y., Ren, T., Tang, Z., and Liu, Q. (2020), “Accountable Off-Policy Evaluation with Kernel Bellman Statistics,” in International Conference on Machine Learning, PMLR, pp. 3102–3111.
  • Harvey, N., Liaw, C., and Mehrabian, A. (2017), "Nearly-Tight VC-Dimension Bounds for Piecewise Linear Neural Networks," in Conference on Learning Theory, PMLR, pp. 1064–1068.
  • Hu, X., Qian, M., Cheng, B., and Cheung, Y. K. (2021), “Personalized Policy Learning using Longitudinal Mobile Health Data,” Journal of the American Statistical Association, 116, 410–420. DOI: 10.1080/01621459.2020.1785476.
  • Hubbs, C. D., Perez, H. D., Sarwar, O., Sahinidis, N. V., Grossmann, I. E., and Wassick, J. M. (2020), "OR-Gym: A Reinforcement Learning Library for Operations Research Problems," arXiv preprint arXiv:2008.06319.
  • Hunter, D. R., and Lange, K. (2004), "A Tutorial on MM Algorithms," The American Statistician, 58, 30–37. DOI: 10.1198/0003130042836.
  • Jiang, Z., Yang, S., and Ding, P. (2020), “Multiply Robust Estimation of Causal Effects Under Principal Ignorability,” arXiv preprint arXiv:2012.01615.
  • Kakade, S., and Langford, J. (2002), “Approximately Optimal Approximate Reinforcement Learning,” in ICML (Vol. 2), pp. 267–274.
  • Kallus, N., and Uehara, M. (2019), “Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes,” arXiv preprint arXiv:1909.05850.
  • Kallus, N., and Uehara, M. (2020), "Statistically Efficient Off-Policy Policy Gradients," in International Conference on Machine Learning, PMLR, pp. 5089–5100.
  • Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., and Faisal, A. A. (2018), “The Artificial Intelligence Clinician Learns Optimal Treatment Strategies for Sepsis in Intensive Care,” Nature Medicine, 24, 1716–1720. DOI: 10.1038/s41591-018-0213-5.
  • Kosorok, M. R., and Laber, E. B. (2019), “Precision Medicine,” Annual Review of Statistics and its Application, 6, 263–286. DOI: 10.1146/annurev-statistics-030718-105251.
  • Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020), “Conservative q-learning for Offline Reinforcement Learning,” arXiv preprint arXiv:2006.04779.
  • Laber, E. B., Lizotte, D. J., Qian, M., Pelham, W. E., and Murphy, S. A. (2014), “Dynamic Treatment Regimes: Technical Challenges and Applications,” Electronic Journal of Statistics, 8, 1225–1272. DOI: 10.1214/14-ejs920.
  • Le, H., Voloshin, C., and Yue, Y. (2019), “Batch Policy Learning Under Constraints,” in International Conference on Machine Learning, pp. 3703–3712.
  • LeCun, Y., Bengio, Y., and Hinton, G. (2015), “Deep Learning,” Nature, 521, 436–444. DOI: 10.1038/nature14539.
  • Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020), “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems,” arXiv preprint arXiv:2005.01643.
  • Li, Y. (2017), “Deep Reinforcement Learning: An Overview,” arXiv preprint arXiv:1701.07274.
  • Liao, P., Klasnja, P., and Murphy, S. (2019), “Off-Policy Estimation of Long-Term Average Outcomes with Applications to Mobile Health,” arXiv preprint arXiv:1912.13088.
  • Liao, P., Qi, Z., and Murphy, S. (2020), “Batch Policy Learning in Average Reward Markov Decision Processes,” arXiv preprint arXiv:2007.11771.
  • Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018), “Breaking the Curse of Horizon: Infinite-Horizon Off-policy Estimation,” in Advances in Neural Information Processing Systems, pp. 5356–5366.
  • Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E., and Kosorok, M. R. (2020), “Estimating Dynamic Treatment Regimes in Mobile Health Using v-learning,” Journal of the American Statistical Association, 115, 692–706. DOI: 10.1080/01621459.2018.1537919.
  • Luedtke, A. R., and Van Der Laan, M. J. (2016), “Statistical Inference for the Mean Outcome Under a Possibly Non-unique Optimal Treatment Strategy,” Annals of Statistics, 44, 713–742.
  • Marcolino, M. S., Oliveira, J. A. Q., D’Agostino, M., Ribeiro, A. L., Alkmim, M. B. M., and Novillo-Ortiz, D. (2018), "The Impact of mHealth Interventions: Systematic Review of Systematic Reviews," JMIR mHealth and uHealth, 6, e23. DOI: 10.2196/mhealth.8873.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G. et al. (2015), “Human-Level Control through Deep Reinforcement Learning,” Nature, 518, 529–533. DOI: 10.1038/nature14236.
  • Murphy, S. A. (2003), “Optimal Dynamic Treatment Regimes,” Journal of the Royal Statistical Society, Series B, 65, 331–355. DOI: 10.1111/1467-9868.00389.
  • Puterman, M. L. (1994), Markov Decision Processes: Discrete Stochastic Dynamic Programming, Hoboken, NJ: Wiley.
  • Qian, M., and Murphy, S. A. (2011), “Performance Guarantees for Individualized Treatment Rules,” Annals of Statistics, 39, 1180–1210.
  • Rust, J. (1987), “Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher,” Econometrica: Journal of the Econometric Society, 55, 999–1033. DOI: 10.2307/1911259.
  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015), "Trust Region Policy Optimization," in International Conference on Machine Learning, pp. 1889–1897.
  • Shi, C., Fan, A., Song, R., and Lu, W. (2018), “High-Dimensional a-learning for Optimal Dynamic Treatment Regimes,” Annals of Statistics, 46, 925. DOI: 10.1214/17-AOS1570.
  • Shi, C., Lu, W., and Song, R. (2020), “Breaking the Curse of Nonregularity with Subagging: Inference of the Mean Outcome Under Optimal Treatment Regimes,” Journal of Machine Learning Research, accepted.
  • Shi, C., Wan, R., Chernozhukov, V., and Song, R. (2021), “Deeply-Debiased Off-Policy Interval Estimation,” in Proceedings of the 38th International Conference on Machine Learning, Vol. 139 of Proceedings of Machine Learning Research, PMLR, pp. 9580–9591.
  • Shi, C., Wan, R., Song, R., Lu, W., and Leng, L. (2020a), “Does the Markov Decision Process Fit the Data: Testing for the Markov Property in Sequential Decision Making,” in International Conference on Machine Learning, PMLR, pp. 8807–8817.
  • Shi, C., Zhang, S., Lu, W., and Song, R. (2020b), "Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings," arXiv preprint arXiv:2001.04515.
  • Shi, X., Miao, W., Nelson, J. C., and Tchetgen Tchetgen, E. J. (2020c), “Multiply Robust Causal Inference with Double-Negative Control Adjustment for Categorical Unmeasured Confounding,” Journal of the Royal Statistical Society, Series B, 82, 521–540. DOI: 10.1111/rssb.12361.
  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016), “Mastering the Game of Go with Deep Neural Networks and Tree Search,” Nature, 529, 484–489. DOI: 10.1038/nature16961.
  • Sutton, R. S., and Barto, A. G. (2018), Reinforcement Learning: An Introduction, Cambridge, MA: MIT Press.
  • Tchetgen Tchetgen, E. J., and Shpitser, I. (2012), “Semiparametric Theory for Causal Mediation Analysis: Efficiency Bounds, Multiple Robustness, and Sensitivity Analysis,” Annals of Statistics, 40, 1816.
  • Tsiatis, A. A., Davidian, M., Holloway, S. T., and Laber, E. B. (2019), Dynamic Treatment Regimes: Statistical Methods for Precision Medicine, Boca Raton, FL: CRC Press.
  • Tsybakov, A. B. (2004), “Optimal Aggregation of Classifiers in Statistical Learning,” The Annals of Statistics, 32, 135–166. DOI: 10.1214/aos/1079120131.
  • Wang, J., Qi, Z., and Wong, R. K. (2021), "Projected State-Action Balancing Weights for Offline Reinforcement Learning," arXiv preprint arXiv:2109.04640.
  • Wang, L., and Tchetgen Tchetgen, E. (2018), “Bounded, Efficient and Multiply Robust Estimation of Average Treatment Effects Using Instrumental Variables,” Journal of the Royal Statistical Society, Series B, 80, 531–550. DOI: 10.1111/rssb.12262.
  • Wang, L., Zhou, Y., Song, R., and Sherwood, B. (2018), “Quantile-Optimal Treatment Regimes,” Journal of the American Statistical Association, 113, 1243–1254. DOI: 10.1080/01621459.2017.1330204.
  • Watkins, C. J., and Dayan, P. (1992), “Q-learning,” Machine Learning, 8, 279–292. DOI: 10.1007/BF00992698.
  • Wu, Y., Tucker, G., and Nachum, O. (2019), “Behavior Regularized Offline Reinforcement Learning,” arXiv preprint arXiv:1911.11361.
  • Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. (2020), "MOPO: Model-based Offline Policy Optimization," Advances in Neural Information Processing Systems, 33, 14129–14142.
  • Zhao, Y.-Q., Zeng, D., Laber, E. B., and Kosorok, M. R. (2015), “New Statistical Learning Methods for Estimating Optimal Dynamic Treatment Regimes,” Journal of the American Statistical Association, 110, 583–598. DOI: 10.1080/01621459.2014.937488.
