A Generalization Gap Estimation for Overparameterized Models via the Langevin Functional Variance

Pages 1287–1295 | Received 31 May 2022, Accepted 27 Mar 2023, Published online: 17 May 2023

References

  • Adlam, B., and Pennington, J. (2020), “The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization,” in Proceedings of the 37th International Conference on Machine Learning, pp. 74–84.
  • Akaike, H. (1974), “A New Look at the Statistical Model Identification,” IEEE Transactions on Automatic Control, 19, 716–723. DOI: 10.1109/TAC.1974.1100705.
  • Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. (2011), “KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework,” Journal of Multiple-Valued Logic & Soft Computing, 17, 255–287.
  • Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. (2019), “On Exact Computation with an Infinitely Wide Neural Net,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 8141–8150.
  • Azulay, A., and Weiss, Y. (2019), “Why Do Deep Convolutional Networks Generalize So Poorly to Small Image Transformations?” Journal of Machine Learning Research, 20, 1–25.
  • Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), “Benign Overfitting in Linear Regression,” Proceedings of the National Academy of Sciences, 117, 30063–30070. DOI: 10.1073/pnas.1907378117.
  • Belkin, M., Hsu, D., and Mitra, P. (2018), “Overfitting or Perfect Fitting? Risk Bounds for Classification and Regression Rules that Interpolate,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 2306–2317.
  • Belkin, M., Hsu, D., and Xu, J. (2020), “Two Models of Double Descent for Weak Features,” SIAM Journal on Mathematics of Data Science, 2, 1167–1180. DOI: 10.1137/20M1336072.
  • Cheng, X., Chatterji, N. S., Abbasi-Yadkori, Y., Bartlett, P. L., and Jordan, M. I. (2018), “Sharp Convergence Rates for Langevin Dynamics in the Nonconvex Setting,” arXiv:1805.01648.
  • Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017), “Understanding Deep Learning Requires Rethinking Generalization,” in Proceedings of the 5th International Conference on Learning Representations.
  • Chizat, L., Oyallon, E., and Bach, F. (2019), “On Lazy Training in Differentiable Programming,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 2937–2947.
  • d’Ascoli, S., Sagun, L., and Biroli, G. (2020), “Triple Descent and the Two Kinds of Overfitting: Where & Why Do They Appear?” in Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 3058–3069.
  • Eaton, M. L. (1989), Group Invariance Applications in Statistics, Regional Conference Series in Probability and Statistics (Vol. 1), Institute of Mathematical Statistics.
  • Gao, T., and Jojic, V. (2016), “Degrees of Freedom in Deep Neural Networks,” in Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, pp. 232–241.
  • Goodfellow, I., Bengio, Y., and Courville, A. (2016), Deep Learning, Cambridge, MA: MIT Press.
  • Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022), “Surprises in High-Dimensional Ridgeless Least Squares Interpolation,” The Annals of Statistics, 50, 949–986. DOI: 10.1214/21-aos2133.
  • Jacot, A., Gabriel, F., and Hongler, C. (2018), “Neural Tangent Kernel: Convergence and Generalization in Neural Networks,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8580–8589.
  • Karakida, R., Akaho, S., and Amari, S.-I. (2019), “Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach,” in Proceedings of the 32nd International Conference on Artificial Intelligence and Statistics, pp. 1032–1041.
  • LeCun, Y., Bengio, Y., and Hinton, G. (2015), “Deep Learning,” Nature, 521, 436–444. DOI: 10.1038/nature14539.
  • Lee, J., Schoenholz, S., Pennington, J., Adlam, B., Xiao, L., Novak, R., and Sohl-Dickstein, J. (2020), “Finite Versus Infinite Neural Networks: An Empirical Study,” in Proceedings of the 34th Conference on Neural Information Processing Systems.
  • Mallows, C. L. (1973), “Some Comments on Cp,” Technometrics, 15, 661–675. DOI: 10.2307/1267380.
  • Mandt, S., Hoffman, M. D., and Blei, D. M. (2017), “Stochastic Gradient Descent as Approximate Bayesian Inference,” Journal of Machine Learning Research, 18, 1–35.
  • Geiger, M., Jacot, A., Spigler, S., Gabriel, F., Sagun, L., d’Ascoli, S., Biroli, G., Hongler, C., and Wyart, M. (2020), “Scaling Description of Generalization with Number of Parameters in Deep Learning,” Journal of Statistical Mechanics: Theory and Experiment, 2020, 023401.
  • Moody, J. E. (1992), “The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems,” in Proceedings of the 4th International Conference on Neural Information Processing Systems, pp. 847–854.
  • Murata, N., Yoshizawa, S., and Amari, S.-I. (1994), “Network Information Criterion–Determining the Number of Hidden Units for an Artificial Neural Network Model,” IEEE Transactions on Neural Networks, 5, 865–872. DOI: 10.1109/72.329683.
  • Nakada, R., and Imaizumi, M. (2021), “Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks,” arXiv:2103.00500.
  • Ramani, S., Blu, T., and Unser, M. (2008), “Monte-Carlo SURE: A Black-Box Optimization of Regularization Parameters for General Denoising Algorithms,” IEEE Transactions on Image Processing, 17, 1540–1554. DOI: 10.1109/TIP.2008.2001404.
  • Risken, H. (1996), Fokker-Planck Equation for Several Variables; Methods of Solution, Berlin, Heidelberg: Springer.
  • Sato, I., and Nakagawa, H. (2014), “Approximation Analysis of Stochastic Gradient Langevin Dynamics by Using Fokker-Planck Equation and Ito Process,” in Proceedings of the 31st International Conference on Machine Learning, pp. 982–990.
  • Shibata, R. (1976), “Selection of the Order of an Autoregressive Model by Akaike’s Information Criterion,” Biometrika, 63, 117–126. DOI: 10.1093/biomet/63.1.117.
  • Shibata, R. (1989), Statistical Aspects of Model Selection, pp. 215–240, New York: Springer.
  • Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), “Bayesian Measures of Model Complexity and Fit,” (with Discussion), Journal of the Royal Statistical Society, Series B, 64, 583–639. DOI: 10.1111/1467-9868.00353.
  • Stein, C. M. (1981), “Estimation of the Mean of a Multivariate Normal Distribution,” The Annals of Statistics, 9, 1135–1151. DOI: 10.1214/aos/1176345632.
  • Takeuchi, K. (1976), “Distribution of Information Statistics and Validity Criteria of Models,” Mathematical Science, 153, 12–18.
  • Thomas, V., Pedregosa, F., van Merriënboer, B., Manzagol, P.-A., Bengio, Y., and Le Roux, N. (2020), “On the Interplay Between Noise and Curvature and its Effect on Optimization and Generalization,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pp. 3503–3513.
  • Watanabe, S. (2010), “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory,” Journal of Machine Learning Research, 11, 3571–3594.
  • Yang, G., and Littwin, E. (2021), “Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics,” in Proceedings of the 38th International Conference on Machine Learning, pp. 11762–11772.
  • Ye, J. (1998), “On Measuring and Correcting the Effects of Data Mining and Model Selection,” Journal of the American Statistical Association, 93, 120–131. DOI: 10.1080/01621459.1998.10474094.
  • Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2021), “Understanding Deep Learning (Still) Requires Rethinking Generalization,” Communications of the ACM, 64, 107–115. DOI: 10.1145/3446776.