Theory and Methods

Consistent Sparse Deep Learning: Theory and Computation

Pages 1981-1995 | Received 20 Oct 2019, Accepted 20 Feb 2021, Published online: 20 Apr 2021

