Survey paper

A survey of multimodal deep generative models

Masahiro Suzuki & Yutaka Matsuo
Pages 261-278 | Received 17 May 2021, Accepted 21 Nov 2021, Published online: 21 Feb 2022

