References
- Stein BE, Meredith MA. The merging of the senses. Cambridge (MA): The MIT Press; 1993.
- Baltrušaitis T, Ahuja C, Morency L-P. Multimodal machine learning: a survey and taxonomy. IEEE Trans Pattern Anal Mach Intell. 2018;41(2):423–443.
- Noda K, Arie H, Suga Y, et al. Multimodal integration learning of robot behavior using deep neural networks. Rob Auton Syst. 2014;62(6):721–736.
- Ngiam J, Khosla A, Kim M, et al. Multimodal deep learning. International Conference on Machine Learning; 2011; Bellevue, WA.
- Lahat D, Adali T, Jutten C. Multimodal data fusion: an overview of methods, challenges, and prospects. Proc IEEE. 2015;103(9):1449–1477.
- Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016;3(1):9.
- Srivastava N, Salakhutdinov R. Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems; 2012; Nevada, USA.
- Blei DM, Jordan MI. Modeling annotated data. The 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2003; Toronto, Canada.
- Nakamura T, Nagai T, Iwahashi N. Grounding of word meanings in multimodal concepts using LDA. IEEE/RSJ International Conference on Intelligent Robots and Systems; 2009; Missouri, USA.
- Kingma DP, Welling M. Auto-encoding variational Bayes. International Conference on Learning Representations; 2014; Banff, Canada.
- Goodfellow IJ, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks. Advances in Neural Information Processing Systems; 2014; Montreal, Canada.
- van den Oord A, Dieleman S, Zen H, et al. WaveNet: a generative model for raw audio; 2016. Preprint arXiv:1609.03499.
- van den Oord A, Kalchbrenner N, Kavukcuoglu K. Pixel recurrent neural networks. International Conference on Machine Learning; 2016; New York, USA.
- Rezende D, Mohamed S. Variational inference with normalizing flows. International Conference on Machine Learning; 2015; Lille, France.
- Kingma DP, Dhariwal P. Glow: generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems; 2018; Montreal, Canada.
- Gershman S, Goodman N. Amortized inference in probabilistic reasoning. The annual meeting of the Cognitive Science Society; 2014; Quebec City, Canada.
- Atrey PK, Hossain MA, El Saddik A, et al. Multimodal fusion for multimedia analysis: a survey. Multimedia Syst. 2010;16(6):345–379.
- Xu C, Tao D, Xu C. A survey on multi-view learning; 2013. Preprint arXiv:1304.5634.
- Sun S. A survey of multi-view machine learning. Neural Comput Appl. 2013;23(7):2031–2038.
- Wang K, Yin Q, Wang W, et al. A comprehensive survey on cross-modal retrieval; 2016. Preprint arXiv:1607.06215.
- Gao J, Li P, Chen Z, et al. A survey on deep learning for multimodal data fusion. Neural Comput. 2020;32(5):829–864.
- Zhang C, Yang Z, He X, et al. Multimodal intelligence: representation learning, information fusion, and applications. IEEE J Sel Top Signal Process. 2020;14(3):478–493.
- Li Y, Yang M, Zhang Z. A survey of multi-view representation learning. IEEE Trans Knowl Data Eng. 2018;31(10):1863–1883.
- Tian Y, Krishnan D, Isola P. Contrastive multiview coding; 2019. Preprint arXiv:1906.05849.
- Alayrac J-B, Recasens A, Schneider R, et al. Self-supervised multimodal versatile networks; 2020. Preprint arXiv:2006.16228.
- Tsai Y-HH, Wu Y, Salakhutdinov R, et al. Self-supervised learning from a multi-view perspective. International Conference on Learning Representations; 2021; Vienna, Austria.
- Sohn K, Lee H, Yan X. Learning structured output representation using deep conditional generative models. Adv Neural Inf Process Syst. 2015;28:3483–3491.
- Mirza M, Osindero S. Conditional generative adversarial nets; 2014. Preprint arXiv:1411.1784.
- Ivanovic B, Leung K, Schmerling E, et al. Multimodal deep generative models for trajectory prediction: a conditional variational autoencoder approach. IEEE Robot Autom Lett. 2020;6(2):295–302.
- Shi X, Liu Q, Fan W, et al. Transfer learning on heterogenous feature spaces via spectral transformation. IEEE international conference on data mining; 2010; Sydney, Australia.
- Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2009;22(10):1345–1359.
- Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35(8):1798–1828.
- Tschannen M, Bachem O, Lucic M. Recent advances in autoencoder-based representation learning; 2018. Preprint arXiv:1812.05069.
- Holyoak KJ. Parallel distributed processing: explorations in the microstructure of cognition. Science. 1987;236:992–997.
- Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–507.
- Xing EP, Yan R, Hauptmann AG. Mining associated text and images with dual-wing harmoniums; 2012. Preprint arXiv:1207.1423.
- Srivastava N, Salakhutdinov R. Learning representations for multimodal data with deep belief nets. In: International Conference on Machine Learning Workshop, Vol. 79; 2012. p. 3.
- Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006;18(7):1527–1554.
- Hinton GE. Deep belief networks. Scholarpedia. 2009;4(5):5947.
- Kim Y, Lee H, Provost EM. Deep learning for robust feature generation in audiovisual emotion recognition. IEEE International Conference on Acoustics, Speech and Signal Processing; 2013; Vancouver, Canada.
- Huang J, Kingsbury B. Audio-visual deep learning for noise robust speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing; 2013; Vancouver, Canada.
- Salakhutdinov R, Hinton G. Deep Boltzmann machines. The International Conference on Artificial Intelligence and Statistics; 2009; Florida, USA.
- Sohn K, Shang W, Lee H. Improved multimodal deep learning with variation of information. Adv Neural Inf Process Syst. 2014;27:2141–2149.
- Ouyang W, Chu X, Wang X. Multi-source deep learning for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2014. p. 2329–2336.
- Suk H-I, Lee S-W, Shen D, et al. Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. NeuroImage. 2014;101:569–582.
- Pang L, Zhu S, Ngo C-W. Deep multimodal learning for affective analysis and retrieval. IEEE Trans Multimedia. 2015;17(11):2008–2020.
- Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Machine Learning Res. 2003;3:993–1022.
- Putthividhy D, Attias HT, Nagarajan SS. Topic regression multi-modal latent Dirichlet allocation for image annotation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2010; San Francisco, CA.
- Nakamura T, Nagai T, Iwahashi N. Bag of multimodal LDA models for concept formation. IEEE International Conference on Robotics and Automation; 2011; Shanghai, China.
- Araki T, Nakamura T, Nagai T, et al. Online learning of concepts and words using multimodal LDA and hierarchical Pitman-Yor language model. IEEE/RSJ International Conference on Intelligent Robots and Systems; 2012; Vilamoura, Portugal.
- Zheng Y, Zhang Y-J, Larochelle H. Topic modeling of multimodal data: an autoregressive approach. IEEE Conference on Computer Vision and Pattern Recognition; 2014; Columbus, OH.
- Nakamura T, Nagai T, Funakoshi K, et al. Mutual learning of an object concept and language model based on MLDA and NPYLM. IEEE/RSJ International Conference on Intelligent Robots and Systems; 2014; Chicago, IL.
- Higgins I, Matthey L, Pal A, et al. beta-VAE: learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations; 2017; Toulon, France.
- Vincent P, Larochelle H, Lajoie I, et al. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Machine Learning Res. 2010;11(12):3371–3408.
- Jaques N, Taylor S, Sano A, et al. Multimodal autoencoder: A deep learning approach to filling in missing sensor data and enabling better mood prediction. International Conference on Affective Computing and Intelligent Interaction; 2017; San Antonio, TX.
- Zhang H, Xu T, Li H, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. IEEE International Conference on Computer Vision; 2017; Venice, Italy.
- Zhu J-Y, Park T, Isola P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. IEEE International Conference on Computer Vision; 2017; Venice, Italy.
- Isola P, Zhu J-Y, Zhou T, et al. Image-to-image translation with conditional adversarial networks. IEEE Conference on Computer Vision and Pattern Recognition; 2017; Honolulu, HI.
- Liu M-Y, Breuel T, Kautz J. Unsupervised image-to-image translation networks; 2017. Preprint arXiv:1703.00848.
- Lee H-Y, Tseng H-Y, Huang J-B, et al. Diverse image-to-image translation via disentangled representations. European Conference on Computer Vision; 2018; Munich, Germany.
- Wu M, Goodman N. Multimodal generative models for compositional representation learning; 2019. Preprint arXiv:1912.05075.
- Suzuki M, Nakayama K, Matsuo Y. Improving bi-directional generation between different modalities with variational autoencoders; 2018. Preprint arXiv:1801.08702.
- Daunhawer I, Sutter TM, Marcinkevičs R, et al. Self-supervised disentanglement of modality-specific and shared factors improves multimodal generative models. In: Pattern Recognition (DAGM GCPR). Lecture Notes in Computer Science, Vol. 12544; 2021. p. 459.
- Sutter TM, Daunhawer I, Vogt JE. Multimodal generative learning utilizing Jensen-Shannon-divergence; 2020. Preprint arXiv:2006.08242.
- Wu M, Goodman N. Multimodal generative models for scalable weakly-supervised learning. Advances in Neural Information Processing Systems; 2018; Montreal, Canada.
- Suzuki M, Nakayama K, Matsuo Y. Joint multimodal learning with deep generative models; 2016. Preprint arXiv:1611.01891.
- Vedantam R, Fischer I, Huang J, et al. Generative models of visually grounded imagination; 2017. Preprint arXiv:1705.10762.
- Higgins I, Sonnerat N, Matthey L, et al. Scan: learning hierarchical compositional visual concepts; 2017. Preprint arXiv:1707.03389.
- Yin H, Melo F, Billard A, et al. Associate latent encodings in learning from demonstrations. Association for the Advancement of Artificial Intelligence; 2017; San Francisco, CA.
- Tian Y, Engel J. Latent translation: crossing modalities by bridging generative models; 2019. Preprint arXiv:1902.08261.
- Wang W, Yan X, Lee H, et al. Deep variational canonical correlation analysis; 2016. Preprint arXiv:1610.03454.
- Schonfeld E, Ebrahimi S, Sinha S, et al. Generalized zero-and few-shot learning via aligned variational autoencoders. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019. p. 8247–8255.
- Jo DU, Lee B, Choi J, et al. Cross-modal variational auto-encoder with distributed latent spaces and associators; 2019. Preprint arXiv:1905.12867.
- Korthals T, Rudolph D, Leitner J, et al. Multi-modal generative models for learning epistemic active sensing. International Conference on Robotics and Automation; 2019; Montreal, Canada.
- Shi Y, Siddharth N, Paige B, et al. Variational mixture-of-experts autoencoders for multi-modal deep generative models. Advances in Neural Information Processing Systems; 2019; Vancouver, Canada.
- Yadav R, Sardana A, Namboodiri V. Bridged variational autoencoders for joint modeling of images and attributes. IEEE/CVF Winter Conference on Applications of Computer Vision; 2020; Snowmass, CO.
- Sutter TM, Daunhawer I, Vogt JE. Generalized multimodal ELBO. International Conference on Learning Representations; 2021; Vienna, Austria.
- Hsu W-N, Glass J. Disentangling by partitioning: a representation learning framework for multimodal sensory data; 2018. Preprint arXiv:1805.11264.
- Lee M, Pavlovic V. Private-shared disentangled multimodal VAE for learning of hybrid latent representations; 2020. Preprint arXiv:2012.13024.
- Tsai Y-HH, Liang PP, Zadeh A, et al. Learning factorized multimodal representations; 2018. Preprint arXiv:1806.06176.
- Shi Y, Paige B, Torr P, et al. Relating by contrasting: a data-efficient framework for multimodal generative models. International Conference on Learning Representations; 2021; Vienna, Austria.
- Bonneel N, Rabin J, Peyré G, et al. Sliced and Radon Wasserstein barycenters of measures. J Math Imaging Vis. 2015;51(1):22–45.
- Hotelling H. Relations between two sets of variates. Biometrika. 1936;28:321–377.
- Xian Y, Schiele B, Akata Z. Zero-shot learning: the good, the bad and the ugly. IEEE Conference on Computer Vision and Pattern Recognition; 2017; Honolulu, HI.
- Pourpanah F, Abdar M, Luo Y, et al. A review of generalized zero-shot learning methods; 2020. Preprint arXiv:2011.08641.
- Jo DU, Lee B, Choi J, et al. Associative variational auto-encoder with distributed latent spaces and associators. AAAI Conference on Artificial Intelligence; 2020; New York, USA.
- Bliss TVP, Collingridge GL. A synaptic model of memory: long-term potentiation in the hippocampus. Nature. 1993;361(6407):31–39.
- Wang D, Cui P, Ou M, et al. Deep multimodal hashing with orthogonal regularization. International Joint Conference on Artificial Intelligence; 2015; Buenos Aires, Argentina.
- Hinton GE. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002;14(8):1771–1800.
- Burda Y, Grosse R, Salakhutdinov R. Importance weighted autoencoders; 2015. Preprint arXiv:1509.00519.
- Bouchacourt D, Tomioka R, Nowozin S. Multi-level variational autoencoder: learning disentangled representations from grouped observations. AAAI Conference on Artificial Intelligence; 2018; New Orleans, USA.
- Maddison CJ, Mnih A, Teh YW. The concrete distribution: a continuous relaxation of discrete random variables. International Conference on Learning Representations; 2017; Toulon, France.
- Jang E, Gu S, Poole B. Categorical reparameterization with Gumbel-Softmax. International Conference on Learning Representations; 2017; Toulon, France.
- van den Oord A, Li Y, Vinyals O. Representation learning with contrastive predictive coding; 2018. Preprint arXiv:1807.03748.
- Sugiyama M, Suzuki T, Kanamori T. Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Ann Inst Stat Math. 2012;64(5):1009–1044.
- Tolstikhin I, Bousquet O, Gelly S, et al. Wasserstein auto-encoders. International Conference on Learning Representations; 2018; Vancouver, BC.
- Dieng AB, Tran D, Ranganath R, et al. Variational inference via χ-upper bound minimization. Advances in Neural Information Processing Systems; 2017; Long Beach, CA.
- LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–2324.
- Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms; 2017. Preprint arXiv:1708.07747.
- Netzer Y, Wang T, Coates A, et al. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning; 2011; Granada, Spain.
- Liu Z, Luo P, Wang X, et al. Deep learning face attributes in the wild. International Conference on Computer Vision; 2015; Santiago, Chile.
- Welinder P, Branson S, Mita T, et al. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001. California Institute of Technology; 2010.
- Wah C, Branson S, Welinder P, et al. The Caltech-UCSD Birds-200-2011 dataset. Technical Report; 2011.
- Zimmermann C, Brox T. Learning to estimate 3D hand pose from single RGB images. IEEE International Conference on Computer Vision; 2017; Venice, Italy.
- Zadeh A, Zellers R, Pincus E, et al. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intell Syst. 2016;31(6):82–88.
- Sadeghi M, Alameda-Pineda X. Robust unsupervised audio-visual speech enhancement using a mixture of variational autoencoders. IEEE International Conference on Acoustics, Speech and Signal Processing; 2020; Barcelona, Spain.
- Sadeghi M, Leglaive S, Alameda-Pineda X, et al. Audio-visual speech enhancement using conditional variational auto-encoders. IEEE/ACM Trans Audio Speech Lang Process. 2020;28:1788–1800.
- Bloesch M, Czarnowski J, Clark R, et al. CodeSLAM—learning a compact, optimisable representation for dense visual SLAM. IEEE Conference on Computer Vision and Pattern Recognition; 2018; Salt Lake City, UT.
- Zambelli M, Cully A, Demiris Y. Multimodal representation models for prediction and control from partial information. Rob Auton Syst. 2020;123:103312.
- Metta G, Sandini G, Vernon D, et al. The iCub humanoid robot: an open platform for research in embodied cognition. Workshop on Performance Metrics for Intelligent Systems; 2008; Gaithersburg, MD.
- Park D, Hoshi Y, Kemp CC. A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder. IEEE Robot Autom Lett. 2018;3(3):1544–1551.
- Meo C, Lanillos P. Multimodal VAE active inference controller; 2021. Preprint arXiv:2103.04412.
- Nakamura T, Nagai T, Taniguchi T. Serket: an architecture for connecting stochastic models to realize a large-scale cognitive model. Front Neurorobot. 2018;12:25.
- Taniguchi T, Nakamura T, Suzuki M, et al. Neuro-SERKET: development of integrative cognitive system through the composition of deep probabilistic generative models. New Generation Comput. 2020;38:23–48.
- Baars BJ. A cognitive theory of consciousness. Cambridge: Cambridge University Press; 1988.
- Goyal A, Didolkar A, Lamb A, et al. Coordination among neural modules through a shared global workspace; 2021. Preprint arXiv:2103.01197.
- Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems; 2017; Long Beach, CA.
- Goyal A, Lamb A, Hoffmann J, et al. Recurrent independent mechanisms. International Conference on Learning Representations; 2021; Vienna, Austria.
- Ha D, Schmidhuber J. Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems; 2018; Montreal, Canada.
- Hafner D, Lillicrap T, Fischer I, et al. Learning latent dynamics for planning from pixels. International Conference on Machine Learning; 2019; Long Beach, CA.
- Hafner D, Lillicrap T, Ba J, et al. Dream to control: learning behaviors by latent imagination. International Conference on Learning Representations; 2020; Addis Ababa, Ethiopia.
- Taniguchi T, Yamakawa H, Nagai T, et al. Whole brain probabilistic generative model toward realizing cognitive architecture for developmental robots; 2021. Preprint arXiv:2103.08183.
- Thrun S, Mitchell TM. Lifelong robot learning. Rob Auton Syst. 1995;15(1–2):25–46.
- Lesort T, Lomonaco V, Stoian A, et al. Continual learning for robotics: definition, framework, learning strategies, opportunities and challenges. Inform Fusion. 2020;58:52–68.
- Bingham E, Chen JP, Jankowiak M, et al. Pyro: deep universal probabilistic programming; 2018. Preprint arXiv:1810.09538.
- Suzuki M, Kaneko T, Matsuo Y. Pixyz: a library for developing deep generative models; 2021. Preprint arXiv:2107.13109.