Survey paper

A survey of multimodal deep generative models

Masahiro Suzuki & Yutaka Matsuo
Pages 261-278 | Received 17 May 2021, Accepted 21 Nov 2021, Published online: 21 Feb 2022
 

Abstract

Multimodal learning is a framework for building models that make predictions based on different types of modalities. Important challenges in multimodal learning are the inference of shared representations from arbitrary modalities and cross-modal generation via these representations; achieving this, however, requires taking the heterogeneous nature of multimodal data into account. In recent years, deep generative models, i.e. generative models whose distributions are parameterized by deep neural networks, have attracted much attention; variational autoencoders in particular are well suited to these challenges because they can account for heterogeneity and infer good representations of the data. Consequently, various multimodal generative models based on variational autoencoders, called multimodal deep generative models, have been proposed in recent years. In this paper, we provide a categorized survey of studies on multimodal deep generative models.
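To make the setting concrete, the following is a minimal sketch of a VAE-based multimodal model: two modalities are encoded into a shared latent representation, from which either modality can be generated. This is our illustration, not code from the paper; all class names and layer sizes are assumptions.

    import torch
    import torch.nn as nn

    class JointMultimodalVAE(nn.Module):
        """Two modalities x1, x2 share one latent z (illustrative sizes)."""
        def __init__(self, dim_x1=784, dim_x2=10, dim_z=20):
            super().__init__()
            # Joint inference network q(z | x1, x2): outputs mean and log-variance.
            self.enc = nn.Sequential(nn.Linear(dim_x1 + dim_x2, 256), nn.ReLU())
            self.mu = nn.Linear(256, dim_z)
            self.logvar = nn.Linear(256, dim_z)
            # One decoder per modality: p(x1 | z) and p(x2 | z).
            self.dec1 = nn.Sequential(nn.Linear(dim_z, 256), nn.ReLU(), nn.Linear(256, dim_x1))
            self.dec2 = nn.Sequential(nn.Linear(dim_z, 256), nn.ReLU(), nn.Linear(256, dim_x2))

        def forward(self, x1, x2):
            h = self.enc(torch.cat([x1, x2], dim=-1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
            return self.dec1(z), self.dec2(z), mu, logvar

Cross-modal generation then amounts to inferring z from one modality and decoding the other; the surveyed models differ mainly in how the inference distribution over the shared z is constructed.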

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 X∖k means a subset of multimodal data different from Xk, i.e. the modalities other than the k-th.

2 In the original text, a shared representation is called a joint representation; we have changed this terminology to match that used in this paper. Strictly speaking, there is a subtle difference between a shared representation and a joint representation: the former refers to a space shared by different modalities, while the latter refers to the shared representation of a joint model (see Section 5), i.e. a space in which different modalities are fused.

3 In this paper, we use the term ‘encoder’ to refer to any form of mapping from an input space to a latent space, whether deterministic or probabilistic, and ‘inference distribution’ to refer specifically to the conditional distribution qϕ(z|x).
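The distinction can be made concrete with a short sketch (our illustration, not the paper's code; names and sizes are assumptions): both classes below are 'encoders' in the broad sense, but only the second defines an inference distribution qϕ(z|x).

    import torch
    import torch.nn as nn

    class DeterministicEncoder(nn.Module):
        # An 'encoder' in the broad sense: maps x to a single point z = f(x).
        def __init__(self, dim_x=784, dim_z=20):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(dim_x, 128), nn.ReLU(), nn.Linear(128, dim_z))
        def forward(self, x):
            return self.f(x)

    class ProbabilisticEncoder(nn.Module):
        # An encoder whose outputs parameterize the inference distribution q_phi(z|x),
        # here a diagonal Gaussian N(mu(x), diag(sigma(x)^2)).
        def __init__(self, dim_x=784, dim_z=20):
            super().__init__()
            self.h = nn.Sequential(nn.Linear(dim_x, 128), nn.ReLU())
            self.mu = nn.Linear(128, dim_z)
            self.logvar = nn.Linear(128, dim_z)
        def forward(self, x):
            h = self.h(x)
            return torch.distributions.Normal(self.mu(h), torch.exp(0.5 * self.logvar(h)))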

4 Note that two inference distributions are the same whenever the outputs of their encoder networks (i.e. the parameters of the inference distributions) are the same; the trainable weights of the networks themselves need not all be identical.
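As a small illustration of this point (our example, not from the paper): permuting the hidden units of an MLP leaves its input-output mapping unchanged, so two networks with different weight values can define exactly the same inference distribution.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    net_a = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    net_b = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

    # Copy net_a's weights into net_b with the hidden units permuted:
    # the weight values differ, but the computed function is identical.
    perm = torch.randperm(8)
    with torch.no_grad():
        net_b[0].weight.copy_(net_a[0].weight[perm])
        net_b[0].bias.copy_(net_a[0].bias[perm])
        net_b[2].weight.copy_(net_a[2].weight[:, perm])
        net_b[2].bias.copy_(net_a[2].bias)

    x = torch.randn(3, 4)
    print(torch.allclose(net_a(x), net_b(x), atol=1e-6))  # True: same outputs, different weights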

5 The 2-Wasserstein distance is the p-Wasserstein distance of order p = 2.
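For reference, the p-Wasserstein distance between distributions μ and ν (a standard definition, stated here for convenience rather than quoted from the paper) is

    W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma} \big[ \lVert x - y \rVert^p \big] \right)^{1/p},

where Γ(μ,ν) denotes the set of all couplings of μ and ν; setting p = 2 yields the 2-Wasserstein distance.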

6 In the original paper [Citation66], JVAE and JMVAE are referred to as JMVAE and JMVAE-kl, respectively. However, because some papers refer to JMVAE-kl as JMVAE [Citation65,Citation67], we follow them to avoid confusion in terminology.

7 Recall that deep generative models are generative models whose distributions are parameterized by deep neural networks; if these networks are large, implementing such models can therefore be much more complex than implementing conventional generative models.

Additional information

Funding

This paper is based on results obtained from a project, JPNP16007, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

Notes on contributors

Masahiro Suzuki

Masahiro Suzuki is a project assistant professor in the Graduate School of Engineering at the University of Tokyo. He was formerly a project researcher at the University of Tokyo from 2018 to 2020. He received his PhD from the University of Tokyo in 2018 and his MS degree from Hokkaido University in 2015. His research interests are in deep generative models, multimodal learning, and transfer learning.

Yutaka Matsuo

Yutaka Matsuo is a professor at the Graduate School of Engineering, the University of Tokyo. He received his BS, MS, and PhD degrees from the University of Tokyo in 1997, 1999, and 2002, respectively. After working at the National Institute of Advanced Industrial Science and Technology (AIST) and Stanford University, he joined the faculty of the University of Tokyo in 2007. At the Japan Society for Artificial Intelligence (JSAI), he served as editor-in-chief from 2012 to 2014 and as chair of the ELSI committee from 2014 to 2018. He is the president of the Japanese Deep Learning Association (JDLA) and a member of the board of directors at SoftBank Group Corp. He works on artificial intelligence, especially deep learning and web mining.
