Abstract
The kernel Bayes’ rule has been proposed as a nonparametric kernel-based method for realizing Bayesian inference in reproducing kernel Hilbert spaces. However, we demonstrate both theoretically and experimentally that the way the prior is incorporated in the kernel Bayes’ rule is unnatural. In particular, we show that, under some reasonable conditions, the posterior in the kernel Bayes’ rule is completely unaffected by the prior, which seems inappropriate in the context of Bayesian inference. We consider this phenomenon to be due in part to the fact that the assumptions of the kernel Bayes’ rule do not hold in general.
Public interest statement
This paper examines the validity of the kernel Bayes’ rule, a recently proposed nonparametric framework for Bayesian inference. The researchers behind the kernel Bayes’ rule aim to apply this method to a wide range of Bayesian inference problems. However, as we demonstrate in this paper, the way the prior is incorporated in the kernel Bayes’ rule seems problematic in the context of Bayesian inference. Several theorems on the kernel Bayes’ rule rely on a strong assumption which does not hold in general.
The problems with the kernel Bayes’ rule seem to be nontrivial and difficult, and we currently see no way to solve them. We hope that this study will trigger a reexamination and correction of the basic framework of the kernel Bayes’ rule.
1. Introduction
The kernel Bayes’ rule has recently emerged as a novel framework for Bayesian inference (Fukumizu, Song, & Gretton, Citation2013; Song, Fukumizu, & Gretton, Citation2014; Song, Huang, Smola, & Fukumizu, Citation2009). In this framework, we can estimate the kernel mean of the posterior distribution, given kernel mean expressions of the prior and likelihood distributions. Since the distributions are mapped into, and nonparametrically manipulated in, infinite-dimensional feature spaces called reproducing kernel Hilbert spaces (RKHS), the kernel Bayes’ rule is expected to accurately evaluate the statistical features of high-dimensional data and to enable Bayesian inference even when no appropriate parametric model is available. To date, several applications of the kernel Bayes’ rule have been reported (Fukumizu et al., Citation2013; Kanagawa et al., Citation2014). However, the basic theory and the algorithm of the kernel Bayes’ rule might need to be modified for the following reasons:
(1) The posterior in the kernel Bayes’ rule is in some cases completely unaffected by the prior.
(2) The posterior in the kernel Bayes’ rule depends considerably upon the choice of the parameters used to regularize covariance operators.
(3) It does not hold in general that conditional expectation functions are included in the RKHS, which is an essential assumption of the kernel Bayes’ rule.
2. Kernel Bayes’ rule
In this section, we briefly review the kernel Bayes’ rule following Fukumizu et al. (Citation2013). Let and
be measurable spaces,
be a random variable with an observed distribution P on
, U be a random variable with the prior distribution
on
, and
be a random variable with the joint distribution Q on
. Note that Q is defined by the prior
and the family
, where
denotes the conditional distribution of Y given
. For each
, let
represent the posterior distribution of Z given
. The aim of the kernel Bayes’ rule is to derive the kernel mean of
.
Definition 1
Let and
be measurable positive definite kernels on
and
such that
and
, respectively, where
denotes the expectation operator. Let
and
be the RKHS defined by
and
, respectively. We consider two bounded linear operators
and
such that
(1)
for any and
, where
and
denote inner products on
and
, respectively. The integral expressions for
and
are given by
where denotes the marginal distribution of X. Let
be the bounded linear operator defined by
for any and
. Then
is the adjoint of
.
Theorem 1
(Fukumizu et al., Citation2013, Theorem 1) If for
, then
.
Definition 2
Let denote the marginal distribution of W. Assuming that
and
, we can define the kernel means of
and
by
respectively. Due to the reproducing properties of and
, the kernel means satisfy
and
for any
and
.
Theorem 2
(Fukumizu et al., Citation2013, Theorem 2) If is injective,
, and
for any
, then
(2)
where denotes the range of
.
Here we have, for any (3)
by replacing in Equation (2) for
. It is noted in Fukumizu et al. (Citation2013) that the assumption
does not hold in general. In order to remove this assumption,
has been suggested to be used instead of
, where
is a regularization constant and I is the identity operator. Thus, the approximations of Equations (2) and (3) are respectively given by
Similarly, for any , the approximation of
is provided by
(4)
where is a regularization constant and the linear operators
and
will be defined below.
Definition 3
We consider the kernel means and
such that
for any and
, where
denotes the tensor product. Let
and
be bounded linear operators which respectively satisfy
(5)
for any and
.
From Theorem 2, Fukumizu et al. (Citation2013) proposed that and
can be given by
In case is not included in
, they suggested that
and
could be approximated by
Remark 1
(Fukumizu et al., Citation2013, p. 3760) and
can respectively be identified with
and
.
Here, we introduce the empirical method for estimating the posterior kernel mean, following Fukumizu et al. (Citation2013).
Definition 4
Suppose we have an independent and identically distributed (i.i.d.) sample from the observed distribution P on
and a sample
from the prior distribution
on
. The prior kernel mean
is estimated by
(6)
where are weights. Let us put
,
, and
.
Proposition 1
(Fukumizu et al., Citation2013, Proposition 3, revised) Let denote the identity matrix of size n. The estimates of
and
are given by
respectively, where .
The proof of this revised proposition is given in Section 6.1. It is suggested in Fukumizu et al. (Citation2013) that Equation (4) can be empirically estimated by
Theorem 3
(Fukumizu et al., Citation2013, Proposition 4) Given an observation ,
can be calculated by
where is the diagonal matrix with the elements of
,
, and
.
If we want to know the posterior expectation of a function given an observation
, it is estimated by
where .
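To make the empirical procedure concrete, the following is a minimal numerical sketch in the spirit of Theorem 3, written in modern Python/NumPy. Since the displayed formulas are not reproduced above, the names (`gauss_gram`, `kbr_posterior_weights`), the Gaussian kernel, and the exact placement of the regularization constants eps and delta follow our own reading of Fukumizu et al. (Citation2013) and should be checked against the original.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    # Gram matrix k(a_i, b_j) for a Gaussian kernel; the bandwidth is our choice
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def kbr_posterior_weights(X, Y, U, gamma, y_obs, eps=0.1, delta=0.1):
    """Weights w such that the posterior kernel mean is sum_i w_i k(., X_i).

    (X, Y): joint sample, U: prior sample, gamma: prior weights,
    y_obs: observed value. A sketch of the kernel Bayes' rule as we read it."""
    n = len(X)
    G_X = gauss_gram(X, X)
    G_Y = gauss_gram(Y, Y)
    # Embed the prior into weights over the joint sample (first regularized inverse)
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), gauss_gram(X, U) @ gamma)
    LG = np.diag(mu) @ G_Y
    kY = gauss_gram(Y, y_obs[None, :]).ravel()
    # Second regularized inversion, following the squared-operator form
    return LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), mu * kY)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1)); Y = X + 0.1 * rng.normal(size=(20, 1))
U = rng.normal(size=(10, 1)); gamma = np.full(10, 0.1)
w = kbr_posterior_weights(X, Y, U, gamma, np.array([0.5]))
print(w.shape)  # (20,)
```

Posterior expectations of a function, as at the end of this section, would then be computed as a weighted sum of the function's values at the sample points with these weights.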
3. Theoretical arguments
In this section, we theoretically support the three arguments raised in Section 1. First, we show in Section 3.1 that the posterior kernel mean is completely unaffected by the prior distribution
under the condition that
and
are non-singular. This implies that, at least in some cases,
does not properly affect
. Second, we mention in Section 3.2 that the linear operators
and
are not always surjective, and address the problems associated with the setting of the regularization parameters
and
. Third, we demonstrate in Section 3.3 that conditional expectation functions are not generally contained in the RKHS, which means that Theorems 1, 2, and 5–8 in Fukumizu et al. (Citation2013) do not work in some situations.
3.1. Relations between the posterior and the prior
Let us review Theorem 3. Assume that and
are non-singular matrices. (This assumption is not so strange, as shown in Section 6.2.) The matrix
tends to
as
tends to 0. Furthermore, if we set
from the beginning, we obtain
. This implies that the posterior kernel mean
never depends on the prior distribution
on
, which seems to contradict the nature of Bayes’ rule.
Some readers may argue that, even in this case, we should not set . Then, however, there is ambiguity about why and how the regularization parameters are introduced in the kernel Bayes’ rule, since Fukumizu et al. originally used the regularization parameters just to solve inverse problems as an analog of ridge regression (Fukumizu et al., Citation2013, p. 3758). They seem to support the validity of the regularization parameters by Theorems 5, 6, 7, and 8 in Fukumizu et al. (Citation2013); however, these theorems do not work without the strong assumption that conditional expectation functions are included in the RKHS, as will be discussed in Section 3.3. In addition, since the theorems work only when
, etc. decay to zero sufficiently slowly, it seems that we have no principled way to choose values for the regularization parameters, except for cross-validation or similar techniques. It is worth mentioning that, in our simple experiments in Section 4.2, we could not obtain a reasonable result with the kernel Bayes’ rule using any combination of values for the regularization parameters.
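The argument of this subsection can be checked with a small linear-algebra experiment. In our reading of Theorem 3, the posterior weights take the form w = ΛG((ΛG)² + δI)⁻¹Λk, where the diagonal matrix Λ carries the prior. The names below are our own, and the entries of Λ are arbitrary positive values standing in for two different priors; with δ = 0 and non-singular matrices, the weights collapse to G⁻¹k, independent of Λ:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.normal(size=(n, n))
G = A @ A.T + n * np.eye(n)          # symmetric, non-singular stand-in for the Gram matrix
k = rng.normal(size=n)               # stand-in for the kernel vector at the observation

def kbr_weights(lam, delta):
    # w = diag(lam) G ((diag(lam) G)^2 + delta I)^{-1} diag(lam) k
    LG = np.diag(lam) @ G
    return LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), lam * k)

lam1 = rng.uniform(0.5, 2.0, n)      # "prior" 1
lam2 = rng.uniform(0.5, 2.0, n)      # a different "prior"
w1 = kbr_weights(lam1, 0.0)
w2 = kbr_weights(lam2, 0.0)
print(np.allclose(w1, w2), np.allclose(w1, np.linalg.solve(G, k)))  # True True
```

The algebra behind the collapse is ΛG((ΛG)²)⁻¹Λk = ΛG·G⁻¹Λ⁻¹G⁻¹Λ⁻¹·Λk = G⁻¹k, which is exactly the prior-independence discussed above.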
3.2. The inverse of the operators
As noted by Fukumizu et al. (Citation2013), the linear operators and
are not surjective in some usual cases, the proof of which is given in Section 6.3. Therefore, they proposed an alternative way of obtaining a solution
of the equation
, that is, a regularized inversion
as an analog of ridge regression, where
is a regularization parameter and I is an identity operator. One of the disadvantages of this method is that the solution
depends upon the choice of
. In Section 4.2, we numerically show that the prediction using the kernel Bayes’ rule considerably depends on the regularization parameters
and
. Theorems 5–8 in Fukumizu et al. (Citation2013) seem to support the appropriateness of the regularized inversion. However, these theorems work under the condition that conditional expectation functions are contained in the RKHS, which does not hold in some cases as proved in Section 3.3. Furthermore, since we need to assume sufficiently slow decay of the regularization constants
and
in these theorems, it is practically difficult to set appropriate values for
and
. A cross-validation procedure seems to be useful for tuning the parameters and may yield good experimental results; however, it seems to lack theoretical justification.
Instead of the regularized inversion method, we can compute generalized inverse matrices of and
, given a sample
. Below, we briefly introduce a generalization of a matrix inverse. For more details, see Horn and Johnson (Citation2013).
Definition 5
Let A be a matrix of size over the complex number space
. We say that a matrix
of size
is a generalized inverse matrix of A if
. We also say that a matrix
of size
is the Moore-Penrose generalized inverse matrix of A if
and
are Hermitian,
, and
.
Remark 2
In fact, any matrix A has the Moore-Penrose generalized inverse matrix . Note that
is uniquely determined by A. If A is square and non-singular, then
. For a generalized inverse matrix
of size
,
for any vector
if v is contained in the image of A. In particular,
is a vector contained in the preimage of v under A.
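Definition 5 and Remark 2 can be illustrated with NumPy's `pinv`, which computes the Moore-Penrose generalized inverse; the matrix below is a hypothetical rank-deficient example:

```python
import numpy as np

A = np.array([[1., 0.], [0., 0.], [2., 0.]])   # rank-deficient 3x2 matrix
A_pinv = np.linalg.pinv(A)                     # Moore-Penrose generalized inverse

# The four Penrose conditions from Definition 5
assert np.allclose(A @ A_pinv @ A, A)
assert np.allclose(A_pinv @ A @ A_pinv, A_pinv)
assert np.allclose((A @ A_pinv).T, A @ A_pinv)   # A A^+ is Hermitian
assert np.allclose((A_pinv @ A).T, A_pinv @ A)   # A^+ A is Hermitian

# If v lies in the image of A, then A (A^+ v) = v, as stated in Remark 2
v = A @ np.array([3., -1.])
print(np.allclose(A @ (A_pinv @ v), v))  # True
```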
In the calculation of , we numerically compare the case
with the original case
in Section 4.2, where
.
3.3. Conditional expectation functions and RKHS
In this subsection, we show that conditional expectation functions are in some cases not contained in the RKHS.
Definition 6
For , we define the spaces
,
, and
as
We also define the norm for
or
as
and the norm for
as
Definition 7
For a function , we define its Fourier transform as
We can uniquely extend the Fourier transform to an isometry . We also define the inverse Fourier transform
as an isometry uniquely determined by
for .
Definition 8
Let us define a Gaussian kernel on
by
As described in Fukumizu (Citation2014), the RKHS of real-valued functions and complex-valued functions corresponding to the positive definite kernel are given by
respectively, and the inner product of or
on the RKHS is calculated by
(7)
where the overline denotes the complex conjugate. Note that is a real Hilbert subspace contained in the complex Hilbert space
.
Fukumizu et al. (Citation2013) mentioned that the conditional expectation function is not always included in
. Indeed, if the variables X and Y are independent, then
becomes a constant function on
, the value of which might be non-zero. In the case that
and
, the constant function with non-zero value is not contained in
.
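The reason a non-zero constant cannot lie in the Gaussian RKHS can be made explicit with a short Fourier-analytic sketch. Since Equation (7) is not reproduced above, the constants are suppressed; this is our hedged reading of the standard characterization of the Gaussian RKHS (cf. Steinwart & Christmann, Citation2008).

```latex
% For the Gaussian kernel k(x,y) = e^{-\alpha \|x-y\|^2} on \mathbb{R}^d, the
% RKHS norm admits, up to positive constants depending on \alpha, the form
\|f\|_{\mathcal{H}}^2 \;\propto\; \int_{\mathbb{R}^d} |\hat{f}(\omega)|^2 \,
  e^{\|\omega\|^2/(4\alpha)} \, d\omega .
% A non-zero constant f \equiv c does not belong to L^2(\mathbb{R}^d); its
% Fourier transform exists only as the distribution c\,(2\pi)^{d/2}\delta_0,
% not as a function, so the integral above cannot be finite and c \notin
% \mathcal{H}.
```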
Additionally, in order to prove Theorems 5 and 8 in Fukumizu et al. (Citation2013), they made the assumption that and
, where
and
are independent copies of the random variables
and
on
, respectively. We also see that this assumption does not hold in general. Suppose that X and Y are independent and that so are
and
. Then
is a constant function of
, the value of which might be non-zero. In the case that
and
, the constant function having non-zero value is not contained in
. Note that
is isomorphic to the RKHS corresponding to the kernel
on
, that is,
where the Fourier transform of is defined by
Thus, the assumption that conditional expectation functions are included in the RKHS does not hold in general. Since most of the theorems in Fukumizu et al. (Citation2013) require this assumption, the kernel Bayes’ rule may not work in several cases.
4. Numerical experiments
In this section, we perform numerical experiments to illustrate the theoretical results in Sections 3.1 and 3.2. We first introduce probabilistic classifiers in Section 4.1 based on conventional Bayes’ rule assuming Gaussian distributions (BR), the original kernel Bayes’ rule (KBR1), and the kernel Bayes’ rule using Moore-Penrose generalized inverse matrices (KBR2). In Section 4.2, we apply the three classifiers to a binary classification problem with computer-simulated data sets. Numerical experiments are implemented in version 2.7.6 of the Python software (Python Software Foundation, Wolfeboro Falls, NH, USA).
4.1. Algorithms of the three classifiers, BR, KBR1, and KBR2
Let be a random variable with a distribution P on
, where
is a family of classes and
. Let
and Q be the prior and the joint distributions on
and
, respectively. Suppose we have an i.i.d. training sample
from the distribution P. The aim of this subsection is to derive algorithms of the three classifiers, BR, KBR1, and KBR2, which respectively calculate the posterior probability for each class given an observation
, that is,
.
4.1.1. The algorithm of BR
In BR, we estimate the posterior probability of j-th class given a test value
by
where is the density function of the d-dimensional normal distribution
defined by
The mean vector and the covariance matrix
are calculated from the training data of the class
.
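The BR classifier above can be sketched as follows. All names are our own, the data are hypothetical, and a small jitter term is added to the covariance estimate purely for numerical stability:

```python
import numpy as np

def fit_gaussian(X):
    """Mean vector and covariance matrix estimated from one class's training data."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False) + 1e-9 * np.eye(X.shape[1])  # jitter for stability
    return mu, S

def normal_pdf(y, mu, S):
    # Density of the d-dimensional normal distribution N(mu, S) at y
    d = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(S, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

def br_posterior(y, class_data, prior):
    """P(class j | y) via conventional Bayes' rule with Gaussian class densities."""
    like = np.array([normal_pdf(y, *fit_gaussian(Xj)) for Xj in class_data])
    post = prior * like
    return post / post.sum()

rng = np.random.default_rng(2)
X0 = rng.normal(-1.0, 1.0, size=(50, 2))   # hypothetical class-0 training data
X1 = rng.normal(+1.0, 1.0, size=(50, 2))   # hypothetical class-1 training data
p = br_posterior(np.zeros(2), [X0, X1], np.array([0.5, 0.5]))
print(p.sum())  # posterior probabilities sum to 1
```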
4.1.2. The algorithm of KBR1
Let us define positive definite kernels and
as
for and
, and the corresponding RKHS as
and
, respectively. Here we set
for
. Then, the prior kernel mean is given by
where . Let us put
,
,
,
,
,
,
,
, and
, where
is the identity matrix of size n and
are heuristically set regularization parameters. Note that
stands for the indicator function of a set A described as
Following Theorem 3, the posterior kernel mean given a test value is estimated by
Here, we estimate the posterior probabilities for classes given a test value by
4.1.3. The algorithm of KBR2
Let denote the Moore-Penrose generalized inverse matrix of
. Let us put
,
, and
. Replacing
in Section 4.1.2 for
, the posterior probabilities for classes given a test value
are estimated by
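As the two subsections above indicate, KBR1 and KBR2 share the same pipeline and differ only in the inversion step. The following loose, self-contained sketch isolates exactly that point; it uses only the observation-space kernel and encodes the prior as class-proportional weights over the training sample, so it is an illustrative simplification rather than the authors' exact construction:

```python
import numpy as np

def gram(a, b, s=1.0):
    # Gaussian-kernel Gram matrix for 1-D inputs (bandwidth is our choice)
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s * s))

def reg_inv(M, reg):   # KBR1-style regularized inverse (M + reg I)^{-1}
    return np.linalg.inv(M + reg * np.eye(len(M)))

def mp_inv(M, reg):    # KBR2-style Moore-Penrose inverse; reg is ignored
    return np.linalg.pinv(M)

def class_posterior(Y, Z, y, prior, inv, eps=0.1, delta=0.1):
    """Unnormalized posterior class scores given a test value y (our sketch).

    Y: training observations, Z: class labels, prior: prior class probabilities,
    inv: which inversion scheme to use (reg_inv for KBR1, mp_inv for KBR2)."""
    n = len(Y)
    G_Y = gram(Y, Y)
    counts = np.array([(Z == c).sum() for c in (0, 1)])
    gamma = prior[Z] / counts[Z]              # prior as weights over the sample
    mu = inv(G_Y, n * eps) @ (G_Y @ gamma)    # hypothetical prior embedding
    LG = np.diag(mu) @ G_Y
    w = LG @ inv(LG @ LG, delta) @ (mu * gram(Y, np.array([y]))[:, 0])
    # Sum the weights over each class, i.e. sum_i w_i 1_{Z_i = c}
    return np.array([w[Z == c].sum() for c in (0, 1)])

rng = np.random.default_rng(3)
Y = np.concatenate([rng.normal(-1, 1, 30), rng.normal(1, 1, 30)])
Z = np.array([0] * 30 + [1] * 30)
p1 = class_posterior(Y, Z, 0.0, np.array([0.5, 0.5]), reg_inv)   # KBR1-style
p2 = class_posterior(Y, Z, 0.0, np.array([0.5, 0.5]), mp_inv)    # KBR2-style
print(p1.shape, p2.shape)
```

Swapping `reg_inv` for `mp_inv` is the only change between the two variants, mirroring how KBR2 replaces the regularized inverses of KBR1 with Moore-Penrose generalized inverses.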
4.2. Probabilistic predictions by the three classifiers
Here, we apply the three classifiers defined in Section 4.1 to a binary classification problem using computer-simulated data sets, where and
. In the first step, we independently generate 100 sets of training samples with each training sample being
, where
and
if
,
and
if
,
,
, and
. Here,
and
are sampled i.i.d. from
and
, respectively. Individual Y-values of one of the training samples are plotted in Figure .
With each of the 100 training samples and a simulated prior probability of , or
, the classifiers defined in Section 4.1 estimate the posterior probability of
given a test value
, that is,
. Figures – show the mean (plus or minus standard error of the mean, SEM) of the 100 values of
calculated by each of the classifiers, BR, KBR1, and KBR2. Here we show the case where
in KBR1 and KBR2 is fixed to 0.1, and the regularization parameters of KBR1 are set to be
(Figure ),
(Figure ),
(Figure ), and
(Figure ). In Figures –, BR_th represents the theoretical value of BR, which coincides with BR if the parameters
,
,
, and
are set to be
,
,
, and
, respectively.
Consistent with Section 3.1, calculated by KBR1 is only weakly influenced by
compared with that by BR when
and
are set to be small (see Figures and ). In addition,
calculated by KBR2 also seems to be uninfluenced by
. When
and
are set to be larger, the effect of
on
becomes apparent in KBR1; however, the value of
becomes too small (see Figures and ). These results suggest that in the kernel Bayes’ rule, the posterior does not depend on the prior if
and
are negligible, which seems to contradict the nature of Bayes’ theorem. Moreover, even though the prior affects the posterior when
and
become larger, the posterior seems overly dependent on
and
, which are initially defined just for the regularization of matrices.
We have also tested all possible combinations of the following values for the parameters in KBR1 and/or KBR2: ,
, and
. All the experimental results have been evaluated in a similar manner as above, and none of the results are found to be reasonable in the context of Bayesian inference (see Supplementary material).
5. Conclusions
One of the important features of Bayesian inference is that it provides a reasonable way of updating the probability for a hypothesis as additional evidence is acquired. The kernel Bayes’ rule has been expected to enable Bayesian inference in RKHS. In other words, the posterior kernel mean has been considered to be reasonably estimated by the kernel Bayes’ rule, given kernel mean expressions of the prior and likelihood. What is “reasonable” depends on circumstances; however, some of the results in this paper seem to show obviously unreasonable aspects of the kernel Bayes’ rule, at least in the context of Bayesian inference.
First, as shown in Section 3.1, when and
are non-singular matrices and so we set
, the posterior kernel mean
is entirely unaffected by the prior distribution
on
. This means that, in Bayesian inference with the kernel Bayes’ rule, prior beliefs are in some cases completely neglected in calculating the kernel mean of the posterior distribution. Numerical evidence is also presented in Section 4.2. When the regularization parameters
and
are set to be small, the posterior probability calculated by the kernel Bayes’ rule (KBR1) is almost unaffected by the prior probability in comparison with that by conventional Bayes’ rule (BR). Consistently, when the regularized inverse matrices in KBR1 are replaced with the Moore-Penrose generalized inverse matrices (KBR2), the posterior probability is also uninfluenced by the prior probability, which seems to be unsuitable in the context of Bayesian updating of a probability distribution.
Second, as discussed in Sections 3.2 and 4.2, the posterior estimated by the kernel Bayes’ rule considerably depends upon the regularization parameters and
, which are originally introduced just for the regularization of matrices. A cross-validation approach is proposed in Fukumizu et al. (Citation2013) to search for the optimal values of the parameters. However, theoretical foundations seem to be insufficient for the correct tuning of the parameters. Furthermore, in our experimental settings, we are not able to obtain a reasonable result using any combination of the parameter values, suggesting the possibility that there are no appropriate values for the parameters in general. Thus, we consider it difficult to solve the problem that
and
are not surjective by just adding regularization parameters.
Third, as shown in Section 3.3, the assumption that conditional expectation functions are included in the RKHS does not hold in general. Since this assumption is necessary for most of the theorems in Fukumizu et al. (Citation2013), we believe that the assumption itself may need to be reconsidered.
In summary, even though current research efforts are focused on the application of the kernel Bayes’ rule (Fukumizu et al., Citation2013; Kanagawa et al., Citation2014), it might be necessary to reexamine its basic framework of combining new evidence with prior beliefs.
6. Proofs
In this section, we provide some proofs for Sections 2 and 3.
6.1. Estimation of the operators
Here we give the proof of Proposition 1.
Proof
Let ,
, and
denote the estimates of
,
, and
, respectively. We define the estimates of
and
as
and put . According to Equation (5), for any
and
,
where represents the empirical expectation operator. Thus, from Remark 1,
(8)
Similarly, for any ,
Thus, from Remark 1, (9)
Next, we will derive . Since
is a self-adjoint operator,
for any . On the other hand, from Equation (1),
for any . Hence, we have
(10)
for any . Replacing f in Equation (10) for
, we have
(11)
Using Equation (6), the left hand side of Equation (11) is given by
Therefore, we have
Replacing for
, Equations (8) and (9) become
respectively.
6.2. Non-singularity of the Gram matrices
Here we show that the assumption in Section 3.1 holds under reasonable conditions.
Definition 9
Let f be a real-valued function defined on a non-empty open domain . We say that f is analytic if f can be described by a Taylor expansion on a neighborhood of each point of
.
Proposition 2
Let k be a positive definite kernel on . Let
be a probability measure on
which is absolutely continuous with respect to Lebesgue measure. Assume that k is an analytic function on
and that the RKHS corresponding to k is infinite dimensional. Then for any i.i.d. random variables
with the same distribution
, the Gram matrix
is non-singular almost surely with respect to
.
Proof
Let us put . Since the RKHS corresponding to k is infinite dimensional, there are
such that
are linearly independent. Then
and hence f is a non-zero analytic function. Note that any non-trivial subvariety of a Euclidean space defined by an analytic function has Lebesgue measure zero. By this fact, the subvariety
has Lebesgue measure zero. Since is absolutely continuous,
. This completes the proof.
From Proposition 2, we easily obtain the following corollary.
Corollary 1
Let k be a Gaussian kernel on and let
be i.i.d. random variables with the same normal distribution on
. Then the Gram matrix
is non-singular almost surely.
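Corollary 1 is easy to check numerically; the sample size and bandwidth below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
X = rng.normal(size=n)                              # i.i.d. sample from a normal distribution
G = np.exp(-2.0 * (X[:, None] - X[None, :]) ** 2)   # Gaussian-kernel Gram matrix
print(np.linalg.matrix_rank(G) == n)                # full rank, i.e. non-singular

# For larger n the Gram matrix remains non-singular in exact arithmetic, but its
# condition number grows rapidly, which is relevant to the regularization issues
# discussed in Section 3.2.
```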
Proposition 3
Let k be a positive definite kernel on ,
a probability measure on
which is absolutely continuous with respect to Lebesgue measure. Assume that k is an analytic function on
and that the RKHS
corresponding to k is infinite dimensional. Then for any
except Lebesgue measure zero, and for any i.i.d. random variables
with the same distribution
, each
for
is (defined almost surely and) non-zero almost surely, where
,
, and
. Here
denotes the set of positive real numbers.
Proof
Let us put ,
, and
for and
. Since the RKHS corresponding to k is infinite-dimensional, we can obtain
such that
are linearly independent. The Gram matrix
is positive definite, and its eigenvalues are all positive. Hence
for each
, and
is a non-zero analytic function on
for each
.
For , let us define a closed measure-zero set
. Then
is defined on
for each
. Using Cramer’s rule,
where stands for the m-th column vector of
. Here we denote by
the numerator of
, that is,
.
It is easy to see that is a non-zero analytic function of
on
. Indeed, if
,
,
, and
, then
. Hence
is a closed subset of
with Lebesgue measure zero for each
. For any
,
is a closed subset of with Lebesgue measure zero for each
, since
is a non-zero analytic function of
on
. Therefore,
is defined and non-zero for
and for
if
. This completes the proof.
The following corollary directly follows from Proposition 3.
Corollary 2
Let k be a Gaussian kernel on and let
be i.i.d. random variables with the same normal distribution on
. All other notations are as in Proposition 3. Then
is non-singular almost surely for any
except for those in a set of Lebesgue measure zero.
6.3. Non-surjectivity of the covariance operators
The covariance operators and
are not surjective in general. This can be verified by the fact that they are compact operators. (If the operators are surjective on the corresponding RKHS, which is infinite-dimensional, then they cannot be compact because of the open mapping theorem.) Here we present some simple examples where
and
are not surjective. Let us consider for simplicity the case
. Let X be a random variable on
with a normal distribution
. We prove that
is not surjective under the usual assumption that the positive definite kernel on
is Gaussian. In order to demonstrate this, we use the symbols defined in Section 3.3 and several known results on function spaces and Fourier transforms (see Rudin, Citation1987, for example). Note that the following three propositions are introduced without proofs.
Proposition 4
Let us put for
, where
. Then
Proposition 5
For ,
almost everywhere. In particular, if
, then
almost everywhere.
Proposition 6
For , put
. Then
.
Definition 10
Let denote the density function of the normal distribution
on
, that is,
Let X be a random variable on with
. The linear operator
is defined by
for any
, which is also described as
for any .
Proposition 7
If , then
Proof
From Proposition 5, and
for any
. Then, using Equation (7), we have
Therefore, .
Proposition 8
If , then
.
Proof
From Proposition 5, for
. Then, using Equation (7), we have
Therefore, .
Here, we denote by and
the real part and the imaginary part of a complex number, respectively. We also denote by
the closure of a subset A in a topological space.
Corollary 3
If , then
.
Proof
If , then
by Proposition 8. Hence we see that
This completes the proof.
Remark 3
If , then there uniquely exist
such that
by Corollary 3. This means that
, where
denotes the direct sum.
Proposition 9
For any and for any
, there exists
such that
. In other words,
is dense in
.
Proof
Let denote the space of continuous complex-valued functions with compact support on
. Let us define
by
Note that coincides with the image of
by the Fourier transform. Then,
and
. Hence
. In other words, for any
and for any
, there exists
such that
because
, which implies that there exists
such that
. This completes the proof.
The following corollary has also been shown in Theorem 4.63 in Steinwart and Christmann (Citation2008).
Corollary 4
.
Proof
From Proposition 9, for any and for any
, there exists
such that
. By Remark 3, there exist
such that
. Thus,
Therefore, . This completes the proof.
Definition 11
Let us define as
where and
denote the indicator functions of the intervals
and
, respectively. We also put
and
. Note that
, because
Proposition 10
.
Proof
It is obvious that . Since
, we see that
where . Hence,
. On the other hand,
Therefore .
Let us define for
. Now, we prove that
for any
. This implies that
is not surjective.
Proposition 11
For any ,
.
Proof
Suppose that there exists such that
. Then, for any
,
(12)
Let us put . From Proposition 4,
. Then, using Equation (7) and Proposition 6, the left hand side of Equation (12) equals
The right hand side of Equation (12) is equal to
Thus, Equation (12) is equivalent to the following equation: (13)
Let us define and
. Then
. It is easy to see that
as
. Hence
in
. Since
by Proposition 6, we have
which indicates that . Substituting
for f, Equation (13) becomes
(14)
As n goes to infinity, the left hand side of Equation (14) becomes . On the other hand, the right hand side of Equation (14) becomes
This is a contradiction. Therefore, there exists no such that
. This completes the proof.
7. Supplementary material
Supplementary material for this article can be accessed here http://dx.doi.org/10.1080/23311835.2018.1447220.
Notes on contributors
Hisashi Johno
Hisashi Johno is a PhD student at the Department of Mathematical Sciences, Faculty of Medicine, University of Yamanashi, Japan. His current research interests include probability theory and interpretable machine learning.
Kazunori Nakamoto
Kazunori Nakamoto is a professor of mathematics at Center for Medical Education and Sciences, Faculty of Medicine, University of Yamanashi, Japan. His main research interests include algebraic geometry, invariant theory, and the moduli of representations.
Tatsuhiko Saigo
Tatsuhiko Saigo is an associate professor of probability and statistics at Center for Medical Education and Sciences, Faculty of Medicine, University of Yamanashi, Japan. His research fields include probability theory, statistics, and applied mathematics.
References
- Fukumizu, K. (2014). Introduction to kernel methods (in Japanese). Tokyo: Asakura Shoten.
- Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes’ rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753–3783.
- Horn, R. A., & Johnson, C. R. (2013). Matrix analysis (2nd ed.). Cambridge: Cambridge University Press.
- Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (pp. 1897–1903).
- Rudin, W. (1987). Real and complex analysis (3rd ed.). New York, NY: McGraw-Hill Book Co.
- Song, L., Fukumizu, K., & Gretton, A. (2014). Kernel embeddings of conditional distributions. IEEE Signal Processing Magazine, 30, 98–111.
- Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 961–968).
- Steinwart, I., & Christmann, A. (2008). Support vector machines. New York, NY: Springer.