Research Article

Remarks on kernel Bayes’ rule

Hisashi Johno, Kazunori Nakamoto & Tatsuhiko Saigo | (Reviewing Editor)
Article: 1447220 | Received 09 Apr 2017, Accepted 25 Feb 2018, Published online: 24 Apr 2018

Abstract

The kernel Bayes’ rule has been proposed as a nonparametric, kernel-based method for realizing Bayesian inference in reproducing kernel Hilbert spaces. However, we demonstrate both theoretically and experimentally that the way the prior is incorporated in the kernel Bayes’ rule is unnatural. In particular, we show that under some reasonable conditions the posterior in the kernel Bayes’ rule is completely unaffected by the prior, which seems inappropriate in the context of Bayesian inference. We consider that this phenomenon arises in part because the assumptions of the kernel Bayes’ rule do not hold in general.

Public interest statement

This paper examines the validity of the kernel Bayes’ rule, a recently proposed nonparametric framework for Bayesian inference. Researchers working on the kernel Bayes’ rule aim to apply the method to a wide range of Bayesian inference problems. However, as we demonstrate in this paper, the way the prior is incorporated in the kernel Bayes’ rule appears to be flawed in the context of Bayesian inference. Several theorems of the kernel Bayes’ rule rely on a strong assumption that does not hold in general.

The problems with the kernel Bayes’ rule appear to be nontrivial and difficult, and we do not currently know how to solve them. We hope that this study will prompt a reexamination and correction of the basic framework of the kernel Bayes’ rule.

1. Introduction

The kernel Bayes’ rule has recently emerged as a novel framework for Bayesian inference (Fukumizu, Song, & Gretton, Citation2013; Song, Fukumizu, & Gretton, Citation2014; Song, Huang, Smola, & Fukumizu, Citation2009). It is generally agreed that, in this framework, we can estimate the kernel mean of the posterior distribution, given kernel mean expressions of the prior and likelihood distributions. Since the distributions are mapped into, and nonparametrically manipulated in, infinite-dimensional feature spaces called reproducing kernel Hilbert spaces (RKHS), it is believed that the kernel Bayes’ rule can accurately evaluate the statistical features of high-dimensional data and enable Bayesian inference even when no appropriate parametric model is available. To date, several applications of the kernel Bayes’ rule have been reported (Fukumizu et al., Citation2013; Kanagawa et al., Citation2014). However, the basic theory and the algorithm of the kernel Bayes’ rule might need to be modified for the following reasons:

(1) The posterior in the kernel Bayes’ rule is in some cases completely unaffected by the prior.

(2) The posterior in the kernel Bayes’ rule depends considerably on the choice of the parameters used to regularize covariance operators.

(3) It does not hold in general that conditional expectation functions are included in the RKHS, which is an essential assumption of the kernel Bayes’ rule.

This paper is organized as follows. We begin in Section 2 with a brief review of the kernel Bayes’ rule. In Section 3, we theoretically address the three arguments described above. Numerical experiments are performed in Section 4 to confirm the theoretical results in Section 3. In Section 5, we summarize the theoretical and experimental results and present our conclusions. Some of the proofs for Sections 2 and 3 are given in Section 6.

2. Kernel Bayes’ rule

In this section, we briefly review the kernel Bayes’ rule following Fukumizu et al. (Citation2013). Let $\mathcal{X}$ and $\mathcal{Y}$ be measurable spaces, $(X,Y)$ be a random variable with an observed distribution $P$ on $\mathcal{X}\times\mathcal{Y}$, $U$ be a random variable with the prior distribution $\Pi$ on $\mathcal{X}$, and $(Z,W)$ be a random variable with the joint distribution $Q$ on $\mathcal{X}\times\mathcal{Y}$. Note that $Q$ is defined by the prior $\Pi$ and the family $\{P_{Y|x} \mid x\in\mathcal{X}\}$, where $P_{Y|x}$ denotes the conditional distribution of $Y$ given $X=x$. For each $y\in\mathcal{Y}$, let $Q_{X|y}$ represent the posterior distribution of $Z$ given $W=y$. The aim of the kernel Bayes’ rule is to derive the kernel mean of $Q_{X|y}$.

Definition 1

Let $k_X$ and $k_Y$ be measurable positive definite kernels on $\mathcal{X}$ and $\mathcal{Y}$ such that $E[k_X(X,X)]<\infty$ and $E[k_Y(Y,Y)]<\infty$, respectively, where $E[\cdot]$ denotes the expectation operator. Let $\mathcal{H}_X$ and $\mathcal{H}_Y$ be the RKHS defined by $k_X$ and $k_Y$, respectively. We consider two bounded linear operators $C_{YX}:\mathcal{H}_X\to\mathcal{H}_Y$ and $C_{XX}:\mathcal{H}_X\to\mathcal{H}_X$ such that
(1)   $\langle g, C_{YX}f\rangle_{\mathcal{H}_Y}=E[f(X)g(Y)]$ and $\langle f_1, C_{XX}f_2\rangle_{\mathcal{H}_X}=E[f_1(X)f_2(X)]$

for any $f,f_1,f_2\in\mathcal{H}_X$ and $g\in\mathcal{H}_Y$, where $\langle\cdot,\cdot\rangle_{\mathcal{H}_X}$ and $\langle\cdot,\cdot\rangle_{\mathcal{H}_Y}$ denote the inner products on $\mathcal{H}_X$ and $\mathcal{H}_Y$, respectively. The integral expressions for $C_{YX}$ and $C_{XX}$ are given by
\[ (C_{YX}f)(\cdot)=\int_{\mathcal{X}\times\mathcal{Y}} k_Y(\cdot,y)\,f(x)\,dP(x,y) \quad\text{and}\quad (C_{XX}f)(\cdot)=\int_{\mathcal{X}} k_X(\cdot,x)\,f(x)\,dP_X(x), \]

where $P_X$ denotes the marginal distribution of $X$. Let $C_{XY}$ be the bounded linear operator defined by
\[ \langle f, C_{XY}g\rangle_{\mathcal{H}_X}=E[f(X)g(Y)] \]

for any $f\in\mathcal{H}_X$ and $g\in\mathcal{H}_Y$. Then $C_{XY}$ is the adjoint of $C_{YX}$.

Theorem 1

(Fukumizu et al., Citation2013, Theorem 1)    If $E[g(Y)|X=\cdot]\in\mathcal{H}_X$ for $g\in\mathcal{H}_Y$, then $C_{XX}E[g(Y)|X=\cdot]=C_{XY}g$.

Definition 2

Let $Q_Y$ denote the marginal distribution of $W$. Assuming that $E[k_X(U,U)]<\infty$ and $E[k_Y(W,W)]<\infty$, we can define the kernel means of $\Pi$ and $Q_Y$ by
\[ m_\Pi=E[k_X(\cdot,U)] \quad\text{and}\quad m_{Q_Y}=E[k_Y(\cdot,W)], \]

respectively. Due to the reproducing properties of $\mathcal{H}_X$ and $\mathcal{H}_Y$, the kernel means satisfy $\langle f, m_\Pi\rangle_{\mathcal{H}_X}=E[f(U)]$ and $\langle g, m_{Q_Y}\rangle_{\mathcal{H}_Y}=E[g(W)]$ for any $f\in\mathcal{H}_X$ and $g\in\mathcal{H}_Y$.
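To make Definition 2 concrete, here is a minimal Python sketch (ours, not from the paper) of the standard empirical estimate of a kernel mean, assuming a Gaussian kernel and an i.i.d. sample from the prior; by the reproducing property, evaluating the empirical kernel mean at a point approximates $E[k_X(x,U)]$.

```python
import numpy as np

def gauss_kernel(x, u, sigma=1.0):
    # Gaussian kernel on R^d; a hypothetical choice for this illustration.
    return np.exp(-np.sum((x - u) ** 2) / (2.0 * sigma ** 2))

def empirical_kernel_mean(sample, sigma=1.0):
    # Returns the function x -> (1/l) * sum_j k(x, U_j).
    return lambda x: np.mean([gauss_kernel(x, u, sigma) for u in sample])

rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 1))          # i.i.d. sample from a hypothetical prior
m_pi = empirical_kernel_mean(U)
print(m_pi(np.array([0.5])))            # approximates E[k_X(0.5, U)]
```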

Theorem 2

(Fukumizu et al., Citation2013, Theorem 2)    If $C_{XX}$ is injective, $m_\Pi\in\mathrm{Ran}(C_{XX})$, and $E[g(Y)|X=\cdot]\in\mathcal{H}_X$ for any $g\in\mathcal{H}_Y$, then
(2)   $m_{Q_Y}=C_{YX}C_{XX}^{-1}m_\Pi$,

where $\mathrm{Ran}(C_{XX})$ denotes the range of $C_{XX}$.

Here we have, for any $x\in\mathcal{X}$,
(3)   $E[k_Y(\cdot,Y)\,|\,X=x]=C_{YX}C_{XX}^{-1}k_X(\cdot,x)$

by replacing $m_\Pi$ in Equation (2) with $k_X(\cdot,x)$. It is noted in Fukumizu et al. (Citation2013) that the assumption $m_\Pi\in\mathrm{Ran}(C_{XX})$ does not hold in general. To remove this assumption, it has been suggested to use $(C_{XX}+\epsilon I)^{-1}$ instead of $C_{XX}^{-1}$, where $\epsilon$ is a regularization constant and $I$ is the identity operator. Thus, the approximations of Equations (2) and (3) are respectively given by
\[ m_{Q_Y}^{\mathrm{reg}}=C_{YX}\left(C_{XX}+\epsilon I\right)^{-1}m_\Pi \quad\text{and}\quad E^{\mathrm{reg}}[k_Y(\cdot,Y)\,|\,X=x]=C_{YX}\left(C_{XX}+\epsilon I\right)^{-1}k_X(\cdot,x). \]

Similarly, for any $y\in\mathcal{Y}$, the approximation of $m_{Q_{X|y}}$ is given by
(4)   $m_{Q_{X|y}}^{\mathrm{reg}}=E^{\mathrm{reg}}[k_X(\cdot,Z)\,|\,W=y]=C_{ZW}\left(C_{WW}+\delta I\right)^{-1}k_Y(\cdot,y),$

where $\delta$ is a regularization constant and the linear operators $C_{ZW}$ and $C_{WW}$ will be defined below.
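Before moving on, it may help to see the regularized inversion in its usual finite-sample form. The sketch below is our own illustration with hypothetical data and kernel bandwidth: it computes the weight vector $w=(G_X+n\epsilon I_n)^{-1}\mathbf{k}_X(x)$ so that $\sum_i w_i\,k_Y(\cdot,Y_i)$ plays the role of $C_{YX}(C_{XX}+\epsilon I)^{-1}k_X(\cdot,x)$; this is the standard empirical counterpart, not a formula quoted verbatim from Fukumizu et al. (Citation2013).

```python
import numpy as np

def gram(xs, ys, sigma):
    # Gaussian Gram matrix (k(x_i, y_j))_{ij}; a hypothetical kernel choice.
    d2 = np.sum((xs[:, None, :] - ys[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def conditional_mean_weights(X, x, eps, sigma):
    # Weights w with m_hat_{Y|x} = sum_i w_i k_Y(., Y_i), i.e. w = (G_X + n*eps*I)^{-1} k_X(x).
    n = X.shape[0]
    G_X = gram(X, X, sigma)
    k_x = gram(X, x[None, :], sigma)[:, 0]
    return np.linalg.solve(G_X + n * eps * np.eye(n), k_x)

# Toy usage with hypothetical data: Y = X + noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
Y = X + 0.1 * rng.normal(size=(200, 1))
w = conditional_mean_weights(X, np.array([0.3]), eps=1e-3, sigma=0.5)
print(float(w @ Y[:, 0]))   # heuristic estimate of E[Y | X = 0.3]; close to 0.3
```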

Definition 3

We consider the kernel means $m_Q=m_{(ZW)}$ and $m_{(WW)}$ such that
\[ \langle m_{(ZW)}, g\otimes f\rangle_{\mathcal{H}_Y\otimes\mathcal{H}_X}=E[f(Z)g(W)] \quad\text{and}\quad \langle m_{(WW)}, g_1\otimes g_2\rangle_{\mathcal{H}_Y\otimes\mathcal{H}_Y}=E[g_1(W)g_2(W)] \]

for any $f\in\mathcal{H}_X$ and $g,g_1,g_2\in\mathcal{H}_Y$, where $\otimes$ denotes the tensor product. Let $C_{(YX)X}:\mathcal{H}_X\to\mathcal{H}_Y\otimes\mathcal{H}_X$ and $C_{(YY)X}:\mathcal{H}_X\to\mathcal{H}_Y\otimes\mathcal{H}_Y$ be bounded linear operators which respectively satisfy
(5)   $\langle g\otimes f, C_{(YX)X}h\rangle_{\mathcal{H}_Y\otimes\mathcal{H}_X}=E[g(Y)f(X)h(X)]$, $\quad\langle g_1\otimes g_2, C_{(YY)X}f\rangle_{\mathcal{H}_Y\otimes\mathcal{H}_Y}=E[g_1(Y)g_2(Y)f(X)]$

for any $f,h\in\mathcal{H}_X$ and $g,g_1,g_2\in\mathcal{H}_Y$.

From Theorem 2, Fukumizu et al. (Citation2013) proposed that $m_{(ZW)}$ and $m_{(WW)}$ can be given by
\[ m_{(ZW)}=C_{(YX)X}C_{XX}^{-1}m_\Pi\in\mathcal{H}_Y\otimes\mathcal{H}_X \quad\text{and}\quad m_{(WW)}=C_{(YY)X}C_{XX}^{-1}m_\Pi\in\mathcal{H}_Y\otimes\mathcal{H}_Y. \]

In case $m_\Pi$ is not included in $\mathrm{Ran}(C_{XX})$, they suggested that $m_{(ZW)}$ and $m_{(WW)}$ could be approximated by
\[ m_{(ZW)}^{\mathrm{reg}}=C_{(YX)X}\left(C_{XX}+\epsilon I\right)^{-1}m_\Pi \quad\text{and}\quad m_{(WW)}^{\mathrm{reg}}=C_{(YY)X}\left(C_{XX}+\epsilon I\right)^{-1}m_\Pi. \]

Remark 1

(Fukumizu et al., Citation2013, p. 3760)    $m_{(ZW)}$ and $m_{(WW)}$ can respectively be identified with $C_{ZW}$ and $C_{WW}$.

Here, we introduce the empirical method for estimating the posterior kernel mean $m_{Q_{X|y}}$ following Fukumizu et al. (Citation2013).

Definition 4

Suppose we have an independent and identically distributed (i.i.d.) sample $\{(X_i,Y_i)\}_{i=1}^n$ from the observed distribution $P$ on $\mathcal{X}\times\mathcal{Y}$ and a sample $\{U_j\}_{j=1}^l$ from the prior distribution $\Pi$ on $\mathcal{X}$. The prior kernel mean $m_\Pi$ is estimated by
(6)   $\hat{m}_\Pi=\sum_{j=1}^l \gamma_j\,k_X(\cdot,U_j),$

where $\gamma_1,\dots,\gamma_l$ are weights. Let us put $\hat{m}_\Pi=(\hat{m}_\Pi(X_1),\dots,\hat{m}_\Pi(X_n))^{\top}$, $G_X=(k_X(X_i,X_j))_{1\le i,j\le n}$, and $G_Y=(k_Y(Y_i,Y_j))_{1\le i,j\le n}$.

Proposition 1

(Fukumizu et al., Citation2013, Proposition 3, revised)   Let $I_n$ denote the identity matrix of size $n$. The estimates of $C_{ZW}$ and $C_{WW}$ are given by
\[ \hat{C}_{ZW}=\sum_{i=1}^n \hat{\mu}_i\,k_X(\cdot,X_i)\otimes k_Y(\cdot,Y_i) \quad\text{and}\quad \hat{C}_{WW}=\sum_{i=1}^n \hat{\mu}_i\,k_Y(\cdot,Y_i)\otimes k_Y(\cdot,Y_i), \]

respectively, where $\hat{\mu}=(\hat{\mu}_1,\dots,\hat{\mu}_n)^{\top}=(G_X+n\epsilon I_n)^{-1}\hat{m}_\Pi$.

The proof of this revised proposition is given in Section 6.1. It is suggested in Fukumizu et al. (Citation2013) that Equation (4) can be empirically estimated by
\[ \hat{m}_{Q_{X|y}}=\hat{C}_{ZW}\left(\hat{C}_{WW}^2+\delta I\right)^{-1}\hat{C}_{WW}\,k_Y(\cdot,y). \]

Theorem 3

(Fukumizu et al., Citation2013, Proposition 4)    Given an observation $y\in\mathcal{Y}$, $\hat{m}_{Q_{X|y}}$ can be calculated by
\[ \hat{m}_{Q_{X|y}}=\mathbf{k}_X^{\top}R_{X|Y}\,\mathbf{k}_Y(y), \qquad R_{X|Y}=\Lambda G_Y\left((\Lambda G_Y)^2+\delta I\right)^{-1}\Lambda, \]

where $\Lambda=\mathrm{diag}(\hat{\mu})$ is the diagonal matrix with the elements of $\hat{\mu}$, $\mathbf{k}_X=\left(k_X(\cdot,X_1),\dots,k_X(\cdot,X_n)\right)^{\top}$, and $\mathbf{k}_Y=\left(k_Y(\cdot,Y_1),\dots,k_Y(\cdot,Y_n)\right)^{\top}$.

If we want to know the posterior expectation of a function $f\in\mathcal{H}_X$ given an observation $y\in\mathcal{Y}$, it is estimated by
\[ \langle f, \hat{m}_{Q_{X|y}}\rangle_{\mathcal{H}_X}=\mathbf{f}_X^{\top}R_{X|Y}\,\mathbf{k}_Y(y), \]

where $\mathbf{f}_X=(f(X_1),\dots,f(X_n))^{\top}$.
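The matrix computations in Theorem 3 can be written out in a few lines of Python. The sketch below is our own illustration (the kernels, bandwidth, data, and prior weights are placeholders); only the formulas $\hat{\mu}=(G_X+n\epsilon I_n)^{-1}\hat{m}_\Pi$, $R_{X|Y}=\Lambda G_Y((\Lambda G_Y)^2+\delta I)^{-1}\Lambda$, and $\mathbf{f}_X^{\top}R_{X|Y}\mathbf{k}_Y(y)$ come from the text.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    # Gaussian Gram matrix between the rows of A and B (hypothetical kernel choice).
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kbr_posterior_weights(X, Y, U, gamma, y, eps, delta, sigma):
    # Returns R_{X|Y} k_Y(y), the weight vector of the estimated posterior kernel mean.
    n = X.shape[0]
    G_X = gauss_gram(X, X, sigma)
    G_Y = gauss_gram(Y, Y, sigma)
    m_pi = gauss_gram(X, U, sigma) @ gamma            # (m_hat_Pi(X_1), ..., m_hat_Pi(X_n))
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), m_pi)
    Lam = np.diag(mu)
    LG = Lam @ G_Y
    R = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), Lam)
    k_y = gauss_gram(Y, y[None, :], sigma)[:, 0]
    return R @ k_y

# The posterior expectation of f given y is then f_X @ kbr_posterior_weights(...),
# with f_X = (f(X_1), ..., f(X_n)).
```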

3. Theoretical arguments

In this section, we theoretically support the three arguments raised in Section 1. First, we show in Section 3.1 that the posterior kernel mean $\hat{m}_{Q_{X|y}}$ is completely unaffected by the prior distribution $\Pi$ under the condition that $\Lambda$ and $G_Y$ are non-singular. This implies that, at least in some cases, $\Pi$ does not properly affect $\hat{m}_{Q_{X|y}}$. Second, we note in Section 3.2 that the linear operators $C_{XX}$ and $C_{WW}$ are not always surjective, and address the problems associated with the choice of the regularization parameters $\epsilon$ and $\delta$. Third, we demonstrate in Section 3.3 that conditional expectation functions are not generally contained in the RKHS, which means that Theorems 1, 2, and 5–8 in Fukumizu et al. (Citation2013) do not apply in some situations.

3.1. Relations between the posterior $\hat{m}_{Q_{X|y}}$ and the prior $\Pi$

Let us revisit Theorem 3. Assume that $G_Y$ and $\Lambda$ are non-singular matrices. (This assumption is not unreasonable, as shown in Section 6.2.) The matrix $R_{X|Y}=\Lambda G_Y\left((\Lambda G_Y)^2+\delta I\right)^{-1}\Lambda$ tends to $G_Y^{-1}$ as $\delta$ tends to $0$. Furthermore, if we set $\delta=0$ from the beginning, we obtain $R_{X|Y}=G_Y^{-1}$. This implies that the posterior kernel mean $\hat{m}_{Q_{X|y}}=\mathbf{k}_X^{\top}R_{X|Y}\,\mathbf{k}_Y(y)$ does not depend on the prior distribution $\Pi$ on $\mathcal{X}$ at all, which contradicts the nature of Bayes’ rule.
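This observation is easy to check numerically. In the sketch below (ours; the sample, kernel bandwidth, and priors are arbitrary choices), two very different priors produce the same matrix $R_{X|Y}$ once $\delta=0$, and that matrix coincides with $G_Y^{-1}$ up to numerical error.

```python
import numpy as np

sigma, eps, n = 0.3, 1e-2, 10
X = np.linspace(-2.0, 2.0, n).reshape(-1, 1)
Y = X.copy()                                  # hypothetical paired sample

def gram(A, B):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

G_X, G_Y = gram(X, X), gram(Y, Y)

def R_matrix(U, gamma, delta=0.0):
    # R_{X|Y} for the prior kernel mean defined by points U and weights gamma.
    m_pi = gram(X, U) @ gamma
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), m_pi)
    Lam = np.diag(mu)
    LG = Lam @ G_Y
    return LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), Lam)

# Two quite different (hypothetical) priors give the same R_{X|Y} when delta = 0,
# and both coincide with G_Y^{-1}, so the posterior weights ignore the prior.
R1 = R_matrix(np.array([[-2.0], [0.0], [2.0]]), np.array([1/3, 1/3, 1/3]))
R2 = R_matrix(np.array([[-1.0], [1.0]]), np.array([0.7, 0.3]))
print(np.max(np.abs(R1 - R2)))                    # ~ 0 (up to numerical error)
print(np.max(np.abs(R1 - np.linalg.inv(G_Y))))    # ~ 0 (up to numerical error)
```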

Some readers may argue that, even in this case, we should not set $\delta=0$. Then, however, it becomes unclear why and how the regularization parameters are introduced in the kernel Bayes’ rule, since Fukumizu et al. originally used them simply to solve inverse problems, as an analogue of ridge regression (Fukumizu et al., Citation2013, p. 3758). They appear to support the validity of the regularization parameters with Theorems 5, 6, 7, and 8 in Fukumizu et al. (Citation2013); however, these theorems do not apply without the strong assumption that conditional expectation functions are included in the RKHS, as will be discussed in Section 3.2. In addition, since the theorems hold only when $\delta_n$, etc. decay to zero sufficiently slowly, there seems to be no principled way to choose values for the regularization parameters, apart from cross-validation or similar techniques. It is worth mentioning that, in our simple experiments in Section 4.2, we could not obtain a reasonable result with the kernel Bayes’ rule using any combination of values for the regularization parameters.

3.2. The inverse of the operators $C_{XX}$ and $C_{WW}$

As noted by Fukumizu et al. (Citation2013), the linear operators $C_{XX}$ and $C_{WW}$ are not surjective in some common situations; a proof is given in Section 6.3. They therefore proposed an alternative way of obtaining a solution $f\in\mathcal{H}_X$ of the equation $C_{XX}f=m_\Pi$, namely the regularized inversion $f=(C_{XX}+\epsilon I)^{-1}m_\Pi$, as an analogue of ridge regression, where $\epsilon$ is a regularization parameter and $I$ is the identity operator. One disadvantage of this method is that the solution $f=(C_{XX}+\epsilon I)^{-1}m_\Pi$ depends on the choice of $\epsilon$. In Section 4.2, we show numerically that predictions made with the kernel Bayes’ rule depend considerably on the regularization parameters $\epsilon$ and $\delta$. Theorems 5–8 in Fukumizu et al. (Citation2013) seem to support the appropriateness of the regularized inversion. However, these theorems hold under the condition that conditional expectation functions are contained in the RKHS, which does not hold in some cases, as proved in Section 3.3. Furthermore, since these theorems require the regularization constants $\epsilon$ and $\delta$ to decay to zero sufficiently slowly, it is practically difficult to set appropriate values for $\epsilon$ and $\delta$. A cross-validation procedure may be useful for tuning the parameters and may yield good experimental results; however, it lacks a theoretical justification in this setting.

Instead of the regularized inversion method, we can compute generalized inverse matrices of $G_X$ and $\Lambda G_Y$, given a sample $\{(X_i,Y_i)\}_{i=1}^n$. Below, we briefly introduce the notion of a generalized matrix inverse. For more details, see Horn and Johnson (Citation2013).

Definition 5

Let $A$ be a matrix of size $m\times n$ over the complex numbers $\mathbb{C}$. We say that a matrix $A^{\times}$ of size $n\times m$ is a generalized inverse matrix of $A$ if $AA^{\times}A=A$. We also say that a matrix $A^{\dagger}$ of size $n\times m$ is the Moore–Penrose generalized inverse matrix of $A$ if $AA^{\dagger}$ and $A^{\dagger}A$ are Hermitian, $AA^{\dagger}A=A$, and $A^{\dagger}AA^{\dagger}=A^{\dagger}$.

Remark 2

In fact, any matrix $A$ has the Moore–Penrose generalized inverse matrix $A^{\dagger}$. Note that $A^{\dagger}$ is uniquely determined by $A$. If $A$ is square and non-singular, then $A^{\times}=A^{\dagger}=A^{-1}$. For a generalized inverse matrix $A^{\times}$ of size $n\times m$, $AA^{\times}v=v$ for any vector $v\in\mathbb{C}^m$ contained in the image of $A$. In particular, $A^{\times}v$ is then a vector contained in the preimage of $v$ under $A$.
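For concreteness, numpy.linalg.pinv computes the Moore–Penrose generalized inverse (via the singular value decomposition), and the defining properties in Definition 5 and Remark 2 can be checked directly; the matrix below is an arbitrary example of ours.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])        # rank-1, so no ordinary inverse exists
A_dag = np.linalg.pinv(A)              # Moore-Penrose generalized inverse

# Defining properties of the Moore-Penrose inverse:
print(np.allclose(A @ A_dag @ A, A))           # A A^+ A = A
print(np.allclose(A_dag @ A @ A_dag, A_dag))   # A^+ A A^+ = A^+
print(np.allclose((A @ A_dag).T, A @ A_dag))   # A A^+ is Hermitian (symmetric here)
print(np.allclose((A_dag @ A).T, A_dag @ A))   # A^+ A is Hermitian

# For v in the image of A, A A^+ v recovers v:
v = A @ np.array([1.0, -1.0, 0.5])
print(np.allclose(A @ A_dag @ v, v))
```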

In the calculation of $\hat{m}_{Q_{X|y}}=\mathbf{k}_X^{\top}R_{X|Y}\,\mathbf{k}_Y(y)$, we numerically compare the case $R_{X|Y}=(\Lambda G_Y)^{\dagger}\Lambda$ with $\Lambda=\mathrm{diag}(G_X^{\dagger}\hat{m}_\Pi)$ against the original case $R_{X|Y}=\Lambda G_Y\left((\Lambda G_Y)^2+\delta I\right)^{-1}\Lambda$ in Section 4.2.

3.3. Conditional expectation functions and RKHS

In this subsection, we show that conditional expectation functions are in some cases not contained in the RKHS.

Definition 6

For $p\in[1,\infty)$, we define the spaces $L^p(\mathbb{R})$, $L^p(\mathbb{R},\mathbb{C})$, and $L^p(\mathbb{R}^2,\mathbb{R})$ as
\[ L^p(\mathbb{R}):=\Big\{f:\mathbb{R}\to\mathbb{R}\ \Big|\ \int_{-\infty}^{\infty}|f(x)|^p\,dx<\infty\Big\},\qquad L^p(\mathbb{R},\mathbb{C}):=\Big\{f:\mathbb{R}\to\mathbb{C}\ \Big|\ \int_{-\infty}^{\infty}|f(x)|^p\,dx<\infty\Big\}, \]
\[ L^p(\mathbb{R}^2,\mathbb{R}):=\Big\{f:\mathbb{R}^2\to\mathbb{R}\ \Big|\ \int_{\mathbb{R}^2}|f(x_1,x_2)|^p\,dx_1\,dx_2<\infty\Big\}. \]

We also define the $L^p$ norm for $f\in L^p(\mathbb{R})$ or $f\in L^p(\mathbb{R},\mathbb{C})$ as
\[ \|f\|_p:=\left(\int_{-\infty}^{\infty}|f(x)|^p\,dx\right)^{1/p}, \]

and the $L^p$ norm for $f\in L^p(\mathbb{R}^2,\mathbb{R})$ as
\[ \|f\|_p:=\left(\int_{\mathbb{R}^2}|f(x_1,x_2)|^p\,dx_1\,dx_2\right)^{1/p}. \]

Definition 7

For a function $f\in L^1(\mathbb{R},\mathbb{C})\cap L^2(\mathbb{R},\mathbb{C})$, we define its Fourier transform as
\[ \hat{f}(t):=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(x)\exp(-\sqrt{-1}\,tx)\,dx. \]

We can uniquely extend the Fourier transform to an isometry $\hat{\ }:L^2(\mathbb{R},\mathbb{C})\to L^2(\mathbb{R},\mathbb{C})$. We also define the inverse Fourier transform $\check{\ }:L^2(\mathbb{R},\mathbb{C})\to L^2(\mathbb{R},\mathbb{C})$ as the isometry uniquely determined by
\[ \check{f}(t):=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}f(x)\exp(\sqrt{-1}\,tx)\,dx \]

for $f\in L^1(\mathbb{R},\mathbb{C})\cap L^2(\mathbb{R},\mathbb{C})$.

Definition 8

Let us define a Gaussian kernel $k_G$ on $\mathbb{R}$ by
\[ k_G(x,y):=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-y)^2}{2\sigma^2}\right). \]

As described in Fukumizu (Citation2014), the RKHS of real-valued functions and of complex-valued functions corresponding to the positive definite kernel $k_G$ are given by
\[ \mathcal{H}_G:=\Big\{f\in L^2(\mathbb{R})\ \Big|\ \int_{-\infty}^{\infty}\big|\hat{f}(t)\big|^2\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt<\infty\Big\},\qquad \mathcal{H}_G(\mathbb{R},\mathbb{C}):=\Big\{f\in L^2(\mathbb{R},\mathbb{C})\ \Big|\ \int_{-\infty}^{\infty}\big|\hat{f}(t)\big|^2\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt<\infty\Big\}, \]

respectively, and the inner product of $f,g\in\mathcal{H}_G$ or $f,g\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$ on the RKHS is calculated by
(7)   $\langle f,g\rangle=\int_{-\infty}^{\infty}\hat{f}(t)\,\overline{\hat{g}(t)}\,\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt,$

where the overline denotes the complex conjugate. Note that $\mathcal{H}_G$ is a real Hilbert subspace contained in the complex Hilbert space $\mathcal{H}_G(\mathbb{R},\mathbb{C})$.

Fukumizu et al. (Citation2013) mentioned that the conditional expectation function $E[g(Y)|X=\cdot]$ is not always included in $\mathcal{H}_X$. Indeed, if the variables $X$ and $Y$ are independent, then $E[g(Y)|X=\cdot]$ is a constant function on $\mathcal{X}$, whose value may be non-zero. In the case $\mathcal{X}=\mathbb{R}$ and $k_X=k_G$, a constant function with non-zero value is not contained in $\mathcal{H}_X=\mathcal{H}_G$ (indeed, $\mathcal{H}_G\subset L^2(\mathbb{R})$ by the definition above, and a non-zero constant function on $\mathbb{R}$ is not square-integrable).

Additionally, in order to prove Theorems 5 and 8 in Fukumizu et al. (Citation2013), they made the assumption that $E[k_Y(Y,\tilde{Y})\,|\,X=x,\tilde{X}=\tilde{x}]\in\mathcal{H}_X\otimes\mathcal{H}_X$ and $E[k_X(Z,\tilde{Z})\,|\,W=y,\tilde{W}=\tilde{y}]\in\mathcal{H}_Y\otimes\mathcal{H}_Y$, where $(\tilde{X},\tilde{Y})$ and $(\tilde{Z},\tilde{W})$ are independent copies of the random variables $(X,Y)$ and $(Z,W)$ on $\mathcal{X}\times\mathcal{Y}$, respectively. We also see that this assumption does not hold in general. Suppose that $X$ and $Y$ are independent and that so are $\tilde{X}$ and $\tilde{Y}$. Then $E[k_Y(Y,\tilde{Y})\,|\,X=x,\tilde{X}=\tilde{x}]$ is a constant function of $(x,\tilde{x})$, whose value may be non-zero. In the case $\mathcal{X}=\mathbb{R}$ and $k_X=k_G$, the constant function having non-zero value is not contained in $\mathcal{H}_X\otimes\mathcal{H}_X=\mathcal{H}_G\otimes\mathcal{H}_G$. Note that $\mathcal{H}_G\otimes\mathcal{H}_G$ is isomorphic to the RKHS corresponding to the kernel $k((x_1,x_2),(\tilde{x}_1,\tilde{x}_2))=k_G(x_1,\tilde{x}_1)k_G(x_2,\tilde{x}_2)$ on $\mathbb{R}^2$, that is,
\[ \mathcal{H}_G\otimes\mathcal{H}_G=\Big\{f\in L^2(\mathbb{R}^2,\mathbb{R})\ \Big|\ \int_{\mathbb{R}^2}\big|\hat{f}(t_1,t_2)\big|^2\exp\Big(\frac{\sigma^2}{2}(t_1^2+t_2^2)\Big)dt_1\,dt_2<\infty\Big\}, \]

where the Fourier transform of $f:\mathbb{R}^2\to\mathbb{R}$ is defined by
\[ \hat{f}(t_1,t_2):=\lim_{n\to\infty}\frac{1}{2\pi}\int_{x_1^2+x_2^2<n}f(x_1,x_2)\exp\big(-\sqrt{-1}(t_1x_1+t_2x_2)\big)\,dx_1\,dx_2. \]

Thus, the assumption that conditional expectation functions are included in the RKHS does not hold in general. Since most of the theorems in Fukumizu et al. (Citation2013) require this assumption, the kernel Bayes’ rule may not work in several cases.

4. Numerical experiments

In this section, we perform numerical experiments to illustrate the theoretical results of Sections 3.1 and 3.2. In Section 4.1, we introduce three probabilistic classifiers: one based on the conventional Bayes’ rule with Gaussian distributions (BR), one based on the original kernel Bayes’ rule (KBR1), and one based on the kernel Bayes’ rule with Moore–Penrose generalized inverse matrices (KBR2). In Section 4.2, we apply the three classifiers to a binary classification problem with computer-simulated data-sets. The numerical experiments are implemented in version 2.7.6 of the Python software (Python Software Foundation, Wolfeboro Falls, NH, USA).

4.1. Algorithms of the three classifiers, BR, KBR1, and KBR2

Let $(X,Y)$ be a random variable with a distribution $P$ on $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}=\{C_1,\dots,C_g\}$ is a family of classes and $\mathcal{Y}=\mathbb{R}^d$. Let $\Pi$ and $Q$ be the prior and the joint distributions on $\mathcal{X}$ and $\mathcal{X}\times\mathcal{Y}$, respectively. Suppose we have an i.i.d. training sample $\{(X_i,Y_i)\}_{i=1}^n$ from the distribution $P$. The aim of this subsection is to derive the algorithms of the three classifiers, BR, KBR1, and KBR2, which respectively calculate the posterior probability of each class given an observation $y\in\mathcal{Y}$, that is, $Q_{X|y}(C_1),\dots,Q_{X|y}(C_g)$.

4.1.1. The algorithm of BR

In BR, we estimate the posterior probability of the $j$-th class ($j=1,\dots,g$) given a test value $y\in\mathcal{Y}$ by
\[ \hat{Q}_{X|y}(C_j)=\frac{\hat{P}_{Y|C_j}(y)\,\Pi(C_j)}{\sum_{k=1}^g \hat{P}_{Y|C_k}(y)\,\Pi(C_k)}, \]

where $\hat{P}_{Y|C_j}(\cdot)$ is the density function of the $d$-dimensional normal distribution $N(\hat{M}_j,\hat{S}_j)$ defined by
\[ \hat{P}_{Y|C_j}(\cdot)=\frac{1}{\sqrt{(2\pi)^d\,|\hat{S}_j|}}\exp\left(-\frac{1}{2}(\cdot-\hat{M}_j)^{\top}\hat{S}_j^{-1}(\cdot-\hat{M}_j)\right). \]

The mean vector $\hat{M}_j\in\mathbb{R}^d$ and the covariance matrix $\hat{S}_j\in\mathbb{R}^{d\times d}$ are calculated from the training data of class $C_j$.
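A minimal Python sketch of the BR classifier described above (our own illustrative implementation; the variable names and the usage line are placeholders):

```python
import numpy as np

def fit_gaussian(Y_class):
    # Mean vector and covariance matrix of the training data of one class.
    return Y_class.mean(axis=0), np.cov(Y_class, rowvar=False)

def gaussian_density(y, M, S):
    d = len(M)
    diff = y - M
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(S))
    return np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / norm

def br_posterior(y, params, prior):
    # Posterior class probabilities by the conventional Bayes' rule (BR).
    lik = np.array([gaussian_density(y, M, S) for (M, S) in params])
    post = lik * prior
    return post / post.sum()

# Usage sketch: params = [fit_gaussian(Y_train[labels == j]) for j in range(g)]
#               br_posterior(y_test, params, prior=np.array([0.3, 0.7]))
```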

4.1.2. The algorithm of KBR1

Let us define positive definite kernels $k_X$ and $k_Y$ as
\[ k_X(X,X')=\begin{cases}1 & (X=X')\\ 0 & (X\ne X')\end{cases} \qquad\text{and}\qquad k_Y(Y,Y')=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\|Y-Y'\|^2}{2\sigma^2}\right) \]

for $X,X'\in\mathcal{X}$ and $Y,Y'\in\mathcal{Y}$, and denote the corresponding RKHS by $\mathcal{H}_X$ and $\mathcal{H}_Y$, respectively. Here we set $\|Y\|=\sqrt{\sum_{i=1}^d y_i^2}$ for $Y=(y_1,y_2,\dots,y_d)^{\top}\in\mathcal{Y}=\mathbb{R}^d$. Then, the prior kernel mean is given by
\[ \hat{m}_\Pi(\cdot)=\sum_{j=1}^g \Pi(C_j)\,k_X(\cdot,C_j), \]

where $\sum_{j=1}^g \Pi(C_j)=1$. Let us put $G_X=(k_X(X_i,X_j))_{1\le i,j\le n}$, $G_Y=(k_Y(Y_i,Y_j))_{1\le i,j\le n}$, $D=(1_{\{C_i\}}(X_j))_{1\le i\le g,\,1\le j\le n}\in\{0,1\}^{g\times n}$, $\hat{m}_\Pi=(\hat{m}_\Pi(X_1),\dots,\hat{m}_\Pi(X_n))^{\top}$, $\hat{\mu}=(\hat{\mu}_1,\dots,\hat{\mu}_n)^{\top}=(G_X+n\epsilon I_n)^{-1}\hat{m}_\Pi$, $\Lambda=\mathrm{diag}(\hat{\mu})$, $\mathbf{k}_X(\cdot)=(k_X(\cdot,X_1),\dots,k_X(\cdot,X_n))^{\top}$, $\mathbf{k}_Y(\cdot)=(k_Y(\cdot,Y_1),\dots,k_Y(\cdot,Y_n))^{\top}$, and $R_{X|Y}=\Lambda G_Y\left((\Lambda G_Y)^2+\delta I_n\right)^{-1}\Lambda$, where $I_n$ is the identity matrix of size $n$ and $\epsilon,\delta\in\mathbb{R}$ are heuristically chosen regularization parameters. Note that $1_A$ stands for the indicator function of a set $A$, that is, $1_A(t)=1$ if $t\in A$ and $1_A(t)=0$ otherwise.

Following Theorem 3, the posterior kernel mean given a test value $y\in\mathcal{Y}$ is estimated by
\[ \hat{m}_{Q_{X|y}}=\mathbf{k}_X^{\top}R_{X|Y}\,\mathbf{k}_Y(y). \]

Here, we estimate the posterior probabilities of the classes given a test value $y\in\mathcal{Y}$ by
\[ \begin{pmatrix}\hat{Q}_{X|y}(C_1)\\ \vdots\\ \hat{Q}_{X|y}(C_g)\end{pmatrix}=\begin{pmatrix}\langle 1_{\{C_1\}}, \hat{m}_{Q_{X|y}}\rangle_{\mathcal{H}_X}\\ \vdots\\ \langle 1_{\{C_g\}}, \hat{m}_{Q_{X|y}}\rangle_{\mathcal{H}_X}\end{pmatrix}=D\,R_{X|Y}\,\mathbf{k}_Y(y). \]
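A compact Python sketch of the KBR1 computation above (our own illustration; class labels are encoded as integers $0,\dots,g-1$ and all parameter values are placeholders):

```python
import numpy as np

def kbr1_posterior(labels, Y, prior, y, eps, delta, sigma):
    # Posterior class probabilities D R_{X|Y} k_Y(y) for the KBR1 classifier.
    n, g = len(labels), len(prior)
    G_X = (labels[:, None] == labels[None, :]).astype(float)       # delta kernel on classes
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    c = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    G_Y = c * np.exp(-d2 / (2.0 * sigma ** 2))
    D = (np.arange(g)[:, None] == labels[None, :]).astype(float)   # g x n indicator matrix
    m_pi = D.T @ prior                                             # m_hat_Pi(X_i) = Pi(class of X_i)
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), m_pi)
    Lam = np.diag(mu)
    LG = Lam @ G_Y
    R = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), Lam)
    k_y = c * np.exp(-np.sum((Y - y) ** 2, axis=1) / (2.0 * sigma ** 2))
    return D @ R @ k_y
```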

4.1.3. The algorithm of KBR2

Let $G_X^{\dagger}$ denote the Moore–Penrose generalized inverse matrix of $G_X$. Let us put $\hat{\mu}=(\hat{\mu}_1,\dots,\hat{\mu}_n)^{\top}=G_X^{\dagger}\hat{m}_\Pi$, $\Lambda=\mathrm{diag}(\hat{\mu})$, and $R_{X|Y}=(\Lambda G_Y)^{\dagger}\Lambda$. Replacing $R_{X|Y}$ of Section 4.1.2 with this matrix, the posterior probabilities of the classes given a test value $y\in\mathcal{Y}$ are estimated by
\[ \left(\hat{Q}_{X|y}(C_1),\dots,\hat{Q}_{X|y}(C_g)\right)^{\top}=D\,R_{X|Y}\,\mathbf{k}_Y(y). \]
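KBR2 differs from KBR1 only in how the matrices are inverted; a sketch of ours, using numpy.linalg.pinv for the Moore–Penrose generalized inverse:

```python
import numpy as np

def kbr2_posterior(labels, Y, prior, y, sigma):
    # Posterior class probabilities with Moore-Penrose inverses in place of regularized inverses.
    n, g = len(labels), len(prior)
    G_X = (labels[:, None] == labels[None, :]).astype(float)
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    c = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    G_Y = c * np.exp(-d2 / (2.0 * sigma ** 2))
    D = (np.arange(g)[:, None] == labels[None, :]).astype(float)
    mu = np.linalg.pinv(G_X) @ (D.T @ prior)        # mu = G_X^+ m_hat_Pi
    Lam = np.diag(mu)
    R = np.linalg.pinv(Lam @ G_Y) @ Lam             # R = (Lam G_Y)^+ Lam
    k_y = c * np.exp(-np.sum((Y - y) ** 2, axis=1) / (2.0 * sigma ** 2))
    return D @ R @ k_y
```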

4.2. Probabilistic predictions by the three classifiers

Here, we apply the three classifiers defined in Section 4.1 to a binary classification problem using computer-simulated data-sets, where $\mathcal{X}=\{C_1,C_2\}$ and $\mathcal{Y}=\mathbb{R}^2$. In the first step, we independently generate 100 training samples, each of the form $\{(X_i,Y_i)\in\mathcal{X}\times\mathcal{Y}\}_{i=1}^{100}$, where $X_i=C_1$ and $Y_i\sim N(M_1,S_1)$ if $1\le i\le 50$, and $X_i=C_2$ and $Y_i\sim N(M_2,S_2)$ if $51\le i\le 100$, with $M_1=(1,0)^{\top}$, $M_2=(0,1)^{\top}$, and $S_1=S_2=\mathrm{diag}(0.1,0.1)$. Here, $\{Y_i\}_{i=1}^{50}$ and $\{Y_i\}_{i=51}^{100}$ are sampled i.i.d. from $N(M_1,S_1)$ and $N(M_2,S_2)$, respectively. Individual $Y$-values of one of the training samples are plotted in Figure 1.

Figure 1. Individual Y-values of a training sample.
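For reference, the data-generating step described above can be sketched as follows (our own code; the paper's experiments were run in Python 2.7.6, whereas this sketch uses current NumPy, and the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_training_sample():
    # One training sample of size 100: 50 points per class, as described above.
    M1, M2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    S = np.diag([0.1, 0.1])
    Y1 = rng.multivariate_normal(M1, S, size=50)
    Y2 = rng.multivariate_normal(M2, S, size=50)
    labels = np.array([0] * 50 + [1] * 50)          # 0 = C1, 1 = C2
    return labels, np.vstack([Y1, Y2])

samples = [make_training_sample() for _ in range(100)]
```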

With each of the 100 training samples and a simulated prior probability of $C_1$, $\Pi(C_1)\in\{0.1,0.2,\dots,0.9\}$, the classifiers defined in Section 4.1 estimate the posterior probability of $C_1$ given a test value $y\in\{(0.5,0.5),(0.6,0.4),(0.7,0.3)\}$, that is, $Q_{X|y}(C_1)$. Figures 2–5 show the mean (plus or minus the standard error of the mean, SEM) of the 100 values of $\hat{Q}_{X|y}(C_1)$ calculated by each of the classifiers, BR, KBR1, and KBR2. Here we show the case where $\sigma$ in KBR1 and KBR2 is fixed to 0.1, and the regularization parameters of KBR1 are set to $\epsilon=\delta=10^{-7}$ (Figure 2), $\epsilon=\delta=10^{-5}$ (Figure 3), $\epsilon=\delta=10^{-3}$ (Figure 4), and $\epsilon=\delta=10^{-1}$ (Figure 5). In Figures 2–5, BR_th represents the theoretical value of BR, which coincides with BR if the parameters $\hat{M}_1$, $\hat{M}_2$, $\hat{S}_1$, and $\hat{S}_2$ are set to $M_1$, $M_2$, $S_1$, and $S_2$, respectively.

Figure 2. The case $\epsilon=\delta=10^{-7}$.

Consistent with Section 3.1, $\hat{Q}_{X|y}(C_1)$ calculated by KBR1 is only weakly influenced by $\Pi(C_1)$ compared with that calculated by BR when $\epsilon$ and $\delta$ are set small (see Figures 2 and 3). In addition, $\hat{Q}_{X|y}(C_1)$ calculated by KBR2 also appears to be uninfluenced by $\Pi(C_1)$. When $\epsilon$ and $\delta$ are set larger, the effect of $\Pi(C_1)$ on $\hat{Q}_{X|y}(C_1)$ becomes apparent in KBR1; however, the value of $\hat{Q}_{X|y}(C_1)$ becomes too small (see Figures 4 and 5). These results suggest that, in the kernel Bayes’ rule, the posterior does not depend on the prior when $\epsilon$ and $\delta$ are negligible, which contradicts the nature of Bayes’ theorem. Moreover, even though the prior affects the posterior when $\epsilon$ and $\delta$ become larger, the posterior then depends too strongly on $\epsilon$ and $\delta$, which were initially introduced only for the regularization of matrices.

We have also tested all possible combinations of the following values for the parameters in KBR1 and/or KBR2: $\epsilon\in\{10^{-1},10^{-3},10^{-5},10^{-7},10^{-9},10^{-11},10^{-13},10^{-15}\}$, $\delta\in\{10^{-1},10^{-3},10^{-5},10^{-7},10^{-9},10^{-11},10^{-13},10^{-15}\}$, and $\sigma\in\{0.01,0.1,1,10,100\}$. All the experimental results have been evaluated in the same manner as above, and none of them are found to be reasonable in the context of Bayesian inference (see Supplementary material).

Figure 3. The case $\epsilon=\delta=10^{-5}$.

Figure 4. The case $\epsilon=\delta=10^{-3}$.

Figure 5. The case $\epsilon=\delta=10^{-1}$.

5. Conclusions

One of the important features of Bayesian inference is that it provides a reasonable way of updating the probability of a hypothesis as additional evidence is acquired. The kernel Bayes’ rule has been expected to enable Bayesian inference in RKHS; in other words, the posterior kernel mean has been considered to be reasonably estimable by the kernel Bayes’ rule, given kernel mean expressions of the prior and likelihood. What counts as “reasonable” depends on circumstances; however, some of the results in this paper appear to reveal clearly unreasonable aspects of the kernel Bayes’ rule, at least in the context of Bayesian inference.

First, as shown in Section 3.1, when $\Lambda$ and $G_Y$ are non-singular matrices and we therefore set $\delta=0$, the posterior kernel mean $\hat{m}_{Q_{X|y}}$ is entirely unaffected by the prior distribution $\Pi$ on $\mathcal{X}$. This means that, in Bayesian inference with the kernel Bayes’ rule, prior beliefs are in some cases completely ignored when calculating the kernel mean of the posterior distribution. Numerical evidence is also presented in Section 4.2: when the regularization parameters $\epsilon$ and $\delta$ are set small, the posterior probability calculated by the kernel Bayes’ rule (KBR1) is almost unaffected by the prior probability, in contrast to that calculated by the conventional Bayes’ rule (BR). Consistently, when the regularized inverse matrices in KBR1 are replaced by Moore–Penrose generalized inverse matrices (KBR2), the posterior probability is likewise uninfluenced by the prior probability, which seems unsuitable in the context of Bayesian updating of a probability distribution.

Second, as discussed in Sections 3.2 and 4.2, the posterior estimated by the kernel Bayes’ rule depends considerably on the regularization parameters $\epsilon$ and $\delta$, which were originally introduced only for the regularization of matrices. A cross-validation approach is proposed in Fukumizu et al. (Citation2013) to search for optimal parameter values. However, the theoretical foundations seem insufficient to guarantee a correct tuning of the parameters. Furthermore, in our experimental settings, we were not able to obtain a reasonable result using any combination of the parameter values, suggesting the possibility that no appropriate values for the parameters exist in general. Thus, we consider it difficult to resolve the non-surjectivity of $C_{XX}$ and $C_{WW}$ merely by adding regularization parameters.

Third, as shown in Section 3.3, the assumption that conditional expectation functions are included in the RKHS does not hold in general. Since this assumption is necessary for most of the theorems in Fukumizu et al. (Citation2013), we believe that the assumption itself may need to be reconsidered.

In summary, even though current research efforts are focused on the application of the kernel Bayes’ rule (Fukumizu et al., Citation2013; Kanagawa et al., Citation2014), it might be necessary to reexamine its basic framework of combining new evidence with prior beliefs.

6. Proofs

In this section, we provide some proofs for Sections 2 and 3.

6.1. Estimation of $C_{ZW}$ and $C_{WW}$

Here we give the proof of Proposition 1.

Proof

Let $\hat{C}_{XX}$, $\hat{C}_{(YX)X}$, and $\hat{C}_{(YY)X}$ denote the estimates of $C_{XX}$, $C_{(YX)X}$, and $C_{(YY)X}$, respectively. We define the estimates of $m_{(ZW)}$ and $m_{(WW)}$ as
\[ \hat{m}_{(ZW)}=\hat{C}_{(YX)X}\hat{C}_{XX}^{-1}\hat{m}_\Pi \quad\text{and}\quad \hat{m}_{(WW)}=\hat{C}_{(YY)X}\hat{C}_{XX}^{-1}\hat{m}_\Pi, \]

and put $h=\hat{C}_{XX}^{-1}\hat{m}_\Pi\in\mathcal{H}_X$. According to Equation (5), for any $f\in\mathcal{H}_X$ and $g\in\mathcal{H}_Y$,
\[ \langle \hat{m}_{(ZW)}, g\otimes f\rangle_{\mathcal{H}_Y\otimes\mathcal{H}_X}=\langle \hat{C}_{(YX)X}h, g\otimes f\rangle_{\mathcal{H}_Y\otimes\mathcal{H}_X}=\hat{E}[f(X)g(Y)h(X)]=\frac{1}{n}\sum_{i=1}^n f(X_i)g(Y_i)h(X_i)=\Big\langle \frac{1}{n}\sum_{i=1}^n h(X_i)\,k_X(\cdot,X_i)\otimes k_Y(\cdot,Y_i),\, f\otimes g\Big\rangle_{\mathcal{H}_X\otimes\mathcal{H}_Y}, \]

where $\hat{E}[\cdot]$ represents the empirical expectation operator. Thus, from Remark 1,
(8)   $\hat{C}_{ZW}=\hat{m}_{(ZW)}=\frac{1}{n}\sum_{i=1}^n h(X_i)\,k_X(\cdot,X_i)\otimes k_Y(\cdot,Y_i).$

Similarly, for any $g_1,g_2\in\mathcal{H}_Y$,
\[ \langle \hat{m}_{(WW)}, g_1\otimes g_2\rangle_{\mathcal{H}_Y\otimes\mathcal{H}_Y}=\langle \hat{C}_{(YY)X}h, g_1\otimes g_2\rangle_{\mathcal{H}_Y\otimes\mathcal{H}_Y}=\hat{E}[g_1(Y)g_2(Y)h(X)]=\frac{1}{n}\sum_{i=1}^n g_1(Y_i)g_2(Y_i)h(X_i)=\Big\langle \frac{1}{n}\sum_{i=1}^n h(X_i)\,k_Y(\cdot,Y_i)\otimes k_Y(\cdot,Y_i),\, g_1\otimes g_2\Big\rangle_{\mathcal{H}_Y\otimes\mathcal{H}_Y}. \]

Thus, from Remark 1,
(9)   $\hat{C}_{WW}=\hat{m}_{(WW)}=\frac{1}{n}\sum_{i=1}^n h(X_i)\,k_Y(\cdot,Y_i)\otimes k_Y(\cdot,Y_i).$

Next, we will derive $h(X_1),\dots,h(X_n)$. Since $C_{XX}$ is a self-adjoint operator,
\[ \langle h, \hat{C}_{XX}f\rangle_{\mathcal{H}_X}=\langle \hat{C}_{XX}h, f\rangle_{\mathcal{H}_X}=\langle \hat{m}_\Pi, f\rangle_{\mathcal{H}_X}=\sum_{j=1}^l \gamma_j f(U_j) \]

for any $f\in\mathcal{H}_X$. On the other hand, from Equation (1),
\[ \langle h, \hat{C}_{XX}f\rangle_{\mathcal{H}_X}=\hat{E}[f(X)h(X)]=\frac{1}{n}\sum_{i=1}^n f(X_i)h(X_i) \]

for any $f\in\mathcal{H}_X$. Hence, we have
(10)   $\sum_{j=1}^l \gamma_j f(U_j)=\frac{1}{n}\sum_{i=1}^n f(X_i)h(X_i)$

for any $f\in\mathcal{H}_X$. Substituting $k_X(X_1,\cdot),\dots,k_X(X_n,\cdot)\in\mathcal{H}_X$ for $f$ in Equation (10), we obtain
(11)   $\begin{pmatrix} k_X(X_1,U_1) & \cdots & k_X(X_1,U_l)\\ \vdots & & \vdots\\ k_X(X_n,U_1) & \cdots & k_X(X_n,U_l) \end{pmatrix} \begin{pmatrix}\gamma_1\\ \vdots\\ \gamma_l\end{pmatrix} = \frac{1}{n}\,G_X \begin{pmatrix}h(X_1)\\ \vdots\\ h(X_n)\end{pmatrix}.$

Using Equation (6), the left-hand side of Equation (11) is given by
\[ \begin{pmatrix} \big\langle \sum_{j=1}^l\gamma_j k_X(\cdot,U_j),\, k_X(\cdot,X_1)\big\rangle_{\mathcal{H}_X}\\ \vdots\\ \big\langle \sum_{j=1}^l\gamma_j k_X(\cdot,U_j),\, k_X(\cdot,X_n)\big\rangle_{\mathcal{H}_X} \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^l\gamma_j k_X(X_1,U_j)\\ \vdots\\ \sum_{j=1}^l\gamma_j k_X(X_n,U_j) \end{pmatrix} = \begin{pmatrix}\hat{m}_\Pi(X_1)\\ \vdots\\ \hat{m}_\Pi(X_n)\end{pmatrix}. \]

Therefore, we have
\[ \frac{1}{n}\begin{pmatrix}h(X_1)\\ \vdots\\ h(X_n)\end{pmatrix} = G_X^{-1}\begin{pmatrix}\hat{m}_\Pi(X_1)\\ \vdots\\ \hat{m}_\Pi(X_n)\end{pmatrix} \approx \left(G_X+n\epsilon I\right)^{-1}\hat{m}_\Pi=\hat{\mu}. \]

Replacing $\frac{1}{n}(h(X_1),\dots,h(X_n))^{\top}$ with $\hat{\mu}=(\hat{\mu}_1,\dots,\hat{\mu}_n)^{\top}$, Equations (8) and (9) become
\[ \hat{C}_{ZW}=\sum_{i=1}^n \hat{\mu}_i\,k_X(\cdot,X_i)\otimes k_Y(\cdot,Y_i) \quad\text{and}\quad \hat{C}_{WW}=\sum_{i=1}^n \hat{\mu}_i\,k_Y(\cdot,Y_i)\otimes k_Y(\cdot,Y_i), \]

respectively.

6.2. Non-singularity of $G_Y$ and $\Lambda$

Here we show that the assumption in Section 3.1 holds under reasonable conditions.

Definition 9

Let $f$ be a real-valued function defined on a non-empty open domain $\mathrm{Dom}(f)\subset\mathbb{R}^d$. We say that $f$ is analytic if $f$ can be described by a Taylor expansion on a neighborhood of each point of $\mathrm{Dom}(f)$.

Proposition 2

Let $k$ be a positive definite kernel on $\mathbb{R}^d$. Let $\nu$ be a probability measure on $\mathbb{R}^d$ which is absolutely continuous with respect to the Lebesgue measure. Assume that $k$ is an analytic function on $\mathbb{R}^d\times\mathbb{R}^d$ and that the RKHS corresponding to $k$ is infinite-dimensional. Then for any i.i.d. random variables $X_1,X_2,\dots,X_n$ with common distribution $\nu$, the Gram matrix $G_X=(k(X_i,X_j))_{1\le i,j\le n}$ is non-singular almost surely with respect to $\nu^n=\nu\times\nu\times\cdots\times\nu$ ($n$ times).

Proof

Let us put $f(x_1,x_2,\dots,x_n):=\det\big(k(x_i,x_j)\big)_{1\le i,j\le n}$. Since the RKHS corresponding to $k$ is infinite-dimensional, there exist $\xi_1,\xi_2,\dots,\xi_n\in\mathbb{R}^d$ such that $\{k(\cdot,\xi_i)\}_{1\le i\le n}$ are linearly independent. Then $f(\xi_1,\xi_2,\dots,\xi_n)\ne 0$, and hence $f$ is a non-zero analytic function. Note that any non-trivial subvariety of Euclidean space defined by an analytic function has Lebesgue measure zero. By this fact, the subvariety
\[ V(f):=\big\{(x_1,x_2,\dots,x_n)\in(\mathbb{R}^d)^n \;\big|\; f(x_1,x_2,\dots,x_n)=0\big\}\subset(\mathbb{R}^d)^n \]

has Lebesgue measure zero. Since $\nu$ is absolutely continuous, $\nu^n(V(f))=0$. This completes the proof.

From Proposition 2, we easily obtain the following corollary.

Corollary 1

Let $k$ be a Gaussian kernel on $\mathbb{R}^d$ and let $X_1,X_2,\dots,X_n$ be i.i.d. random variables with the same normal distribution on $\mathbb{R}^d$. Then the Gram matrix $G_X=(k(X_i,X_j))_{1\le i,j\le n}$ is non-singular almost surely.
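A quick numerical illustration of Corollary 1 (our own sketch): the Gaussian Gram matrix of a randomly drawn sample is non-singular, although its smallest eigenvalue can be very close to zero, which is the numerical side of the conditioning issues discussed in Section 3.2.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 20, 2, 1.0
X = rng.normal(size=(n, d))                                   # i.i.d. normal sample
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G = np.exp(-d2 / (2.0 * sigma ** 2))                          # Gaussian Gram matrix

eigvals = np.linalg.eigvalsh(G)
print(eigvals.min() > 0)       # non-singular (positive in exact arithmetic, typically also here)
print(np.linalg.cond(G))       # but the condition number can be very large
```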

Proposition 3

Let $k$ be a positive definite kernel on $\mathcal{X}=\mathbb{R}^d$ and let $\nu$ be a probability measure on $\mathcal{X}$ which is absolutely continuous with respect to the Lebesgue measure. Assume that $k$ is an analytic function on $\mathcal{X}\times\mathcal{X}$ and that the RKHS $\mathcal{H}$ corresponding to $k$ is infinite-dimensional. Then for any $(\epsilon,\gamma_1,\gamma_2,\dots,\gamma_l,U_1,U_2,\dots,U_l)\in\mathbb{R}_+\times\mathbb{R}^l\times(\mathbb{R}^d)^l$ outside a set of Lebesgue measure zero, and for any i.i.d. random variables $X_1,X_2,\dots,X_n$ with common distribution $\nu$, each $\mu_i$ for $i=1,2,\dots,n$ is (defined almost surely and) non-zero almost surely, where $(\mu_1,\mu_2,\dots,\mu_n)^{\top}=(G_X+n\epsilon I_n)^{-1}\hat{m}_\Pi$, $\hat{m}_\Pi=(\hat{m}_\Pi(X_1),\hat{m}_\Pi(X_2),\dots,\hat{m}_\Pi(X_n))^{\top}$, and $\hat{m}_\Pi(\cdot)=\sum_{j=1}^l \gamma_j k(\cdot,U_j)$. Here $\mathbb{R}_+$ denotes the set of positive real numbers.

Proof

Let us put $S:=\mathbb{R}_+\times\mathbb{R}^l\times(\mathbb{R}^d)^l$, $T:=\mathcal{X}^n\times S$, and
\[ f_i(x_1,x_2,\dots,x_n,\epsilon,\gamma_1,\gamma_2,\dots,\gamma_l,U_1,U_2,\dots,U_l):=\mu_i \qquad (i=1,2,\dots,n) \]

for $(x_1,x_2,\dots,x_n)\in\mathcal{X}^n$ and $(\epsilon,\gamma_1,\gamma_2,\dots,\gamma_l,U_1,U_2,\dots,U_l)\in S$. Since the RKHS corresponding to $k$ is infinite-dimensional, we can obtain $\xi_1,\xi_2,\dots,\xi_n\in\mathcal{X}$ such that $\{k(\cdot,\xi_i)\}_{1\le i\le n}$ are linearly independent. The Gram matrix $(k(\xi_i,\xi_j))_{1\le i,j\le n}=(\langle k(\cdot,\xi_i),k(\cdot,\xi_j)\rangle_{\mathcal{H}})_{1\le i,j\le n}$ is positive definite, and its eigenvalues are all positive. Hence $\det\big((k(\xi_i,\xi_j))_{1\le i,j\le n}+n\epsilon I_n\big)>0$ for each $\epsilon\in\mathbb{R}_+$, and $\det(G_X+n\epsilon I_n)=\det\big((k(x_i,x_j))_{1\le i,j\le n}+n\epsilon I_n\big)$ is a non-zero analytic function on $\mathcal{X}^n$ for each $\epsilon\in\mathbb{R}_+$.

For $(\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l)\in S$, let us define the closed measure-zero set $V(\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l):=\{(x_1,x_2,\dots,x_n)\in\mathcal{X}^n \mid \det(G_X+n\epsilon I_n)=0\}\subset\mathcal{X}^n$. Then $f_i(\cdot,\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l)$ is defined on $\mathcal{X}^n\setminus V(\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l)$ for each $i\in\{1,2,\dots,n\}$. Using Cramer's rule,
\[ \mu_i=\frac{\det\big(\eta_1,\eta_2,\dots,\eta_{i-1},\hat{m}_\Pi,\eta_{i+1},\dots,\eta_n\big)}{\det\big(G_X+n\epsilon I_n\big)}, \]

where $\eta_m$ stands for the $m$-th column vector of $G_X+n\epsilon I_n$. Here we denote by $g_i$ the numerator of $\mu_i$, that is, $g_i=\mu_i\det(G_X+n\epsilon I_n)$.

It is easy to see that $g_i(\xi_1,\xi_2,\dots,\xi_n,\cdot)$ is a non-zero analytic function on $S$. Indeed, if $\epsilon\to+0$, $U_1=\xi_i$, $\gamma_1=1$, and $\gamma_2=\gamma_3=\cdots=\gamma_l=0$, then $g_i\to\det\big((\langle k(\cdot,\xi_i),k(\cdot,\xi_j)\rangle_{\mathcal{H}})_{1\le i,j\le n}\big)\ne 0$. Hence $Z_i:=\{s\in S \mid g_i(\xi_1,\xi_2,\dots,\xi_n,s)=0\}$ is a closed subset of $S$ with Lebesgue measure zero for each $i\in\{1,2,\dots,n\}$. For any $(\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l)\in S\setminus\big(\bigcup_{i=1}^n Z_i\big)$,
\[ F_i(\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l):=\big\{x\in\mathcal{X}^n \;\big|\; g_i(x,\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l)=0\big\} \]

is a closed subset of $\mathcal{X}^n$ with Lebesgue measure zero for each $i\in\{1,2,\dots,n\}$, since $g_i(\cdot,\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l)$ is a non-zero analytic function on $\mathcal{X}^n$. Therefore, $\mu_i=f_i(\cdot,\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l)$ is defined and non-zero for $i=1,2,\dots,n$ and on $\mathcal{X}^n\setminus\Big(V(\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l)\cup\big(\bigcup_{j=1}^n F_j(\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l)\big)\Big)$ whenever $(\epsilon,\gamma_1,\dots,\gamma_l,U_1,\dots,U_l)\in S\setminus\big(\bigcup_{i=1}^n Z_i\big)$. This completes the proof.

The following corollary directly follows from Proposition 3.

Corollary 2

Let $k$ be a Gaussian kernel on $\mathbb{R}^d$ and let $X_1,X_2,\dots,X_n$ be i.i.d. random variables with the same normal distribution on $\mathbb{R}^d$. All other notations are as in Proposition 3. Then $\Lambda:=\mathrm{diag}(\mu_1,\mu_2,\dots,\mu_n)$ is non-singular almost surely for any $(\epsilon,\gamma_1,\gamma_2,\dots,\gamma_l,U_1,U_2,\dots,U_l)\in\mathbb{R}_+\times\mathbb{R}^l\times(\mathbb{R}^d)^l$ except for those in a set of Lebesgue measure zero.

6.3. Non-surjectivity of $C_{XX}$ and $C_{WW}$

The covariance operators $C_{XX}$ and $C_{WW}$ are not surjective in general. This can be verified by the fact that they are compact operators. (If the operators were surjective on the corresponding RKHS, which is infinite-dimensional, then they could not be compact, by the open mapping theorem.) Here we present some simple examples in which $C_{XX}$ and $C_{WW}$ are not surjective. For simplicity, let us consider the case $\mathcal{X}=\mathbb{R}$. Let $X$ be a random variable on $\mathbb{R}$ with a normal distribution $N(\mu,\sigma_0^2)$. We prove that $C_{XX}$ is not surjective under the usual assumption that the positive definite kernel on $\mathbb{R}$ is Gaussian. In order to demonstrate this, we use the symbols defined in Section 3.3 and several standard results on function spaces and Fourier transforms (see, e.g., Rudin, Citation1987). Note that the following three propositions are introduced without proofs.

Proposition 4

Let us put $f(x)=\exp\big(-(ax^2+bx+c)\big)$ for $a,b,c\in\mathbb{R}$, where $a>0$. Then
\[ \hat{f}(t)=\frac{1}{\sqrt{2a}}\exp\left(-\frac{t^2-2\sqrt{-1}\,bt-b^2+4ac}{4a}\right). \]
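For example (a special case we add here because it is exactly the transform used in the proof of Proposition 11 below), taking $a=1/(2\sigma^2)$ and $b=c=0$ gives
\[ f(x)=\exp\left(-\frac{x^2}{2\sigma^2}\right) \quad\Longrightarrow\quad \hat{f}(t)=\frac{1}{\sqrt{2a}}\exp\left(-\frac{t^2}{4a}\right)=\sigma\exp\left(-\frac{\sigma^2}{2}t^2\right). \]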

Proposition 5

For $f\in L^2(\mathbb{R},\mathbb{C})$, $\widehat{\bar{f}}(t)=\overline{\hat{f}(-t)}$ almost everywhere. In particular, if $f\in L^2(\mathbb{R})$, then $\overline{\hat{f}(t)}=\hat{f}(-t)$ almost everywhere.

Proposition 6

For $f\in L^2(\mathbb{R},\mathbb{C})$, put $f_a(x):=f(x-a)$. Then $\hat{f}_a(t)=\exp(-\sqrt{-1}\,at)\,\hat{f}(t)$.

Definition 10

Let $p(\cdot)$ denote the density function of the normal distribution $N(\mu,\sigma_0^2)$ on $\mathbb{R}$, that is,
\[ p(\cdot)=\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left(-\frac{(\cdot-\mu)^2}{2\sigma_0^2}\right). \]

Let $X$ be a random variable on $\mathbb{R}$ with distribution $N(\mu,\sigma_0^2)$. The linear operator $C_{XX}:\mathcal{H}_G\to\mathcal{H}_G$ is defined by $\langle C_{XX}f, g\rangle_{\mathcal{H}_G}=E[f(X)g(X)]$ for any $f,g\in\mathcal{H}_G$, which can also be written as
\[ (C_{XX}f)(\cdot)=\int_{-\infty}^{\infty}f(x)\,k_G(\cdot,x)\,p(x)\,dx \]

for any $f\in\mathcal{H}_G$.

Proposition 7

If $f,g\in\mathcal{H}_G$, then $\langle f,g\rangle_{\mathcal{H}_G}\in\mathbb{R}$.

Proof

From Proposition 5, $\overline{\hat{f}(t)}=\hat{f}(-t)$ and $\overline{\hat{g}(t)}=\hat{g}(-t)$ for any $f,g\in\mathcal{H}_G$. Then, using Equation (7), we have
\[ \overline{\langle f,g\rangle_{\mathcal{H}_G}}=\overline{\int_{-\infty}^{\infty}\hat{f}(t)\,\overline{\hat{g}(t)}\,\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt}=\int_{-\infty}^{\infty}\overline{\hat{f}(t)}\,\hat{g}(t)\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\int_{-\infty}^{\infty}\hat{f}(-t)\,\hat{g}(t)\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\int_{-\infty}^{\infty}\hat{f}(t)\,\hat{g}(-t)\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\int_{-\infty}^{\infty}\hat{f}(t)\,\overline{\hat{g}(t)}\,\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\langle f,g\rangle_{\mathcal{H}_G}. \]

Therefore, $\langle f,g\rangle_{\mathcal{H}_G}\in\mathbb{R}$.

Proposition 8

If $f\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$, then $\bar{f}\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$.

Proof

From Proposition 5, $\widehat{\bar{f}}(t)=\overline{\hat{f}(-t)}$ for $f\in L^2(\mathbb{R},\mathbb{C})$. Then, using Equation (7), we have
\[ \|\bar{f}\|_{\mathcal{H}_G(\mathbb{R},\mathbb{C})}^2=\langle\bar{f},\bar{f}\rangle_{\mathcal{H}_G(\mathbb{R},\mathbb{C})}=\int_{-\infty}^{\infty}\big|\widehat{\bar{f}}(t)\big|^2\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\int_{-\infty}^{\infty}\big|\overline{\hat{f}(-t)}\big|^2\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\int_{-\infty}^{\infty}\big|\hat{f}(t)\big|^2\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\|f\|_{\mathcal{H}_G(\mathbb{R},\mathbb{C})}^2<\infty. \]

Therefore, $\bar{f}\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$.

Here, we denote by Re and Im the real part and the imaginary part of a complex number, respectively. We also denote by Cl(A) the closure of a subset A in a topological space.

Corollary 3

If $f\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$, then $\mathrm{Re}(f),\mathrm{Im}(f)\in\mathcal{H}_G$.

Proof

If $f\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$, then $\bar{f}\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$ by Proposition 8. Hence we see that
\[ \mathrm{Re}(f)=\frac{f+\bar{f}}{2}\in\mathcal{H}_G,\qquad \mathrm{Im}(f)=\frac{f-\bar{f}}{2\sqrt{-1}}\in\mathcal{H}_G. \]

This completes the proof.

Remark 3

If $f\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$, then by Corollary 3 there uniquely exist $f_1,f_2\in\mathcal{H}_G$ such that $f=f_1+\sqrt{-1}f_2$. This means that $\mathcal{H}_G(\mathbb{R},\mathbb{C})=\mathcal{H}_G\oplus\sqrt{-1}\,\mathcal{H}_G$, where $\oplus$ denotes the direct sum.

Proposition 9

For any $f\in L^2(\mathbb{R},\mathbb{C})$ and for any $\epsilon>0$, there exists $g\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$ such that $\|f-g\|_2<\epsilon$. In other words, $\mathcal{H}_G(\mathbb{R},\mathbb{C})$ is dense in $L^2(\mathbb{R},\mathbb{C})$.

Proof

Let $C_0(\mathbb{R},\mathbb{C})$ denote the space of continuous complex-valued functions with compact support on $\mathbb{R}$. Let us define $\hat{\mathcal{H}}_G(\mathbb{R},\mathbb{C})$ by
\[ \hat{\mathcal{H}}_G(\mathbb{R},\mathbb{C}):=\Big\{h\in L^2(\mathbb{R},\mathbb{C})\ \Big|\ \int_{-\infty}^{\infty}|h(t)|^2\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt<\infty\Big\}. \]

Note that $\hat{\mathcal{H}}_G(\mathbb{R},\mathbb{C})$ coincides with the image of $\mathcal{H}_G(\mathbb{R},\mathbb{C})$ under the Fourier transform. Then, $C_0(\mathbb{R},\mathbb{C})\subset\hat{\mathcal{H}}_G(\mathbb{R},\mathbb{C})\subset L^2(\mathbb{R},\mathbb{C})$ and $\mathrm{Cl}(C_0(\mathbb{R},\mathbb{C}))=L^2(\mathbb{R},\mathbb{C})$. Hence $\mathrm{Cl}(\hat{\mathcal{H}}_G(\mathbb{R},\mathbb{C}))=L^2(\mathbb{R},\mathbb{C})$. In other words, for any $f\in L^2(\mathbb{R},\mathbb{C})$ and for any $\epsilon>0$, there exists $\hat{g}\in\hat{\mathcal{H}}_G(\mathbb{R},\mathbb{C})$ such that $\|\hat{f}-\hat{g}\|_2<\epsilon$ because $\hat{f}\in L^2(\mathbb{R},\mathbb{C})$, which implies that there exists $g\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$ such that $\|f-g\|_2<\epsilon$. This completes the proof.

The following corollary has also been shown in Theorem 4.63 in Steinwart and Christmann (Citation2008).

Corollary 4

$\mathrm{Cl}(\mathcal{H}_G)=L^2(\mathbb{R})$.

Proof

From Proposition 9, for any $f\in L^2(\mathbb{R})\subset L^2(\mathbb{R},\mathbb{C})$ and for any $\epsilon>0$, there exists $g\in\mathcal{H}_G(\mathbb{R},\mathbb{C})$ such that $\|f-g\|_2<\epsilon$. By Remark 3, there exist $g_1,g_2\in\mathcal{H}_G$ such that $g=g_1+\sqrt{-1}g_2$. Thus,
\[ \epsilon^2>\|f-g\|_2^2=\int_{-\infty}^{\infty}|f-g|^2\,dx=\int_{-\infty}^{\infty}\big|(f-g_1)-\sqrt{-1}g_2\big|^2\,dx\ge\int_{-\infty}^{\infty}|f-g_1|^2\,dx=\|f-g_1\|_2^2. \]

Therefore, $\|f-g_1\|_2<\epsilon$. This completes the proof.

Definition 11

Let us define $r,r_n\in L^2(\mathbb{R})$ as
\[ r(t):=\frac{1}{|t|}\,1_{(1,\infty)}(|t|),\qquad r_n(t):=\frac{1}{|t|}\,1_{(1,n)}(|t|), \]

where $1_{(1,\infty)}$ and $1_{(1,n)}$ denote the indicator functions of the intervals $(1,\infty)$ and $(1,n)$, respectively. We also put $h_n:=\check{r}_n$ and $h:=\check{r}$. Note that $\lim_{n\to\infty}r_n=r$ in $L^2(\mathbb{R})$, because
\[ \lim_{n\to\infty}\|r_n-r\|_2^2=2\lim_{n\to\infty}\int_n^{\infty}\frac{1}{x^2}\,dx=0. \]

Proposition 10

$h_n,h\in L^2(\mathbb{R})$.

Proof

It is obvious that $h_n,h\in L^2(\mathbb{R},\mathbb{C})$. Since $r_n\in L^1(\mathbb{R})\cap L^2(\mathbb{R})$, we see that
\[ \overline{h_n(x)}=\overline{\check{r}_n(x)}=\overline{\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}r_n(t)\exp(\sqrt{-1}\,tx)\,dt}=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}r_n(t)\exp(-\sqrt{-1}\,tx)\,dt=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}r_n(-t')\exp(\sqrt{-1}\,t'x)\,dt'=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}r_n(t')\exp(\sqrt{-1}\,t'x)\,dt'=\check{r}_n(x)=h_n(x), \]

where $t'=-t$. Hence, $h_n\in L^2(\mathbb{R})$. On the other hand,
\[ h(x)=\lim_{n\to\infty}\frac{1}{\sqrt{2\pi}}\int_{-n}^{n}r(t)\exp(\sqrt{-1}\,tx)\,dt=\lim_{n\to\infty}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}r_n(t)\exp(\sqrt{-1}\,tx)\,dt=\lim_{n\to\infty}h_n(x). \]

Therefore $h\in L^2(\mathbb{R})$.

Let us define $k_a(\cdot):=\sqrt{2\pi}\,\sigma\,k_G(\cdot,a)=\exp\left(-\frac{(\cdot-a)^2}{2\sigma^2}\right)\in\mathcal{H}_G$ for $a\in\mathbb{R}$. Now, we prove that $k_a\notin\mathrm{Ran}(C_{XX})$ for any $a\in\mathbb{R}$. This implies that $C_{XX}$ is not surjective.

Proposition 11

For any $a\in\mathbb{R}$, $k_a\in\mathcal{H}_G\setminus\mathrm{Ran}(C_{XX})$.

Proof

Suppose that there exists $g\in\mathcal{H}_G$ such that $C_{XX}g=k_a$. Then, for any $f\in\mathcal{H}_G$,
(12)   $\langle k_a, f\rangle_{\mathcal{H}_G}=\langle C_{XX}g, f\rangle_{\mathcal{H}_G}.$

Let us put $k(\cdot)=\sqrt{2\pi}\,\sigma\,k_G(\cdot,0)=\exp\left(-\frac{(\cdot-0)^2}{2\sigma^2}\right)$. From Proposition 4, $\hat{k}(t)=\sigma\exp\left(-\frac{\sigma^2}{2}t^2\right)$. Then, using Equation (7) and Proposition 6, the left-hand side of Equation (12) equals
\[ \int_{-\infty}^{\infty}\hat{k}_a(t)\,\overline{\hat{f}(t)}\,\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\int_{-\infty}^{\infty}\exp(-\sqrt{-1}\,at)\,\hat{k}(t)\,\overline{\hat{f}(t)}\,\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\sigma\int_{-\infty}^{\infty}\exp(-\sqrt{-1}\,at)\,\overline{\hat{f}(t)}\,dt. \]

The right-hand side of Equation (12) is equal to
\[ E[g(X)f(X)]=\int_{-\infty}^{\infty}g(x)f(x)p(x)\,dx=\langle gp, f\rangle_{L^2(\mathbb{R})}. \]

Thus, Equation (12) is equivalent to the following equation:
(13)   $\langle gp, f\rangle_{L^2(\mathbb{R})}=\sigma\int_{-\infty}^{\infty}\exp(-\sqrt{-1}\,at)\,\overline{\hat{f}(t)}\,dt.$

Let us define $h_{n,a}(x):=h_n(x-a)$ and $h_a(x):=h(x-a)$. Then $h_{n,a},h_a\in L^2(\mathbb{R})$. It is easy to see that $\|h_{n,a}-h_a\|_2=\|h_n-h\|_2=\|r_n-r\|_2\to 0$ as $n\to\infty$. Hence $\lim_{n\to\infty}h_{n,a}=h_a$ in $L^2(\mathbb{R})$. Since $\hat{h}_{n,a}(t)=\exp(-\sqrt{-1}\,at)\hat{h}_n(t)$ by Proposition 6, we have
\[ \int_{-\infty}^{\infty}\big|\hat{h}_{n,a}(t)\big|^2\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\int_{-\infty}^{\infty}\big|\hat{h}_n(t)\big|^2\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=\int_{-\infty}^{\infty}\big|r_n(t)\big|^2\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt=2\int_1^n\frac{1}{t^2}\exp\Big(\frac{\sigma^2}{2}t^2\Big)dt<\infty, \]

which indicates that $h_{n,a}\in\mathcal{H}_G$. Substituting $h_{n,a}$ for $f$, Equation (13) becomes
(14)   $\langle gp, h_{n,a}\rangle_{L^2(\mathbb{R})}=\sigma\int_{-\infty}^{\infty}\exp(-\sqrt{-1}\,at)\,\overline{\hat{h}_{n,a}(t)}\,dt.$

As $n$ goes to infinity, the left-hand side of Equation (14) converges to $\langle gp, h_a\rangle_{L^2(\mathbb{R})}\in\mathbb{R}$. On the other hand, the right-hand side of Equation (14) becomes
\[ \sigma\int_{-\infty}^{\infty}\exp(-\sqrt{-1}\,at)\,\overline{\exp(-\sqrt{-1}\,at)\hat{h}_n(t)}\,dt=\sigma\int_{-\infty}^{\infty}\overline{\hat{h}_n(t)}\,dt=\sigma\int_{-\infty}^{\infty}\overline{r_n(t)}\,dt=2\sigma\int_1^n\frac{1}{t}\,dt\to\infty \quad (n\to\infty). \]

This is a contradiction. Therefore, there exists no $g\in\mathcal{H}_G$ such that $C_{XX}g=k_a$. This completes the proof.

7. Supplementary material

Supplementary material for this article can be accessed here http://dx.doi.org/10.1080/23311835.2018.1447220.


Additional information

Funding

Kazunori Nakamoto was partially supported by JSPS KAKENHI [grant numbers JP23540044, JP15K04814].

Notes on contributors

Hisashi Johno

Hisashi Johno is a PhD student at the Department of Mathematical Sciences, Faculty of Medicine, University of Yamanashi, Japan. His current research interests include probability theory and interpretable machine learning.

Kazunori Nakamoto

Kazunori Nakamoto is a professor of mathematics at Center for Medical Education and Sciences, Faculty of Medicine, University of Yamanashi, Japan. His main research interests include algebraic geometry, invariant theory, and the moduli of representations.

Tatsuhiko Saigo

Tatsuhiko Saigo is an associate professor of probability and statistics at Center for Medical Education and Sciences, Faculty of Medicine, University of Yamanashi, Japan. His research fields include probability theory, statistics, and applied mathematics.

References

  • Fukumizu, K. (2014). Introduction to kernel methods (in Japanese). Tokyo: Asakura Shoten.
  • Fukumizu, K., Song, L., & Gretton, A. (2013). Kernel Bayes’ rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14, 3753–3783.
  • Horn, R. A., & Johnson, C. R. (2013). Matrix analysis (2nd ed.). Cambridge: Cambridge University Press.
  • Kanagawa, M., Nishiyama, Y., Gretton, A., & Fukumizu, K. (2014). Monte Carlo filtering using kernel embedding of distributions. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (pp. 1897–1903).
  • Rudin, W. (1987). Real and complex analysis (3rd ed.). New York, NY: McGraw-Hill.
  • Song, L., Fukumizu, K., & Gretton, A. (2014). Kernel embeddings of conditional distributions. IEEE Signal Processing Magazine, 30, 98–111.
  • Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 961–968).
  • Steinwart, I., & Christmann, A. (2008). Support vector machines. New York, NY: Springer.