PURE MATHEMATICS

Approximation by finite mixtures of continuous density functions that vanish at infinity

Article: 1750861 | Received 04 Dec 2019, Accepted 29 Mar 2020, Published online: 28 Apr 2020

Abstract

It is often claimed that, given sufficiently many components, finite mixture models can approximate any other probability density function (pdf) to an arbitrary degree of accuracy. Unfortunately, the nature of this approximation result is often left unclear. We prove that finite mixture models constructed from pdfs in $\mathcal{C}_0$ can be used to conduct approximation of various classes of approximands in a number of different modes. That is, we prove that approximands in $\mathcal{C}_0$ can be uniformly approximated, approximands in $\mathcal{C}_b$ can be uniformly approximated on compact sets, and approximands in $\mathcal{L}_p$ can be approximated with respect to the $\mathcal{L}_p$ norm, for $p\in[1,\infty)$. Furthermore, we also prove that measurable functions can be approximated, almost everywhere.

PUBLIC INTEREST STATEMENT

Finite mixture models are an expansive and expressive class of probability models that have been successfully applied in many situations where data follow a complex generative process that may be highly heterogeneous. It has long been known that finite mixture models, under sufficient regularity conditions, can approximate any probability density function to an arbitrary degree of accuracy, and such results have been established under varying assumptions. Our work seeks to provide the weakest set of assumptions under which approximation theoretic results can be established over the widest possible class of probability density problems. The result provides further evidence towards the success of mixture models in applications, and provides mathematical guarantees to practitioners who apply mixture models in their analytic problems.

1. Introduction

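Let $x$ be an element of the Euclidean space, defined by $\mathbb{R}^n$ and the norm $\|\cdot\|_2$, for some $n\in\mathbb{N}$. Let $f:\mathbb{R}^n\to\mathbb{R}$ be a function, such that $f\ge 0$, everywhere, and $\int f\,\mathrm{d}\lambda=1$, where $\lambda$ is the Lebesgue measure. We say that $f$ is a probability density function (pdf) on the domain $\mathbb{R}^n$ (an expression that we will drop, from hereon in). Let $g:\mathbb{R}^n\to\mathbb{R}$ be another pdf, and for each $m\in\mathbb{N}$, define the functional class:

$$\mathcal{M}_m^g=\left\{h:h(x)=\sum_{i=1}^{m}c_i\frac{1}{\sigma_i^{n}}g\left(\frac{x-\mu_i}{\sigma_i}\right),\ \mu_i\in\mathbb{R}^n,\ \sigma_i\in\mathbb{R}_+,\ c\in\mathbb{S}^{m-1},\ i\in[m]\right\},$$

where $c^\top=(c_1,\dots,c_m)$, $\mathbb{R}_+=(0,\infty)$,

$$\mathbb{S}^{m-1}=\left\{c\in\mathbb{R}^m:\sum_{i=1}^{m}c_i=1\ \text{and}\ c_i\ge 0,\ i\in[m]\right\},$$

$[m]=\{1,\dots,m\}$, and $(\cdot)^\top$ is the matrix transposition operator. We say that any $h\in\mathcal{M}_m^g$ is an $m$-component location-scale finite mixture of the pdf $g$.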
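The definition of the mixture class can be made concrete with a minimal Python sketch. It assumes $n=1$ and takes $g$ to be the standard normal pdf; the function names and the three-component parameter values are hypothetical choices made for illustration only. The sketch evaluates a location-scale mixture and numerically checks that it integrates to one.

```python
import math

def normal_pdf(x):
    """Standard normal pdf on the real line (n = 1); stands in for g."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mixture_pdf(x, weights, locations, scales, g=normal_pdf, n=1):
    """Evaluate h(x) = sum_i c_i * sigma_i^(-n) * g((x - mu_i) / sigma_i)."""
    if any(c < 0 for c in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must lie in the probability simplex")
    if any(s <= 0 for s in scales):
        raise ValueError("scales must be strictly positive")
    return sum(c * g((x - mu) / s) / s ** n
               for c, mu, s in zip(weights, locations, scales))

def trapezoid(func, lo, hi, steps):
    """Composite trapezoid rule for the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    total += sum(func(lo + j * width) for j in range(1, steps))
    return total * width

# Hypothetical 3-component example (values chosen only for illustration).
weights = [0.5, 0.3, 0.2]
locations = [-2.0, 0.0, 3.0]
scales = [0.5, 1.0, 2.0]

def h(x):
    return mixture_pdf(x, weights, locations, scales)

# Numerical check that the mixture is itself a pdf (mass close to 1).
total_mass = trapezoid(h, -30.0, 30.0, 60000)
```

Because the weights lie in the simplex and each rescaled component is a pdf, the mixture inherits non-negativity and unit mass; the quadrature only confirms this numerically on a truncated interval.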

The study of pdfs in the class $\mathcal{M}_m^g$ is an evergreen area of applied and technical research in statistics. We point the interested reader to the many comprehensive books on the topic, such as Everitt & Hand (1981), Titterington et al. (1985), McLachlan & Basford (1988), Lindsay (1995), McLachlan & Peel (2000), Frühwirth-Schnatter (2006), Schlattmann (2009), Mengersen et al. (2011), and Frühwirth-Schnatter et al. (2019).

Much of the popularity of finite mixture models stems from the folk theorem, which states that for any density $f$, there exists an $h\in\mathcal{M}_m^g$, for some sufficiently large number of components $m\in\mathbb{N}$, such that $h$ approximates $f$ arbitrarily closely, in some sense. Examples of this folk theorem come in statements such as “provided the number of component densities is not bounded above, certain forms of mixture can be used to provide arbitrarily close approximation to a given probability distribution” (Titterington et al., 1985, p. 50), “the [mixture] model forms can fit any distribution and significantly increase model fit” (Walker and Ben-Akiva, 2011, p. 173), and “a mixture model can approximate almost any distribution” (Yona, 2011, p. 500). Other statements conveying the same sentiment are reported in Nguyen & McLachlan (2019). These statements are vague, and little is ever made clear regarding the technical nature of the folk theorem.

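In order to proceed, we require the following definitions. We say that $f$ is compactly supported on $\mathbb{K}\subset\mathbb{R}^n$, if $\mathbb{K}$ is compact and if $\mathbf{1}_{\mathbb{K}^\complement}f=0$, where $\mathbf{1}_{\mathbb{X}}$ is the indicator function that takes value 1 when $x\in\mathbb{X}$ and 0, elsewhere, and $(\cdot)^\complement$ is the set complement operator (i.e., $\mathbb{X}^\complement=\mathbb{R}^n\setminus\mathbb{X}$). Here, $\mathbb{X}$ is a generic subset of $\mathbb{R}^n$. Furthermore, we say that $f\in\mathcal{L}_p(\mathbb{X})$ for any $1\le p<\infty$, if

$$\|f\|_{\mathcal{L}_p(\mathbb{X})}=\left(\int\left|\mathbf{1}_{\mathbb{X}}f\right|^{p}\mathrm{d}\lambda\right)^{1/p}<\infty,$$

and for $p=\infty$, if

$$\|f\|_{\mathcal{L}_\infty(\mathbb{X})}=\inf\left\{a\ge 0:\lambda\left(\left\{x\in\mathbb{X}:|f(x)|>a\right\}\right)=0\right\}<\infty,$$

where we call $\|\cdot\|_{\mathcal{L}_p(\mathbb{X})}$ the $\mathcal{L}_p$-norm on $\mathbb{X}$. When $\mathbb{X}=\mathbb{R}^n$, we shall write $\|\cdot\|_{\mathcal{L}_p(\mathbb{R}^n)}=\|\cdot\|_{\mathcal{L}_p}$. In addition, we define the so-called Kullback-Leibler divergence (see Kullback & Leibler, 1951) between any two pdfs $f$ and $g$ on $\mathbb{X}$ as

$$\mathrm{KL}_{\mathbb{X}}(f,g)=\int\mathbf{1}_{\mathbb{X}}f\log\frac{f}{g}\,\mathrm{d}\lambda.$$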

In Nguyen & McLachlan (2019), the approximation of pdfs $f$ by the class $\mathcal{M}_m^g$ was explored in a restrictive setting. Let $\{h_m^g\}$ be a sequence of functions that draw elements from the nested sequence of sets $\{\mathcal{M}_m^g\}$ (i.e., $h_1^g\in\mathcal{M}_1^g,h_2^g\in\mathcal{M}_2^g,\dots$). The following result of Zeevi & Meir (1997) was presented in Nguyen & McLachlan (2019), along with a collection of its implications, such as the results from Li & Barron (1999) and Rakhlin et al. (2005).

Theorem 1 (Zeevi and Meir, 1997). If

$$f\in\left\{f:\mathbf{1}_{\mathbb{K}}f\ge\beta,\ \beta>0\right\}\cap\mathcal{L}_2(\mathbb{K})$$

and $g$ are pdfs and $\mathbb{K}$ is compact, then there exists a sequence $\{h_m^g\}$ such that

$$\lim_{m\to\infty}\|f-h_m^g\|_{\mathcal{L}_2(\mathbb{K})}=0\quad\text{and}\quad\lim_{m\to\infty}\mathrm{KL}_{\mathbb{K}}(f,h_m^g)=0.$$

Although powerful, this result is restrictive in the sense that it only permits approximation in the $\mathcal{L}_2$ norm on compact sets $\mathbb{K}$, and that the result only allows for approximation of functions $f$ that are strictly positive on $\mathbb{K}$. In general, other modes of approximation are desirable. In particular, approximation in the $\mathcal{L}_p$-norm for $p=1$ or $p=\infty$ is of interest, where the latter case is generally referred to as uniform approximation. Furthermore, the strict-positivity assumption and the restriction to compact sets limit the scope of applicability of Theorem 1. An example of an interesting application of extensions beyond Theorem 1 is within the $\mathcal{L}_1$-norm approximation framework of Devroye & Lugosi (2000).

Let $g:\mathbb{R}^n\to\mathbb{R}$ again be a pdf. Then, for each $m\in\mathbb{N}$, we define

$$\mathcal{N}_m^g=\left\{h:h(x)=\sum_{i=1}^{m}c_i\frac{1}{\sigma_i^{n}}g\left(\frac{x-\mu_i}{\sigma_i}\right),\ \mu_i\in\mathbb{R}^n,\ \sigma_i\in\mathbb{R}_+,\ c_i\in\mathbb{R},\ i\in[m]\right\},$$

which we call the set of $m$-component location-scale linear combinations of the pdf $g$. In the past, results regarding approximations of pdfs $f$ via functions $\eta\in\mathcal{N}_m^g$ have been more forthcoming. For example, consider the case of $g=\phi$, where

$$\phi(x)=(2\pi)^{-n/2}\exp\left(-\|x\|_2^2/2\right)\tag{1}$$

is the standard normal pdf. Denote the class of continuous functions with support on $\mathbb{R}^n$ by $\mathcal{C}$. We have the result that for every pdf $f$, compact set $\mathbb{K}\subset\mathbb{R}^n$, and $\epsilon>0$, there exists an $m\in\mathbb{N}$ and $h\in\mathcal{N}_m^\phi$, such that $\|f-h\|_{\mathcal{L}_\infty(\mathbb{K})}<\epsilon$ (Sandberg, 2001, Lem. 1). Furthermore, upon defining the set of continuous functions that vanish at infinity by

$$\mathcal{C}_0=\left\{f\in\mathcal{C}:\forall\epsilon>0,\ \exists\ \text{a compact}\ \mathbb{K}\subset\mathbb{R}^n,\ \text{such that}\ \|f\|_{\mathcal{L}_\infty(\mathbb{K}^\complement)}<\epsilon\right\},$$

we also have the result: for every pdf $f\in\mathcal{C}_0$ and $\epsilon>0$, there exists an $m\in\mathbb{N}$ and $h\in\mathcal{N}_m^\phi$, such that $\|f-h\|_{\mathcal{L}_\infty}<\epsilon$ (Sandberg, 2001, Thm. 2). Both of the results from Sandberg (2001) are simple implications of the famous Stone-Weierstrass theorem (cf. Stone, 1948 and De Branges, 1959).

To the best of our knowledge, the strongest available claim that is made regarding the folk theorem, within a probabilistic or statistical context, is that of DasGupta (2008, Thm. 33.2). Let $\{\eta_m^g\}$ be a sequence of functions that draw elements from the nested sequence of sets $\{\mathcal{N}_m^g\}$, in the same manner as $\{h_m^g\}$. We paraphrase the claim without loss of fidelity, as follows:

Claim 1. If $f,g\in\mathcal{C}$ are pdfs and $\mathbb{K}\subset\mathbb{R}^n$ is compact, then there exists a sequence $\{\eta_m^g\}$, such that

$$\lim_{m\to\infty}\|f-\eta_m^g\|_{\mathcal{L}_\infty(\mathbb{K})}=0.$$

Unfortunately, the proof of Claim 1 is not provided within DasGupta (2008). The only reference given for the result is to an unspecified location in Cheney & Light (2000), which, upon investigation, can be inferred to be Theorem 5 of Cheney & Light (2000, Ch. 20). It is further notable that there is no proof provided for the theorem. Instead, it is stated that the proof is similar to that of Theorem 1 in Cheney & Light (2000, Ch. 24), which is a reproduction of the proof for Xu et al. (1993, Lem. 3.1).

There is a major problem in applying the proof technique of Xu et al. (1993, Lem. 3.1) in order to prove Claim 1. The proof of Xu et al. (1993, Lem. 3.1) critically depends upon the statement that “there is no loss of generality in assuming that $f(x)=0$ for $x\in\mathbb{R}^n\setminus 2\mathbb{K}$”. Here, for $a\in\mathbb{R}_+$, $a\mathbb{K}=\{x\in\mathbb{R}^n:x=ay,\ y\in\mathbb{K}\}$. The assumption is necessary in order to write any convolution with $f$ and an arbitrary continuous function as an integral over a compact domain, and then to use a Riemann sum to approximate such an integral. Consequently, such a proof technique does not work outside the class of continuous functions that are compactly supported on $a\mathbb{K}$. Thus, one cannot verify Claim 1 from the materials of Xu et al. (1993), Cheney & Light (2000), and DasGupta (2008), alone.

Some recent results in the spirit of Claim 1 have been obtained by Nestoridis & Stefanopoulos (2007) and Nestoridis et al. (2011), using methods from the study of universal series (see, for example, Nestoridis & Papadimitropoulos, 2005).

Let

$$\mathcal{W}=\left\{f\in\mathcal{C}_0:\sum_{y\in\mathbb{Z}^n}\sup_{x\in[0,1]^n}|f(x+y)|<\infty\right\}$$

denote the so-called Wiener’s algebra (see, e.g., Feichtinger, 1977) and let

$$\mathcal{V}=\left\{f\in\mathcal{C}_0:\forall x\in\mathbb{R}^n,\ |f(x)|\le\beta\left(1+\|x\|_2\right)^{-n-\theta},\ \beta,\theta\in\mathbb{R}_+\right\}$$

be a class of functions with tails decaying at a faster rate than $o(\|x\|_2^{-n})$. In Nestoridis et al. (2011), it is noted that $\mathcal{V}\subset\mathcal{W}$. Further, let

$$\mathcal{C}_c=\left\{f\in\mathcal{C}:\exists\ \text{a compact set}\ \mathbb{K},\ \text{such that}\ \mathbf{1}_{\mathbb{K}^\complement}f=0\right\}$$

denote the set of compactly supported continuous functions. The following theorem was proved in Nestoridis & Stefanopoulos (2007).

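Theorem 2 (Nestoridis and Stefanopoulos, 2007, Thm. 3.2). If $g\in\mathcal{V}$, then the following statements hold.

(a) For any $f\in\mathcal{C}_c$, there exists a sequence $\{\eta_m^g\}$ ($\eta_m^g\in\mathcal{N}_m^g$), such that

$$\lim_{m\to\infty}\left(\|f-\eta_m^g\|_{\mathcal{L}_1}+\|f-\eta_m^g\|_{\mathcal{L}_\infty}\right)=0.$$

(b) For any $f\in\mathcal{C}_0$, there exists a sequence $\{\eta_m^g\}$ ($\eta_m^g\in\mathcal{N}_m^g$), such that

$$\lim_{m\to\infty}\|f-\eta_m^g\|_{\mathcal{L}_\infty}=0.$$

(c) For any $1\le p<\infty$ and $f\in\mathcal{L}_p$, there exists a sequence $\{\eta_m^g\}$ ($\eta_m^g\in\mathcal{N}_m^g$), such that

$$\lim_{m\to\infty}\|f-\eta_m^g\|_{\mathcal{L}_p}=0.$$

(d) For any measurable $f$, there exists a sequence $\{\eta_m^g\}$ ($\eta_m^g\in\mathcal{N}_m^g$), such that

$$\lim_{m\to\infty}\eta_m^g=f,\ \text{almost everywhere}.$$

(e) If $\nu$ is a $\sigma$-finite Borel measure on $\mathbb{R}^n$, then for any $\nu$-measurable $f$, there exists a sequence $\{\eta_m^g\}$ ($\eta_m^g\in\mathcal{N}_m^g$), such that

$$\lim_{m\to\infty}\eta_m^g=f,$$

almost everywhere, with respect to $\nu$.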

The result was then improved upon in Nestoridis et al. (2011), whereupon the more general space $\mathcal{W}$ was taken as a replacement for $\mathcal{V}$ in Theorem 2. Denote the class of bounded continuous functions by $\mathcal{C}_b=\mathcal{C}\cap\mathcal{L}_\infty$. The following theorem was proved in Nestoridis et al. (2011).

Theorem 3 (Nestoridis et al., 2011, Thm. 3.2). If $g\in\mathcal{W}$, then the following statements are true.

(a) The conclusion of Theorem 2(a) holds, with $\mathcal{C}_c$ replaced by $\mathcal{C}_0\cap\mathcal{L}_1$.

(b) The conclusions of Theorem 2(b)–(e) hold.

(c) For any $f\in\mathcal{C}_b$ and compact $\mathbb{K}\subset\mathbb{R}^n$, there exists a sequence $\{\eta_m^g\}$, such that

$$\lim_{m\to\infty}\|f-\eta_m^g\|_{\mathcal{L}_\infty(\mathbb{K})}=0.$$

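Utilizing the techniques from Nestoridis & Stefanopoulos (2007), Bacharoglou (2010) proved a similar set of results to Theorem 2, under the restriction that $f$ is a non-negative function with support $\mathbb{R}$, using $g=\phi$ (i.e., $g$ has form (1), where $n=1$) and taking $\{h_m^\phi\}$ as the approximating sequence, instead of $\{\eta_m^g\}$. That is, the following result is obtained.

Theorem 4 (Bacharoglou, 2010, Cor. 2.5). If $f:\mathbb{R}\to\mathbb{R}_+\cup\{0\}$, then the following statements are true.

(a) For any pdf $f\in\mathcal{C}_c$, there exists a sequence $\{h_m^\phi\}$ ($h_m^\phi\in\mathcal{M}_m^\phi$), such that

$$\lim_{m\to\infty}\left(\|f-h_m^\phi\|_{\mathcal{L}_1}+\|f-h_m^\phi\|_{\mathcal{L}_\infty}\right)=0.$$

(b) For any $f\in\mathcal{C}_0$, such that $\|f\|_{\mathcal{L}_1}\le 1$, there exists a sequence $\{h_m^\phi\}$ ($h_m^\phi\in\mathcal{M}_m^\phi$), such that

$$\lim_{m\to\infty}\|f-h_m^\phi\|_{\mathcal{L}_\infty}=0.$$

(c) For any $1<p<\infty$ and $f\in\mathcal{C}\cap\mathcal{L}_p$, such that $\|f\|_{\mathcal{L}_1}\le 1$, there exists a sequence $\{h_m^\phi\}$ ($h_m^\phi\in\mathcal{M}_m^\phi$), such that

$$\lim_{m\to\infty}\|f-h_m^\phi\|_{\mathcal{L}_p}=0.$$

(d) For any measurable $f$, there exists a sequence $\{h_m^\phi\}$ ($h_m^\phi\in\mathcal{M}_m^\phi$), such that

$$\lim_{m\to\infty}h_m^\phi=f,\ \text{almost everywhere}.$$

(e) For any pdf $f\in\mathcal{C}$, there exists a sequence $\{h_m^\phi\}$ ($h_m^\phi\in\mathcal{M}_m^\phi$), such that

$$\lim_{m\to\infty}\|f-h_m^\phi\|_{\mathcal{L}_1}=0.$$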

To the best of our knowledge, Theorem 4 is the most complete characterization of the approximating capabilities of the mixture of normal distributions. However, it is restrictive in two ways. First, it does not permit the characterization of approximation via the class $\mathcal{M}_m^g$ for any $g$ except the normal pdf $\phi$. Although $\phi$ is traditionally the most common choice for $g$ in practice, the modern mixture model literature has seen the use of many more exotic component pdfs, such as the Student-t pdf and its skew and modified variants (see, e.g., Peel & McLachlan, 2000, Forbes & Wraith, 2013, and Lee & McLachlan, 2016). Thus, its use is somewhat limited in the modern context. Second, modern applications tend to call for $n>1$, further restricting the impact of the result as a theoretical bulwark for finite mixture modeling in practice. A remark in Bacharoglou (2010) states that the result can be generalized to the case where $g\in\mathcal{V}$ instead of $g=\phi$. However, no suggestions were proposed regarding the generalization of Theorem 4 to the case of $n>1$.

In this article, we prove a novel set of results that largely generalize Theorem 4. Using techniques inspired by Donahue et al. (1997) and Cheney & Light (2000), we are able to obtain a set of results regarding the approximation capability of the class of $m$-component mixture models $\mathcal{M}_m^g$, when $g\in\mathcal{C}_0$ or $g\in\mathcal{V}$, and for any $n\in\mathbb{N}$. By definition of $\mathcal{V}$, the majority of our results extend beyond the proposed possible generalizations of Theorem 4.

The article proceeds as follows. Our main theorem is stated and its separate parts are proved in Section 2. Comments and discussion are provided in Section 3. Necessary technical lemmas and results are also included, for reference, in the Appendix.

2. Main result

The remainder of the article is devoted to proving the following theorem.

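Theorem 5 (Main result). If we assume that $f$ and $g$ are pdfs and that $g\in\mathcal{C}_0$, then the following statements are true.

(a) For any $f\in\mathcal{C}_0$, there exists a sequence $\{h_m^g\}$ ($h_m^g\in\mathcal{M}_m^g$), such that

$$\lim_{m\to\infty}\|f-h_m^g\|_{\mathcal{L}_\infty}=0.$$

(b) For any $f\in\mathcal{C}_b$ and compact $\mathbb{K}\subset\mathbb{R}^n$, there exists a sequence $\{h_m^g\}$ ($h_m^g\in\mathcal{M}_m^g$), such that

$$\lim_{m\to\infty}\|f-h_m^g\|_{\mathcal{L}_\infty(\mathbb{K})}=0.$$

(c) For any $1<p<\infty$ and $f\in\mathcal{L}_p$, there exists a sequence $\{h_m^g\}$ ($h_m^g\in\mathcal{M}_m^g$), such that

$$\lim_{m\to\infty}\|f-h_m^g\|_{\mathcal{L}_p}=0.$$

(d) For any measurable $f$, there exists a sequence $\{h_m^g\}$ ($h_m^g\in\mathcal{M}_m^g$), such that

$$\lim_{m\to\infty}h_m^g=f,\ \text{almost everywhere}.$$

(e) If $\nu$ is a $\sigma$-finite Borel measure on $\mathbb{R}^n$, then for any $\nu$-measurable $f$, there exists a sequence $\{h_m^g\}$ ($h_m^g\in\mathcal{M}_m^g$), such that

$$\lim_{m\to\infty}h_m^g=f,$$

almost everywhere, with respect to $\nu$.

If we assume instead that $g\in\mathcal{V}$, then the following statement is also true.

(f) For any $f\in\mathcal{C}$, there exists a sequence $\{h_m^g\}$ ($h_m^g\in\mathcal{M}_m^g$), such that

$$\lim_{m\to\infty}\|f-h_m^g\|_{\mathcal{L}_1}=0.$$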

2.1. Technical preliminaries

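Before we begin to prove the main theorem, we establish some technical results regarding our class of component densities $\mathcal{C}_0$. Let $f,g\in\mathcal{L}_1$ and denote the convolution of $f$ and $g$ by $f\star g=g\star f$. Further, we denote the sequence of dilates of $g$ by $\{g_k:g_k(x)=k^{n}g(kx),\ k\in\mathbb{N}\}$. The following result is an alternative to Lemma 5 and Corollary 1. Here, we replace the boundedness assumption on the approximand, in the aforementioned results, by a vanishing at infinity assumption.

Lemma 1. Let $g$ be a pdf and $f\in\mathcal{C}_0$, such that $\|f\|_{\mathcal{L}_\infty}>0$. Then,

$$\lim_{k\to\infty}\|g_k\star f-f\|_{\mathcal{L}_\infty}=0.$$

Proof. It suffices to show that for any $\epsilon>0$, there exists a $k_\epsilon\in\mathbb{N}$, such that $\|g_k\star f-f\|_{\mathcal{L}_\infty}<\epsilon$, for all $k\ge k_\epsilon$. By Lemma 6, $f\in\mathcal{C}_b$, and thus $\|f\|_{\mathcal{L}_\infty}<\infty$. By making the substitution $z=kx$, we obtain for each $k$

$$\int g_k(x)\,\mathrm{d}\lambda=\int k^{n}g(kx)\,\mathrm{d}\lambda=\int g(z)\,\mathrm{d}\lambda=1.$$

By Corollary 1, we obtain $\lim_{k\to\infty}\int\mathbf{1}_{\{x:\|x\|_2>\delta\}}g_k\,\mathrm{d}\lambda=0$ and thus we can choose a $k_\epsilon$, such that

$$\int\mathbf{1}_{\{x:\|x\|_2>\delta\}}g_k\,\mathrm{d}\lambda<\frac{\epsilon}{4\|f\|_{\mathcal{L}_\infty}}.$$

Since $g$ is a pdf, we have

$$\left|(g_k\star f)(x)-f(x)\right|=\left|\int g_k(y)\left[f(x-y)-f(x)\right]\mathrm{d}\lambda(y)\right|\le\int g_k(y)\left|f(x-y)-f(x)\right|\mathrm{d}\lambda(y).$$

By uniform continuity, for any $\epsilon>0$, there exists a $\delta_\epsilon>0$ such that $|f(x-y)-f(x)|<\epsilon/2$, for any $x,y\in\mathbb{R}^n$, such that $\|y\|_2<\delta_\epsilon$ (Lemma 6). Thus, on the one hand, for any $\delta_\epsilon$, we can pick a $k_\epsilon$ such that

$$\int\mathbf{1}_{\{y:\|y\|_2>\delta_\epsilon\}}g_k(y)\left|f(x-y)-f(x)\right|\mathrm{d}\lambda(y)\le 2\|f\|_{\mathcal{L}_\infty}\int\mathbf{1}_{\{y:\|y\|_2>\delta_\epsilon\}}g_k\,\mathrm{d}\lambda\le 2\|f\|_{\mathcal{L}_\infty}\times\frac{\epsilon}{4\|f\|_{\mathcal{L}_\infty}}=\frac{\epsilon}{2},\tag{2}$$

and on the other hand

$$\int\mathbf{1}_{\{y:\|y\|_2\le\delta_\epsilon\}}g_k(y)\left|f(x-y)-f(x)\right|\mathrm{d}\lambda(y)\le\frac{\epsilon}{2}\int\mathbf{1}_{\{y:\|y\|_2\le\delta_\epsilon\}}g_k\,\mathrm{d}\lambda\le\frac{\epsilon}{2}\times 1=\frac{\epsilon}{2}.\tag{3}$$

The proof is completed by summing (2) and (3).□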
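The uniform convergence in Lemma 1 can be observed numerically in a special case. The minimal sketch below assumes $n=1$ and takes both $f$ and $g$ to be the standard normal pdf, so that $g_k\star f$ is the pdf of a centred normal law with variance $1+1/k^2$ (a standard fact about sums of independent normal variables). The function names, the grid, and the values of $k$ are illustrative assumptions, and the supremum is only approximated over a finite grid.

```python
import math

def normal_pdf(x, sd=1.0):
    """Centred normal pdf with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def smoothed_pdf(x, k):
    """(g_k * f)(x) when f and g are both standard normal: N(0, 1 + 1/k^2)."""
    return normal_pdf(x, sd=math.sqrt(1.0 + 1.0 / k ** 2))

def sup_error(k, half_width=10.0, steps=4000):
    """Grid approximation of the sup-norm distance between g_k * f and f."""
    grid = [-half_width + 2.0 * half_width * j / steps for j in range(steps + 1)]
    return max(abs(smoothed_pdf(x, k) - normal_pdf(x)) for x in grid)

# The sup-norm error shrinks as the dilation index k grows.
errors = [sup_error(k) for k in (1, 2, 4, 8, 16)]
```

In this special case the error decreases monotonically in $k$, which is the behaviour Lemma 1 guarantees for any pdf $g$ and any $f\in\mathcal{C}_0$.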

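Lemma 2. If $f\in\mathcal{C}_0$ is such that $f\ge 0$, and $\epsilon>0$, then there exists an $h\in\mathcal{C}_c$, such that $0\le h\le f$, and

$$\|f-h\|_{\mathcal{L}_\infty}<\epsilon.$$

Proof. Since $f\in\mathcal{C}_0$, there exists a compact $\mathbb{K}\subset\mathbb{R}^n$ such that $\|f\|_{\mathcal{L}_\infty(\mathbb{K}^\complement)}<\epsilon/2$. By Lemma 7, there exists some $g\in\mathcal{C}_c$, such that $0\le g\le 1$ and $\mathbf{1}_{\mathbb{K}}g=\mathbf{1}_{\mathbb{K}}$. Let $h=gf$, which implies that $h\ge 0$ and $0\le h\le f$. Furthermore, notice that $\mathbf{1}_{\mathbb{K}}(f-h)=0$ and $\|h\|_{\mathcal{L}_\infty}\le\|f\|_{\mathcal{L}_\infty}$, by construction. The proof is completed by observing that

$$\|f-h\|_{\mathcal{L}_\infty}=\|f-h\|_{\mathcal{L}_\infty(\mathbb{K}^\complement)}\le\|f\|_{\mathcal{L}_\infty(\mathbb{K}^\complement)}+\|h\|_{\mathcal{L}_\infty(\mathbb{K}^\complement)}\le 2\|f\|_{\mathcal{L}_\infty(\mathbb{K}^\complement)}<\epsilon.$$

□

For any $\delta>0$ and uniformly continuous function $f$, let

$$w(f,\delta)=\sup_{\{x,y\in\mathbb{R}^n:\|x-y\|_2\le\delta\}}|f(x)-f(y)|$$

denote the modulus of continuity of $f$. Furthermore, define the diameter of a set $\mathbb{X}\subset\mathbb{R}^n$ by $\mathrm{diam}(\mathbb{X})=\sup_{x,y\in\mathbb{X}}\|x-y\|_2$ and denote an open ball, centered at $x\in\mathbb{R}^n$ with radius $r>0$, by $\mathbb{B}(x,r)=\{y\in\mathbb{R}^n:\|x-y\|_2<r\}$.

Notice that the class $\mathcal{M}_m^g$ can be parameterized as

$$\mathcal{M}_m^g=\left\{h:h(x)=\sum_{i=1}^{m}c_ik_i^{n}g(k_ix-z_i),\ z_i\in\mathbb{R}^n,\ k_i\in\mathbb{R}_+,\ c\in\mathbb{S}^{m-1},\ i\in[m]\right\},$$

where $k_i=1/\sigma_i$ and $z_i=\mu_i/\sigma_i$. The following result is the primary mechanism that permits us to construct finite mixture approximations for convolutions of the form $g_k\star f$. The argument is motivated by the approaches taken in Theorem 1 in Cheney & Light (2000, Ch. 24), Nestoridis & Stefanopoulos (2007, Lem. 3.1), and Nestoridis et al. (2011, Thm. 3.1).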

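Lemma 3. Let $f\in\mathcal{C}$ and $g\in\mathcal{C}_0$ be pdfs. Furthermore, let $\mathbb{K}\subset\mathbb{R}^n$ be compact and $h\in\mathcal{C}_c$, where $\mathbf{1}_{\mathbb{K}^\complement}h=0$ and $0\le h\le f$. Then for any $k\in\mathbb{N}$, there exists a sequence $\{h_m^g\}$, such that

$$\lim_{m\to\infty}\|g_k\star h-h_m^g\|_{\mathcal{L}_\infty}=0.$$

Proof. It suffices to show that for any $k\in\mathbb{N}$ and $\epsilon>0$, there exists a sufficiently large $m(\epsilon)\in\mathbb{N}$ so that for all $m\ge m(\epsilon)$, there exists an $h_m^g\in\mathcal{M}_m^g$ such that

$$\|g_k\star h-h_m^g\|_{\mathcal{L}_\infty}<\epsilon.\tag{4}$$

For any $k\in\mathbb{N}$, we can write

$$\begin{aligned}(g_k\star h)(x)&=\int g_k(x-y)h(y)\,\mathrm{d}\lambda(y)\\&=\int\mathbf{1}_{\{y:y\in\mathbb{K}\}}g_k(x-y)h(y)\,\mathrm{d}\lambda(y)\\&=\int\mathbf{1}_{\{y:y\in\mathbb{K}\}}k^{n}g(kx-ky)h(y)\,\mathrm{d}\lambda(y)\\&=\int\mathbf{1}_{\{z:z\in k\mathbb{K}\}}g(kx-z)h\left(\frac{z}{k}\right)\mathrm{d}\lambda(z).\end{aligned}$$

Here, $k\mathbb{K}$ is the continuous image of a compact set, and hence is compact (cf. Rudin, 1976, Thm. 4.14). By Lemma 8, for any $\delta>0$, there exist $\kappa_i\in\mathbb{R}^n$ ($i\in[m-1]$, $m\in\mathbb{N}$), such that $k\mathbb{K}\subset\bigcup_{i=1}^{m-1}\mathbb{B}(\kappa_i,\delta/2)$. Further, if $\mathbb{B}_i^\delta=k\mathbb{K}\cap\mathbb{B}(\kappa_i,\delta/2)$, then we have $k\mathbb{K}=\bigcup_{i=1}^{m-1}\mathbb{B}_i^\delta$. We can obtain a disjoint covering of $k\mathbb{K}$ by taking $\mathbb{A}_1^\delta=\mathbb{B}_1^\delta$ and $\mathbb{A}_i^\delta=\mathbb{B}_i^\delta\setminus\bigcup_{j=1}^{i-1}\mathbb{B}_j^\delta$ ($i\in[m-1]$) and noting that $k\mathbb{K}=\bigcup_{i=1}^{m-1}\mathbb{A}_i^\delta$, by construction (cf. Cheney & Light, 2000, Ch. 24). Furthermore, each $\mathbb{A}_i^\delta$ is a Borel set and $\mathrm{diam}(\mathbb{A}_i^\delta)\le\delta$.

For convenience, let $\Pi_m^\delta=\{\mathbb{A}_i^\delta:i\in[m-1]\}$ denote the disjoint covering, or partition, of $k\mathbb{K}$. We seek to show that there exists an $m\in\mathbb{N}$ and $\Pi_m^\delta$, such that

$$\left\|g_k\star h-\sum_{i=1}^{m}c_ik_i^{n}g(k_ix-z_i)\right\|_{\mathcal{L}_\infty}<\epsilon,$$

where $k_i=k$,

$$c_i=k^{-n}\int\mathbf{1}_{\{z:z\in\mathbb{A}_i^\delta\}}h\left(\frac{z}{k}\right)\mathrm{d}\lambda(z),$$

and $z_i\in\mathbb{A}_i^\delta$, for $i\in[m-1]$.

Further, $z_m\in\mathbb{A}_{m-1}^\delta$ and $c_m=1-\sum_{i=1}^{m-1}c_i$, with $k_m$ chosen as follows. By Lemma 6, $g\le C<\infty$ for some positive $C$. Then, $\|c_mk_m^{n}g(k_mx-z_m)\|_{\mathcal{L}_\infty}\le c_mk_m^{n}C$. We may choose $k_m$ so that $k_m^{n}=\epsilon/(2c_mC)$, so that

$$\|c_mk_m^{n}g(k_mx-z_m)\|_{\mathcal{L}_\infty}\le\frac{\epsilon}{2}.$$

Since $0\le h\le f$, the sum of the $c_i$ ($i\in[m-1]$) satisfies the inequality

$$\begin{aligned}\sum_{i=1}^{m-1}c_i&=k^{-n}\sum_{i=1}^{m-1}\int\mathbf{1}_{\{z:z\in\mathbb{A}_i^\delta\}}h\left(\frac{z}{k}\right)\mathrm{d}\lambda\\&=k^{-n}\int\mathbf{1}_{\{z:z\in k\mathbb{K}\}}h\left(\frac{z}{k}\right)\mathrm{d}\lambda\\&=\int\mathbf{1}_{\{x:x\in\mathbb{K}\}}h\,\mathrm{d}\lambda\le\int\mathbf{1}_{\{x:x\in\mathbb{K}\}}f\,\mathrm{d}\lambda\le\int f\,\mathrm{d}\lambda=1.\end{aligned}$$

Thus, $0\le c_m\le 1$, and our construction implies that $h_m^g\in\mathcal{M}_m^g$, where

$$h_m^g(x)=\sum_{i=1}^{m}c_ik_i^{n}g(k_ix-z_i),\quad x\in\mathbb{R}^n.$$

We can bound the left-hand side of (4) as follows:

$$\begin{aligned}\|g_k\star h-h_m^g\|_{\mathcal{L}_\infty}&\le\left\|(g_k\star h)(x)-\sum_{i=1}^{m-1}c_ik_i^{n}g(k_ix-z_i)\right\|_{\mathcal{L}_\infty}+\|c_mk_m^{n}g(k_mx-z_m)\|_{\mathcal{L}_\infty}\\&\le\left\|(g_k\star h)(x)-\sum_{i=1}^{m-1}c_ik_i^{n}g(k_ix-z_i)\right\|_{\mathcal{L}_\infty}+\frac{\epsilon}{2}\\&=\left\|\int\mathbf{1}_{\{z:z\in k\mathbb{K}\}}g(kx-z)h\left(\frac{z}{k}\right)\mathrm{d}\lambda(z)-\sum_{i=1}^{m-1}\int\mathbf{1}_{\{z:z\in\mathbb{A}_i^\delta\}}g(kx-z_i)h\left(\frac{z}{k}\right)\mathrm{d}\lambda(z)\right\|_{\mathcal{L}_\infty}+\frac{\epsilon}{2}\\&\le\sum_{i=1}^{m-1}\int\mathbf{1}_{\{z:z\in\mathbb{A}_i^\delta\}}\|g(kx-z)-g(kx-z_i)\|_{\mathcal{L}_\infty}h\left(\frac{z}{k}\right)\mathrm{d}\lambda(z)+\frac{\epsilon}{2}.\end{aligned}\tag{5}$$

Since

$$\|(kx-z)-(kx-z_i)\|_2=\|z-z_i\|_2\le\mathrm{diam}(\mathbb{A}_i^\delta)\le\delta,$$

we have $|g(kx-z)-g(kx-z_i)|\le w(g,\delta)$, for each $i\in[m-1]$. Since $\lim_{\delta\to 0}w(g,\delta)=0$ (cf. Makarov & Podkorytov, 2013, Thm. 4.7.3), we may choose a $\delta_\epsilon>0$ so that $w(g,\delta_\epsilon)<\epsilon/(2k^{n})$. We may proceed from (5) as follows:

$$\begin{aligned}\|g_k\star h-h_m^g\|_{\mathcal{L}_\infty}&\le w(g,\delta_\epsilon)\int\mathbf{1}_{\{z:z\in k\mathbb{K}\}}h\left(\frac{z}{k}\right)\mathrm{d}\lambda+\frac{\epsilon}{2}\\&=w(g,\delta_\epsilon)k^{n}\int h\,\mathrm{d}\lambda+\frac{\epsilon}{2}\\&\le w(g,\delta_\epsilon)k^{n}+\frac{\epsilon}{2}\\&<\frac{\epsilon}{2}+\frac{\epsilon}{2}=\epsilon.\end{aligned}\tag{6}$$

To conclude the proof, it suffices to choose an appropriate sequence of partitions $\Pi_m^{\delta_\epsilon}$, $m\ge m(\epsilon)$, for some large but finite $m(\epsilon)$, so that (5) and (6) hold, which is possible by Lemma 8.□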
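The construction in the proof of Lemma 3 can be traced numerically. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it takes $n=1$, uses the standard normal pdf as $g$, a hypothetical tent function with mass $0.5$ on $\mathbb{K}=[-1,1]$ as $h$, equal-width cells as the partition, and cell midpoints as the points $z_i$. The leftover mass $c_m$ is placed on one very diffuse component, with $k_m$ chosen as in the proof, and the supremum is approximated on a finite grid.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def g(u):
    """Component density: standard normal pdf (an element of C_0, n = 1)."""
    return math.exp(-0.5 * u * u) / SQRT_2PI

def h(y):
    """Hypothetical compactly supported tent with mass 0.5 on K = [-1, 1]."""
    return 0.5 * max(0.0, 1.0 - abs(y))

def build_mixture(k, cells, eps):
    """Return (weights, dilations, shifts) for sum_i c_i k_i g(k_i x - z_i)."""
    width = 2.0 / cells
    mids = [-1.0 + (j + 0.5) * width for j in range(cells)]
    # c_i: mass of h on cell i (midpoint rule, exact for a tent when no cell straddles 0).
    weights = [h(y) * width for y in mids]
    dilations = [float(k)] * cells            # k_i = k for the first m - 1 components
    shifts = [k * y for y in mids]            # z_i in A_i, in z = k * y coordinates
    c_last = 1.0 - sum(weights)               # leftover mass c_m (assumed positive here)
    sup_g = g(0.0)                            # the constant C bounding g
    k_last = eps / (2.0 * c_last * sup_g)     # last term then has sup norm eps / 2
    return weights + [c_last], dilations + [k_last], shifts + [shifts[-1]]

def mixture(x, weights, dilations, shifts):
    """Evaluate the m-component mixture at x."""
    return sum(c * kk * g(kk * x - z) for c, kk, z in zip(weights, dilations, shifts))

def dilated_convolution(x, k, steps=2000):
    """Fine midpoint quadrature of (g_k * h)(x) over K = [-1, 1]."""
    width = 2.0 / steps
    return sum(k * g(k * (x - y)) * h(y) * width
               for y in (-1.0 + (j + 0.5) * width for j in range(steps)))

k, eps = 4, 0.02
weights, dilations, shifts = build_mixture(k, cells=50, eps=eps)
grid = [-3.0 + 0.05 * j for j in range(121)]
sup_error = max(abs(dilated_convolution(x, k) - mixture(x, weights, dilations, shifts))
                for x in grid)
```

The first $m-1$ components form a Riemann sum for the convolution integral, while the final diffuse component only restores the unit total weight; this is why its dilation is chosen small enough that it contributes at most $\epsilon/2$ in the uniform norm.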

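For any $r\in\mathbb{N}$, let $\bar{\mathbb{B}}_r=\{x\in\mathbb{R}^n:\|x\|_2\le r\}$ be a closed ball of radius $r$, centered at the origin.

Lemma 4. If $f\in\mathcal{L}_1$, such that $f\ge 0$, then

$$\lim_{r\to\infty}\|f-\mathbf{1}_{\bar{\mathbb{B}}_r}f\|_{\mathcal{L}_1}=0.$$

Proof. By construction, each element of the sequence $\mathbf{1}_{\bar{\mathbb{B}}_r}f$ ($r\in\mathbb{N}$) is measurable, $0\le\mathbf{1}_{\bar{\mathbb{B}}_r}f\le f$, and

$$\lim_{r\to\infty}\mathbf{1}_{\bar{\mathbb{B}}_r}f=f,$$

point-wise. We obtain our conclusion via the Lebesgue dominated convergence theorem.□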
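As a quick check of Lemma 4 in one concrete case, the sketch below assumes $n=1$ and takes $f$ to be the standard normal pdf, for which the truncation error $\|f-\mathbf{1}_{\bar{\mathbb{B}}_r}f\|_{\mathcal{L}_1}$ equals the two-sided tail probability $\mathrm{erfc}(r/\sqrt{2})$. The function name is a hypothetical choice.

```python
import math

def truncation_error(r):
    """L1 distance between the standard normal pdf and its restriction to [-r, r]."""
    return math.erfc(r / math.sqrt(2.0))

# The truncation error decays to zero as the radius r grows.
errors = {r: truncation_error(r) for r in (1, 2, 3, 4, 5)}
```

For this choice of $f$ the decay is very fast, so even a small radius already meets tolerances of the size used in the proofs, such as $\epsilon/24$.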

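2.2. Proof of Theorem 5(a)

We now proceed to prove each of the parts of Theorem 5. To prove Theorem 5(a) it suffices to show that for every $\epsilon>0$, there exists an $h_m^g\in\mathcal{M}_m^g$, such that $\|f-h_m^g\|_{\mathcal{L}_\infty}<\epsilon$.

Start by applying Lemma 2 to obtain $h\in\mathcal{C}_c$, such that $0\le h\le f$ and $\|f-h\|_{\mathcal{L}_\infty}<\epsilon/2$. Then, we have

$$\|f-h_m^g\|_{\mathcal{L}_\infty}\le\|f-h\|_{\mathcal{L}_\infty}+\|h-h_m^g\|_{\mathcal{L}_\infty}<\frac{\epsilon}{2}+\|h-h_m^g\|_{\mathcal{L}_\infty}.\tag{7}$$

The goal is to find an $h_m^g$, such that $\|h-h_m^g\|_{\mathcal{L}_\infty}<\epsilon/2$. Since $h\in\mathcal{C}_c$, we may find a compact $\mathbb{K}\subset\mathbb{R}^n$ such that $\|h\|_{\mathcal{L}_\infty(\mathbb{K}^\complement)}=0$. Apply Lemma 1 to show the existence of a $k_\epsilon$, such that

$$\|h-g_k\star h\|_{\mathcal{L}_\infty}<\frac{\epsilon}{4},$$

for all $k\ge k_\epsilon$. With a fixed $k=k_\epsilon$, apply Lemma 3 to show that there exists an $h_m^g\in\mathcal{M}_m^g$, such that

$$\|g_{k_\epsilon}\star h-h_m^g\|_{\mathcal{L}_\infty}<\frac{\epsilon}{4}.$$

By the triangle inequality, we have

$$\|h-h_m^g\|_{\mathcal{L}_\infty}\le\|h-g_{k_\epsilon}\star h\|_{\mathcal{L}_\infty}+\|g_{k_\epsilon}\star h-h_m^g\|_{\mathcal{L}_\infty}<\frac{\epsilon}{4}+\frac{\epsilon}{4}=\frac{\epsilon}{2}.\tag{8}$$

The proof is completed by substitution of (8) into (7).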

2.3. Proof of Theorem 5(b)

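For any $\epsilon>0$ and compact $\mathbb{K}\subset\mathbb{R}^n$, it suffices to show that there exists a sufficiently large $m(\epsilon)\in\mathbb{N}$ so that for all $m\ge m(\epsilon)$, there exists an $h_m^g\in\mathcal{M}_m^g$, such that $\|f-h_m^g\|_{\mathcal{L}_\infty(\mathbb{K})}<\epsilon$.

By Lemma 5, we can find a $k_{\epsilon,\mathbb{K}}\in\mathbb{N}$, such that

$$\|f-g_k\star f\|_{\mathcal{L}_\infty(\mathbb{K})}<\frac{\epsilon}{3},\tag{9}$$

for every $k\ge k_{\epsilon,\mathbb{K}}$. Since $g\in\mathcal{C}_0$, $\|g\|_{\mathcal{L}_\infty}\le C<\infty$ for some positive $C$, by Lemma 6. For any $k,r\in\mathbb{N}$, via Young’s convolution inequality:

$$\left\|g_k\star f-g_k\star\left(\mathbf{1}_{\bar{\mathbb{B}}_r}f\right)\right\|_{\mathcal{L}_\infty}\le k^{n}C\int\left(\mathbf{1}_{\bar{\mathbb{B}}_r^{\complement}}f\right)\mathrm{d}\lambda=k^{n}C\left\|f-\mathbf{1}_{\bar{\mathbb{B}}_r}f\right\|_{\mathcal{L}_1}.\tag{10}$$

For fixed $k$, we may choose $r_{\epsilon,\mathbb{K}}\in\mathbb{N}$, using Lemma 4, so that $\|f-\mathbf{1}_{\bar{\mathbb{B}}_r}f\|_{\mathcal{L}_1}\le\epsilon/(3k^{n}C)$ and thus the final term of (10) is bounded from above by $\epsilon/3$ for all $r\ge r_{\epsilon,\mathbb{K}}$. Thus, for $k=k_{\epsilon,\mathbb{K}}$ and $r\ge r_{\epsilon,\mathbb{K}}$,

$$\left\|g_{k_{\epsilon,\mathbb{K}}}\star f-g_{k_{\epsilon,\mathbb{K}}}\star\left(\mathbf{1}_{\bar{\mathbb{B}}_{r_{\epsilon,\mathbb{K}}}}f\right)\right\|_{\mathcal{L}_\infty}\le\frac{\epsilon}{3}.\tag{11}$$

Using Lemma 3, with approximand $\mathbf{1}_{\bar{\mathbb{B}}_{r_{\epsilon,\mathbb{K}}}}f$, component density $g$, compact set $\bar{\mathbb{B}}_{r_{\epsilon,\mathbb{K}}}$, $h=\mathbf{1}_{\bar{\mathbb{B}}_{r_{\epsilon,\mathbb{K}}}}f$, and with $k=k_{\epsilon,\mathbb{K}}$ fixed, we have the existence of a density $h_m^g\in\mathcal{M}_m^g$, $m\ge m(\epsilon)\in\mathbb{N}$, such that

$$\left\|g_{k_{\epsilon,\mathbb{K}}}\star\left(\mathbf{1}_{\bar{\mathbb{B}}_{r_{\epsilon,\mathbb{K}}}}f\right)-h_m^g\right\|_{\mathcal{L}_\infty}\le\frac{\epsilon}{3}.\tag{12}$$

We obtain the desired result by combining (9), (11), and (12), via the triangle inequality.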

2.4. Proof of Theorem 5(c)

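The technique used to prove Theorem 5(c) is different from those used in the previous sections. Here, we use a result of Donahue et al. (1997) that generalizes the classic Barron-Jones Hilbert space approximation result (cf. Jones, 1992 and Barron, 1993) to Banach spaces.

To prove Theorem 5(c), it suffices to show that for every $\epsilon>0$, there exists a sufficiently large $m(\epsilon)\in\mathbb{N}$ so that for all $m\ge m(\epsilon)$, there exists an $h_m^g\in\mathcal{M}_m^g$ such that $\|f-h_m^g\|_{\mathcal{L}_p}<\epsilon$. Begin by applying Corollary 1 to obtain a $k_\epsilon$, such that

$$\|f-g_k\star f\|_{\mathcal{L}_p}<\frac{\epsilon}{2}\tag{13}$$

for all $k\ge k_\epsilon$.

For some pdf $g$ and fixed $k\in\mathbb{N}$, let us define the class

$$\mathcal{G}_k^g=\left\{h:h(x)=k^{n}g(kx-k\mu),\ \mu\in\mathbb{R}^n\right\},$$

write the $m$-point convex hull of $\mathcal{G}_k^g$ as

$$\mathrm{Conv}_m(\mathcal{G}_k^g)=\left\{h:h=\sum_{i=1}^{m}c_ig_i,\ g_i\in\mathcal{G}_k^g,\ c\in\mathbb{S}^{m-1},\ i\in[m]\right\},$$

and call $\mathrm{Conv}(\mathcal{G}_k^g)=\bigcup_{m\in\mathbb{N}}\mathrm{Conv}_m(\mathcal{G}_k^g)$ the convex hull of $\mathcal{G}_k^g$. We further say that $\overline{\mathrm{Conv}}(\mathcal{G}_k^g)$ is the closure of $\mathrm{Conv}(\mathcal{G}_k^g)$.

Because $g$ is a pdf, $g\in\mathcal{C}_0\subset\mathcal{C}_b$, and $\mathcal{C}_b\subset\mathcal{L}_\infty$, we observe that $g\in\mathcal{L}_1\cap\mathcal{L}_\infty$. Thus, $g\in\mathcal{L}_p$, for any $1<p<\infty$, by Lemma 9. Since $g$ is a pdf and $f\in\mathcal{L}_p$, we have the existence of $g_k\star f$ and the fact that $\|g_k\star f\|_{\mathcal{L}_p}$ is finite, by Lemma 10.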

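Furthermore, for any $\psi\in\mathcal{G}_k^g$, since $g\in\mathcal{L}_p$ and by definition of $\mathcal{G}_k^g$, the substitution $z=kx-k\mu$ gives $\|\psi\|_{\mathcal{L}_p}=k^{n-n/p}\|g\|_{\mathcal{L}_p}$. Thus, we have

$$\|\psi-g_k\star f\|_{\mathcal{L}_p}\le\|\psi\|_{\mathcal{L}_p}+\|g_k\star f\|_{\mathcal{L}_p}\le K,\tag{14}$$

by choosing $K=k^{n-n/p}\|g\|_{\mathcal{L}_p}+\|g_k\star f\|_{\mathcal{L}_p}>0$.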

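Following van de Geer (2003), we can write the closure of $\mathrm{Conv}(\mathcal{G}_k^g)$ as

$$\overline{\mathrm{Conv}}(\mathcal{G}_k^g)=\left\{h:h(x)=\int k^{n}g(kx-k\mu)f(\mu)\,\mathrm{d}\lambda(\mu),\ f\ \text{is a pdf}\right\},$$

and thus we immediately have $g_k\star f\in\overline{\mathrm{Conv}}(\mathcal{G}_k^g)$. Combined with (14), we can apply Lemma 11 to obtain the conclusion that there exists a function $h_m^g\in\mathrm{Conv}_m(\mathcal{G}_{k_\epsilon}^g)\subset\mathcal{M}_m^g$, such that

$$\|h_m^g-g_{k_\epsilon}\star f\|_{\mathcal{L}_p}\le\frac{KC_p}{m^{1-1/\alpha}},$$

where $\alpha=\min\{p,2\}$ and $C_p$ is a finite constant. Since $p>1$, $m^{1-1/\alpha}$ is strictly increasing in $m$, and hence we can choose an $m_\epsilon\in\mathbb{N}$, such that for all $m\ge m_\epsilon$,

$$\|h_m^g-g_{k_\epsilon}\star f\|_{\mathcal{L}_p}\le\frac{\epsilon}{2}.\tag{15}$$

The proof is then completed by combining (13) and (15) via the triangle inequality.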

2.5. Proof of Theorem 5(d) and Theorem 5(e)

By Theorem 5(a), there exists a sequence $\{h_m^g\}$ that uniformly converges to $f$, as $m\to\infty$. Thus, by Lemma 12, $\{h_m^g\}$ almost uniformly converges to $f$ and also converges almost everywhere to $f$, with respect to any measure $\nu$. We prove Theorem 5(d) by setting $\nu=\lambda$, and we prove Theorem 5(e) by not specifying $\nu$.

2.6. Proof of Theorem 5(f)

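It suffices to show that for any $\epsilon>0$, there exists a sufficiently large $m(\epsilon)\in\mathbb{N}$ so that for all $m\ge m(\epsilon)$, there exists an $h_m^g\in\mathcal{M}_m^g$, where $g\in\mathcal{V}$, such that $\|f-h_m^g\|_{\mathcal{L}_1}<\epsilon$. Begin by applying Lemma 4 in order to find an $r_\epsilon\in\mathbb{N}$, for any $\epsilon>0$, such that for all $r\ge r_\epsilon$,

$$\|f-\mathbf{1}_{\bar{\mathbb{B}}_r}f\|_{\mathcal{L}_1}\le\frac{\epsilon}{24}<\frac{\epsilon}{2},\tag{16}$$

where $0\le\mathbf{1}_{\bar{\mathbb{B}}_r}f\le f$, and $\mathbf{1}_{\bar{\mathbb{B}}_r}f\in\mathcal{C}_c$ with compact support $\bar{\mathbb{B}}_r$.

Let $\mathbb{K}=\bar{\mathbb{B}}_r$ and apply the triangle inequality to obtain

$$\|f-h_m^g\|_{\mathcal{L}_1}\le\|f-\mathbf{1}_{\mathbb{K}}f\|_{\mathcal{L}_1}+\|\mathbf{1}_{\mathbb{K}}f-h_m^g\|_{\mathcal{L}_1}\le\frac{\epsilon}{2}+\|\mathbf{1}_{\mathbb{K}}f-h_m^g\|_{\mathcal{L}_1}.$$

Hence, we need to show that there exists a function $h_m^g\in\mathcal{M}_m^g$, such that

$$\|\mathbf{1}_{\mathbb{K}}f-h_m^g\|_{\mathcal{L}_1}\le\frac{\epsilon}{2}.$$

Since $g\in\mathcal{V}$ and $g_k(x)=k^{n}g(kx)$, by substitution, we have

$$g_k(x)\le\frac{\beta k^{-\theta}}{\left(k^{-1}+\|x\|_2\right)^{n+\theta}},\tag{17}$$

where $\beta,\theta>0$ are independent of $k$. By Lemma 5 and Corollary 1, we can obtain a $k_1(\epsilon)$, such that for all $k\ge k_1(\epsilon)$,

$$\|\mathbf{1}_{\mathbb{K}}f-g_k\star(\mathbf{1}_{\mathbb{K}}f)\|_{\mathcal{L}_1}\le\frac{\epsilon}{4}.\tag{18}$$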

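Suppose that $0<\gamma<1$ (so that the tail bound $\beta A_nk^{\theta(\gamma-1)}/\theta$ obtained below vanishes as $k\to\infty$) and let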

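$$\mathbb{K}_k=\left\{x\in\mathbb{R}^n:\mathrm{dist}(x,\mathbb{K})\le k^{-\gamma}\right\},$$

where

$$\mathrm{dist}(x,\mathbb{X})=\inf\left\{\|x-y\|_2:y\in\mathbb{X}\right\}.$$

By construction, $\lambda(\mathbb{K}_k)=\lambda(\mathbb{K})+O(k^{-\gamma})$ and thus there exists a $k_2$ such that $\lambda(\mathbb{K}_k)\le\lambda(\mathbb{K})+1$, for any $k\ge k_2$.

For any $k>k_2$, we can show that

$$\|g_k\star(\mathbf{1}_{\mathbb{K}}f)-h_{m-1}^g\|_{\mathcal{L}_1(\mathbb{K}_k)}<\frac{\epsilon}{8}.\tag{19}$$

To do so, firstly, for any $x\in\mathbb{R}^n$,

$$\begin{aligned}\left(g_k\star(\mathbf{1}_{\mathbb{K}}f)\right)(x)&=\int\mathbf{1}_{\mathbb{K}}(y)g_k(x-y)f(y)\,\mathrm{d}\lambda(y)\\&=\int\mathbf{1}_{k\mathbb{K}}(z)g(kx-z)f\left(\frac{z}{k}\right)\mathrm{d}\lambda(z).\end{aligned}$$

To obtain a Riemann sum approximation of $g_k\star(\mathbf{1}_{\mathbb{K}}f)$, we use an argument analogous to that of Lemma 3. That is, we partition $k\mathbb{K}$ into $m-1$ disjoint Borel sets $\Pi_m=\{\mathbb{A}_1,\dots,\mathbb{A}_{m-1}\}$, and we approximate $g_k\star(\mathbf{1}_{\mathbb{K}}f)$ by an $h_{m-1}^g\in\mathcal{M}_{m-1}^g$, where for each $i\in[m-1]$, $k_i=k$, $z_i\in\mathbb{A}_i$, and

$$c_i=k^{-n}\int\mathbf{1}_{\mathbb{A}_i}(z)f\left(\frac{z}{k}\right)\mathrm{d}\lambda(z).$$

Define $k_m\in\mathbb{R}_+$, $z_m\in\mathbb{R}^n$, and $c_m=1-\sum_{i=1}^{m-1}c_i$, where

$$c_m=\int f\,\mathrm{d}\lambda-\int\mathbf{1}_{\mathbb{K}}f\,\mathrm{d}\lambda=\|f-\mathbf{1}_{\mathbb{K}}f\|_{\mathcal{L}_1}\le\frac{\epsilon}{24}\tag{20}$$

by (16). Then, by a similar argument to Lemma 3, $c_i\ge 0$ for all $i\in[m]$ and $\sum_{i=1}^{m}c_i=1$. Thus, we may define an element $h_m^g\in\mathcal{M}_m^g$ via the parameters above.

For sufficiently large $k\ge k_2$, we use Lemma 3 to show that

$$\|g_k\star(\mathbf{1}_{\mathbb{K}}f)-h_{m-1}^g\|_{\mathcal{L}_\infty(\mathbb{K}_k)}<\frac{\epsilon}{8\left[\lambda(\mathbb{K})+1\right]},$$

which implies

$$\|g_k\star(\mathbf{1}_{\mathbb{K}}f)-h_{m-1}^g\|_{\mathcal{L}_1(\mathbb{K}_k)}<\int\mathbf{1}_{\mathbb{K}_k}\frac{\epsilon}{8\left[\lambda(\mathbb{K})+1\right]}\mathrm{d}\lambda=\frac{\epsilon\lambda(\mathbb{K}_k)}{8\left[\lambda(\mathbb{K})+1\right]}\le\frac{\epsilon}{8},\tag{21}$$

and thus (19) is proved. Using (19), we write

$$\begin{aligned}\|g_k\star(\mathbf{1}_{\mathbb{K}}f)-h_m^g\|_{\mathcal{L}_1}&=\|g_k\star(\mathbf{1}_{\mathbb{K}}f)-h_{m-1}^g-c_mk_m^{n}g(k_mx-z_m)\|_{\mathcal{L}_1}\\&\le\|g_k\star(\mathbf{1}_{\mathbb{K}}f)-h_{m-1}^g\|_{\mathcal{L}_1(\mathbb{K}_k)}+\|g_k\star(\mathbf{1}_{\mathbb{K}}f)-h_{m-1}^g\|_{\mathcal{L}_1(\mathbb{K}_k^\complement)}+\|c_mk_m^{n}g(k_mx-z_m)\|_{\mathcal{L}_1}\\&\le\frac{\epsilon}{8}+c_m+\|g_k\star(\mathbf{1}_{\mathbb{K}}f)\|_{\mathcal{L}_1(\mathbb{K}_k^\complement)}+\|h_{m-1}^g\|_{\mathcal{L}_1(\mathbb{K}_k^\complement)},\end{aligned}$$

where $\|c_mk_m^{n}g(k_mx-z_m)\|_{\mathcal{L}_1}\le c_m$ since $k_m^{n}g(k_mx-z_m)$ is a pdf. The aim is now to prove that

$$\|g_k\star(\mathbf{1}_{\mathbb{K}}f)\|_{\mathcal{L}_1(\mathbb{K}_k^\complement)}<\frac{\epsilon}{24}\quad\text{and}\quad\|h_{m-1}^g\|_{\mathcal{L}_1(\mathbb{K}_k^\complement)}<\frac{\epsilon}{24}.$$

Using polar coordinates and (17), we have

$$\begin{aligned}\int\mathbf{1}_{\{x:\|x-y\|_2>k^{-\gamma}\}}g_k(x-y)\,\mathrm{d}\lambda(x)&\le\int\mathbf{1}_{\{x:\|x-y\|_2>k^{-\gamma}\}}\frac{\beta k^{-\theta}}{\left(k^{-1}+\|x-y\|_2\right)^{n+\theta}}\mathrm{d}\lambda(x)\\&=\beta A_nk^{-\theta}\int\mathbf{1}_{(k^{-\gamma},\infty)}(r)\frac{r^{n-1}}{\left(k^{-1}+r\right)^{n+\theta}}\mathrm{d}\lambda(r)\\&\le\beta A_nk^{-\theta}\int\mathbf{1}_{(k^{-\gamma},\infty)}(r)\,r^{-\theta-1}\,\mathrm{d}\lambda(r)\\&=\beta A_nk^{\theta(\gamma-1)}/\theta,\end{aligned}$$

where $A_n$ is the surface area of a unit sphere embedded in $\mathbb{R}^n$. We then have

$$\begin{aligned}\|g_k\star(\mathbf{1}_{\mathbb{K}}f)\|_{\mathcal{L}_1(\mathbb{K}_k^\complement)}&=\int\mathbf{1}_{\{y\in\mathbb{K}\}}\int\mathbf{1}_{\{x\in\mathbb{K}_k^\complement\}}f(y)g_k(x-y)\,\mathrm{d}\lambda(x)\,\mathrm{d}\lambda(y)\\&\le\|\mathbf{1}_{\mathbb{K}}f\|_{\mathcal{L}_\infty}\int\mathbf{1}_{\{y\in\mathbb{K}\}}\int\mathbf{1}_{\{x:\|x-y\|_2>k^{-\gamma}\}}\frac{\beta k^{-\theta}}{\left(k^{-1}+\|x-y\|_2\right)^{n+\theta}}\mathrm{d}\lambda(x)\,\mathrm{d}\lambda(y)\\&\le\|\mathbf{1}_{\mathbb{K}}f\|_{\mathcal{L}_\infty}\lambda(\mathbb{K})\beta A_nk^{\theta(\gamma-1)}/\theta,\end{aligned}$$

which implies that we can choose a $k_3\in\mathbb{N}$, such that for all $k\ge k_3$,

$$\|g_k\star(\mathbf{1}_{\mathbb{K}}f)\|_{\mathcal{L}_1(\mathbb{K}_k^\complement)}<\frac{\epsilon}{24}.\tag{22}$$

Lastly, we write

$$\begin{aligned}\|h_{m-1}^g\|_{\mathcal{L}_1(\mathbb{K}_k^\complement)}&=\int\mathbf{1}_{\mathbb{K}_k^\complement}\sum_{i=1}^{m-1}c_ik^{n}g(kx-z_i)\,\mathrm{d}\lambda\\&=\sum_{i=1}^{m-1}\int\mathbf{1}_{\mathbb{K}_k^\complement}\left[k^{-n}\int\mathbf{1}_{\mathbb{A}_i}(z)f\left(\frac{z}{k}\right)\mathrm{d}\lambda(z)\right]k^{n}g(kx-z_i)\,\mathrm{d}\lambda(x)\\&\le\|\mathbf{1}_{\mathbb{K}}f\|_{\mathcal{L}_\infty}\sum_{i=1}^{m-1}k^{-n}\lambda(\mathbb{A}_i)\int\mathbf{1}_{\mathbb{K}_k^\complement}g_k\left(x-\frac{z_i}{k}\right)\mathrm{d}\lambda\\&\le\|\mathbf{1}_{\mathbb{K}}f\|_{\mathcal{L}_\infty}\sum_{i=1}^{m-1}k^{-n}\lambda(\mathbb{A}_i)\beta A_nk^{\theta(\gamma-1)}/\theta\\&\le\|\mathbf{1}_{\mathbb{K}}f\|_{\mathcal{L}_\infty}\lambda(\mathbb{K})\beta A_nk^{\theta(\gamma-1)}/\theta,\end{aligned}$$

which implies that we can choose the same $k_3$ as above to obtain the bound

$$\|h_{m-1}^g\|_{\mathcal{L}_1(\mathbb{K}_k^\complement)}<\frac{\epsilon}{24},\tag{23}$$

for any $k\ge k_3$.

Thus, we obtain the bound $\|\mathbf{1}_{\mathbb{K}}f-h_m^g\|_{\mathcal{L}_1}<\epsilon/2$, for all $k\ge\max\{k_1,k_2,k_3\}$, by combining (18), (19), (20), (21), (22), and (23), via the triangle inequality. The result is proved by combining the bound above with (16), for an appropriately large $r_\epsilon\in\mathbb{N}$.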

3. Comments and discussion

3.1. Relationship to Theorem 1

In the proof of Theorem 1, the famous Hilbert space approximation result of Jones (1992) and Barron (1993) was used to bound the $\mathcal{L}_2$ norm between any approximand $f\in\mathcal{L}_2$ and a convex combination of bounded functions in $\mathcal{L}_2$. This approximation theorem is exactly the $p=2$ case of the more general theorem of Donahue et al. (1997), as presented in Lemma 11. Thus, one can view Theorem 5(c) as the $p\in(1,\infty)$ generalization of Theorem 1.

3.2. The class $\mathcal{W}$ is a proper subset of the class $\mathcal{C}_0$

Here, we comment on the nature of the class $\mathcal{W}$, which was investigated by Bacharoglou (2010) and Nestoridis et al. (2011). We recall that Bacharoglou (2010) conjectured that Theorem 4 generalizes from $g=\phi$ to $g\in\mathcal{V}$. In Theorem 5(a)–(e), we assume that $g\in\mathcal{C}_0$. We can demonstrate that $g\in\mathcal{C}_0$ is a strictly weaker condition than $g\in\mathcal{V}$ or $g\in\mathcal{W}$.

For example, consider the function $g:\mathbb{R}\to\mathbb{R}$ such that $g(x)=0$ if $x<0$ and

$$g(x)=\sum_{i=1}^{\infty}\frac{2^{2i}}{i}\left[(x-i+1)^{2i}\mathbf{1}_{\{i-1\le x<i-1/2\}}+(x-i)^{2i}\mathbf{1}_{\{i-1/2\le x<i\}}\right],\quad\text{if}\ x\ge 0,$$

and note that

$$\int\mathbf{1}_{[-1/2,1/2]}\frac{(2x)^{2i}}{i}\,\mathrm{d}\lambda=\frac{1}{2i^{2}+i}<\frac{1}{i^{2}}.$$

Since $\sum_{i=1}^{\infty}1/i^{2}=\pi^{2}/6$, $g\in\mathcal{L}_1$. Furthermore, $g$ is continuous, since its pieces agree at every junction point $x=i-1$ and $x=i-1/2$. In $\mathbb{R}$, $g\in\mathcal{C}_0$ if

$$\lim_{x\to\pm\infty}g(x)=0.$$

For $x\le 0$, we observe that $g=0$ and thus the left limit is satisfied. On the right, for any $1/\epsilon>0$, we may take $x_\epsilon\ge\lceil\epsilon\rceil-1/2$, so that $g(x)<1/\epsilon$, for all $x>x_\epsilon$, where $\lceil\cdot\rceil$ is the ceiling operator. Therefore, $g\in\mathcal{C}_0$.

Within each interval $i-1\le x<i$, we observe that $g$ is locally maximized at $x=i-1/2$. The local maximum corresponding to each of these points is $1/i$. Thus $g\notin\mathcal{W}$, since

$$\sum_{i=1}^{\infty}\frac{1}{i}\le\sum_{y\in\mathbb{Z}}\sup_{x\in[0,1]}|g(x+y)|,$$

where $\sum_{i=1}^{\infty}1/i=\infty$. Furthermore, $g\notin\mathcal{V}$ since $\mathcal{V}\subset\mathcal{W}$.

3.3. Convergence in measure

Along with the conclusions of Theorem 5(d) and (e), Lemma 12 also implies convergence in measure. That is, if $\nu$ is a $\sigma$-finite Borel measure on $\mathbb{R}^n$, then for any $\nu$-measurable $f$, there exists a sequence $\{h_m^g\}$, such that for any $\epsilon>0$,

$$\lim_{m\to\infty}\nu\left(\left\{x\in\mathbb{R}^n:|f(x)-h_m^g(x)|\ge\epsilon\right\}\right)=0.$$
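The counterexample can be explored numerically. The sketch below is an illustration under the assumption that, on each interval $[i-1,i)$, the function $g$ is the two-sided power bump $2^{2i}(x-i+1)^{2i}/i$ to the left of $i-1/2$ and $2^{2i}(x-i)^{2i}/i$ to the right, which is the reading consistent with the stated peak height $1/i$ and bump mass $1/(2i^2+i)$. Function names and the quadrature resolution are hypothetical choices.

```python
import math

def g(x):
    """Bump function of Section 3.2 (n = 1): zero for x < 0, peak 1/i at i - 1/2."""
    if x < 0:
        return 0.0
    i = math.floor(x) + 1          # index of the interval [i - 1, i) containing x
    if x < i - 0.5:
        base = x - i + 1           # distance from the left endpoint i - 1
    else:
        base = i - x               # distance to the right endpoint i
    return (2.0 * base) ** (2 * i) / i

def bump_mass(i, steps=20000):
    """Midpoint-rule integral of g over [i - 1, i)."""
    width = 1.0 / steps
    return sum(g(i - 1 + (j + 0.5) * width) for j in range(steps)) * width

# Peak heights 1/i vanish (so g is in C_0) but are not summable (so g is not in W).
peaks = [g(i - 0.5) for i in range(1, 6)]
# Bump masses 1/(2 i^2 + i) are summable (so g is in L_1).
masses = [bump_mass(i) for i in range(1, 6)]
# A partial harmonic sum, which grows without bound as more terms are added.
harmonic_partial = sum(1.0 / i for i in range(1, 10001))
```

The contrast between the summable bump masses and the non-summable peak heights is exactly what separates membership of $\mathcal{L}_1\cap\mathcal{C}_0$ from membership of $\mathcal{W}$.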

A Technical results

Throughout the main text, we utilize a number of established technical results. For the convenience of the reader, we append these results within this Appendix. Sources from which we draw the unproved results are provided at the end of the section.

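Lemma 5. Let $\{g_k\}$ be a sequence of pdfs in $\mathcal{L}_1$ such that, for every $\delta>0$,

$$\lim_{k\to\infty}\int\mathbf{1}_{\{x:\|x\|_2>\delta\}}g_k\,\mathrm{d}\lambda=0.$$

Then, for all $f\in\mathcal{L}_p$ and $1\le p<\infty$,

$$\lim_{k\to\infty}\|g_k\star f-f\|_{\mathcal{L}_p}=0.$$

Furthermore, for all $f\in\mathcal{C}_b$ and any compact $\mathbb{K}\subset\mathbb{R}^n$,

$$\lim_{k\to\infty}\|g_k\star f-f\|_{\mathcal{L}_\infty(\mathbb{K})}=0.$$

The sequences $\{g_k\}$ from Lemma 5 are often called approximate identities or approximations of the identity. A simple construction of approximate identities is by taking dilations $g_k(x)=k^{n}g(kx)$, which yields the following corollary.

Corollary 1. Let $g$ be a pdf. Then the sequence of dilations $\{g_k:g_k(x)=k^{n}g(kx)\}$ satisfies the hypothesis of Lemma 5 and hence permits its conclusion.

Lemma 6. The class $\mathcal{C}_0$ is a subset of $\mathcal{C}_b$. Furthermore, if $f\in\mathcal{C}_0$, then $f$ is uniformly continuous.

Lemma 7 (Urysohn’s Lemma). If $\mathbb{K}\subset\mathbb{R}^n$ is compact, then there exists some $g\in\mathcal{C}_c$, such that $0\le g\le 1$ and $\mathbf{1}_{\mathbb{K}}g=\mathbf{1}_{\mathbb{K}}$.

Lemma 8. If $\mathbb{X}\subset\mathbb{R}^n$ is bounded, then for any $r>0$, $\mathbb{X}$ can be covered by $\bigcup_{i=1}^{m}\mathbb{B}(x_i,r)$ for some finite $m\in\mathbb{N}$, where $x_i\in\mathbb{R}^n$ and $i\in[m]$.

Lemma 9. If $0<p<q<r\le\infty$, then $\mathcal{L}_p\cap\mathcal{L}_r\subset\mathcal{L}_q$.

Let $\Gamma:\mathbb{R}\to\mathbb{R}$ be the usual gamma function, defined as $\Gamma(z)=\int\mathbf{1}_{(0,\infty)}(x)\,x^{z-1}\exp(-x)\,\mathrm{d}\lambda$.

Lemma 10. If $f\in\mathcal{L}_p$ and $g\in\mathcal{L}_1$, for $1\le p\le\infty$, then $f\star g$ exists and we have $\|f\star g\|_{\mathcal{L}_p}\le\|g\|_{\mathcal{L}_1}\|f\|_{\mathcal{L}_p}$.

Lemma 11. Let $\mathcal{G}\subset\mathcal{L}_p$, for some $1\le p<\infty$, and let $f\in\overline{\mathrm{Conv}}(\mathcal{G})$. For any $K>0$, such that $\|f-\psi\|_{\mathcal{L}_p}<K$, for all $\psi\in\mathcal{G}$, there exists an $h_m\in\mathrm{Conv}_m(\mathcal{G})$, such that

$$\|f-h_m\|_{\mathcal{L}_p}\le\frac{C_pK}{m^{1-1/\alpha}},$$

where $\alpha=\min\{p,2\}$, and

$$C_p=\begin{cases}1&\text{if}\ 1\le p\le 2,\\\sqrt{2}\left[\dfrac{\Gamma\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}\right]^{1/p}&\text{if}\ p>2.\end{cases}$$

Lemma 12. With respect to any measure $\nu$, uniform convergence implies almost uniform convergence, and almost uniform convergence implies almost everywhere convergence and convergence in measure.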
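To see how the bound of Lemma 11 translates into a number of components, the following minimal sketch evaluates the constant and the rate. It assumes that the $p>2$ branch of the constant is $\sqrt{2}\left[\Gamma\left(\frac{p+1}{2}\right)/\sqrt{\pi}\right]^{1/p}$, which is the reading that makes $C_p$ continuous at $p=2$; the function names and the example values of $\epsilon$, $p$, and $K$ are hypothetical.

```python
import math

def donahue_constant(p):
    """C_p from Lemma 11: 1 for 1 <= p <= 2, a gamma-function constant for p > 2."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if p <= 2:
        return 1.0
    # Assumed form of the p > 2 branch (see the lead-in above).
    return math.sqrt(2.0) * (math.gamma((p + 1.0) / 2.0) / math.sqrt(math.pi)) ** (1.0 / p)

def rate_bound(m, p, radius):
    """Upper bound C_p * K / m^(1 - 1/alpha), with alpha = min(p, 2) and K = radius."""
    alpha = min(p, 2.0)
    return donahue_constant(p) * radius / m ** (1.0 - 1.0 / alpha)

def components_needed(eps, p, radius):
    """Smallest m with rate_bound(m, p, radius) <= eps (requires p > 1)."""
    if p <= 1:
        raise ValueError("the bound does not decay when p = 1")
    m = 1
    while rate_bound(m, p, radius) > eps:
        m += 1
    return m

# Hypothetical example: p = 2, K = 1, target accuracy 0.11.
m_example = components_needed(0.11, 2.0, 1.0)
```

For $p\ge 2$ the bound decays like $m^{-1/2}$, whereas for $1<p<2$ it decays like $m^{-(1-1/p)}$, and for $p=1$ it does not decay at all, which is consistent with Theorem 5(c) being stated for $1<p<\infty$.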

B Sources of results

Lemma 5 is reported as Theorem 9.3.3 in Makarov & Podkorytov (2013) (see also Theorem 2 of Rudin, 1976, Ch. 20). The proof of Corollary 1 can be taken from that of Theorem 4 of Cheney & Light (2000, Ch. 20). Lemma 6 appears in Conway (2012) as Proposition 1.4.5. Lemma 7 is taken from Corollary 1.2.9 of Conway (2012). Lemma 8 appears as Theorem 1.2.2 in Conway (2012). Lemma 9 can be found in Folland (1999, Prop. 6.10). Lemma 10 can be found in Makarov & Podkorytov (2013, Thm. 9.3.1). Lemma 11 appears as Corollary 2.6 in Donahue et al. (1997). Lemma 12 can be obtained from the definition of almost uniform convergence, Lemma 7.10, and Theorem 7.11 of Bartle (1995).

Acknowledgements

HDN is personally funded by Australian Research Council (ARC) grant DE170101134. HDN and GJM are supported by ARC grant DP180101192. FC is supported by Agence Nationale de la Recherche (ANR) grant SMILES ANR-18-CE40-0014 and by Région Normandie grant RIN.

Additional information

Notes on contributors

Geoffrey J. McLachlan

Mr. Nguyen, Dr. Nguyen, and Profs. Chamroukhi and McLachlan are each keenly interested in the study of finite mixture models and their modifications and extensions for various data analytic and machine learning applications. Their research spans the fields of computational statistics, mathematical statistics, machine learning and AI. Common threads of their research consist of the derivation of algorithms for the efficient estimation of mixture models and extensions, such as mixtures of experts, and mixtures of factor analysers; the derivation of theoretical results regarding the statistical and mathematical properties of such constructions and their estimators; and the application of mixture models to data analytic problems spanning the fields of biology, medical science, engineering, signal processing, among many others.

References

  • Bacharoglou, A. G. (2010). Approximation of probability distributions by convex mixtures of Gaussian measures. Proceedings of the American Mathematical Society, 138, 2619–2628.
  • Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945. https://doi.org/10.1109/18.256500
  • Bartle, R. G. (1995). The Elements of Integration and Lebesgue Measure. Wiley.
  • Cheney, W. & Light, W. (2000). A Course in Approximation Theory. Brooks/Cole.
  • Conway, J. B. (2012). A Course in Abstract Analysis. American Mathematical Society.
  • DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer.
  • De Branges, L. (1959). The Stone-Weierstrass theorem. Proceedings of the American Mathematical Society, 10, 822–824.
  • Devroye, L. & Lugosi, G. (2000). Combinatorial Methods in Density Estimation. Springer.
  • Donahue, M. J., Gurvits, L., Darken, C. & Sontag, E. (1997). Rates of convex approximation in non-Hilbert spaces. Constructive Approximation, 13(2), 187–220. https://doi.org/10.1007/BF02678464
  • Everitt, B. S. & Hand, D. J. (1981). Finite Mixture Distributions. Chapman and Hall.
  • Feichtinger, H. G. (1977). A characterization of Wiener’s algebra on locally compact groups. Archiv der Mathematik, 29(1), 136–140. https://doi.org/10.1007/BF01220386
  • Folland, G. B. (1999). Real Analysis: Modern Techniques and Their Applications. Wiley.
  • Forbes, F. & Wraith, D. (2013). A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweights. Statistics and Computing, 24(6), 971–984. https://doi.org/10.1007/s11222-013-9414-4
  • Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer.
  • Frühwirth-Schnatter, S., Celeux, G., & Robert, C. P. (Eds.). (2019). Handbook of Mixture Analysis. CRC Press.
  • Jones, L. K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics, 20(1), 608–613. https://doi.org/10.1214/aos/1176348546
  • Kullback, S. & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86. https://doi.org/10.1214/aoms/1177729694
  • Lee, S. X. & McLachlan, G. J. (2016). Finite mixtures of canonical fundamental skew t-distributions: the unification of the restricted and unrestricted skew t-mixture models. Statistics and Computing, 26(3), 573–589. https://doi.org/10.1007/s11222-015-9545-x
  • Li, J. Q. & Barron, A. R. (1999). Mixture density estimation. In S. A. Solla, T. K. Leen, & K. R. Mueller (Eds.), Advances in neural information processing systems (Vol. 12, pp. 279–285). MIT Press.
  • Lindsay, B. G. (1995). Mixture models: theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics. Hayward.
  • Makarov, B. & Podkorytov, A. (2013). Real Analysis: Measures, Integrals and Applications. Springer.
  • McLachlan, G. J. & Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker.
  • McLachlan, G. J. & Peel, D. (2000). Finite Mixture Models. Wiley.
  • Mengersen, K. L., Robert, C. & Titterington, M. (2011). Mixtures: Estimation and Applications. Wiley.
  • Nestoridis, V. & Papadimitropoulos, C. (2005). Abstract theory of universal series and an application to Dirichlet series. Comptes Rendus Academy of Science Paris Series I, 341(9), 530–543. https://doi.org/10.1016/j.crma.2005.09.028
  • Nestoridis, V., Schmutzhard, S. & Stefanopoulos, V. (2011). Universal series induced by approximate identities and some relevant applications. Journal of Approximation Theory, 163(12), 1783–1797. https://doi.org/10.1016/j.jat.2011.06.001
  • Nestoridis, V. & Stefanopoulos, V. (2007). Universal series and approximate identities. Technical report.
  • Nguyen, H. D. & McLachlan, G. J. (2019). On approximations via convolution-defined mixture models. Communications in Statistics - Theory and Methods, 48(16), 3945–3955. https://doi.org/10.1080/03610926.2018.1487069
  • Peel, D. & McLachlan, G. J. (2000). Robust mixture modelling using the t distribution. Statistics and Computing, 10(4), 339–348. https://doi.org/10.1023/A:1008981510081
  • Rakhlin, A., Panchenko, D. & Mukherjee, S. (2005). Risk bounds for mixture density estimation. ESAIM: Probability and Statistics, 9, 220–229. https://doi.org/10.1051/ps:2005011
  • Rudin, W. (1976). Principles of Mathematical Analysis. McGraw-Hill.
  • Sandberg, I. W. (2001). Gaussian radial basis functions and inner product space. Circuits, Systems and Signal Processing, 20(6), 635–642. https://doi.org/10.1007/BF01270933
  • Schlattmann, P. (2009). Medical Applications of Finite Mixture Models. Springer.
  • Stone, M. H. (1948). The generalized Weierstrass approximation theorem. Mathematics Magazine, 21, 237–254.
  • Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley.
  • van de Geer, S. (2003). Asymptotic theory for maximum likelihood in nonparametric mixture models. Computational Statistics and Data Analysis, 41(3–4), 453–464. https://doi.org/10.1016/S0167-9473(02)00188-3
  • Walker, J. L., & Ben-Akiva, M. (2011). Advances in discrete choice: mixture models. In A. De Palma, R. Lindsey, E. Quinet, & R. Vickerman (Eds.), A Handbook of Transport Economics (pp. 160–187). Edward Elgar.
  • Xu, Y., Light, W. A., & Cheney, E. W. (1993). Constructive methods of approximation by ridge functions and radial functions. Numerical Algorithms, 4(2), 205–223. https://doi.org/10.1007/BF02144104
  • Yona, G. (2011). Introduction to Computational Proteomics. CRC Press.
  • Zeevi, A. J., & Meir, R. (1997). Density estimation through convex combinations of densities: approximation and estimation bounds. Neural Networks, 10(1), 99–109. https://doi.org/10.1016/S0893-6080(96)00037-8