
Extreme Value Statistics in Semi-Supervised Models

Received 30 Jan 2022, Accepted 10 Mar 2024, Published online: 15 May 2024

Abstract

We consider extreme value analysis in a semi-supervised setting, where we observe, next to the n data on the target variable, n + m data on one or more covariates. This is called the semi-supervised model with n labeled and m unlabeled data. By exploiting the tail dependence between the target variable and the covariates, we derive estimators for the extreme value index and extreme quantiles of the target variable in this setting and establish their asymptotic behavior. Our estimators substantially improve on the univariate estimators based on only the n target variable data in terms of asymptotic variance, whereas the asymptotic biases remain unchanged. A simulation study confirms the substantially improved behavior of both estimators. Finally, the estimation method is applied to rainfall data in France. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

1 Introduction

The semi-supervised model, initially introduced in machine learning, deals with unbalanced datasets, where the labeled data are harder (more expensive or more time consuming) to obtain than the unlabeled data. Consider a dataset with one variable of interest, sometimes referred to as the target variable or outcome variable, and one or more covariates. The difficulty in collecting labeled data stems from observing the target variable, whereas unlabeled data containing only the covariates, that is, with the target variable missing, can be collected easily. Semi-supervised learning focuses on uncovering the (nonlinear) relation between the target variable and the covariates. Estimates and predictions based on such relations and using the additional unlabeled data often show substantially improved performance. For example, for classification analysis see Vapnik (2013) and Zhu and Goldberg (2009); for regression analysis see Wasserman and Lafferty (2008), Azriel et al. (2022), and Chakrabortty and Cai (2018).

Semi-supervised inference aims at estimating parameters or quantities regarding the target variable in the semi-supervised model. Zhang, Brown, and Cai (2019) investigate the general semi-supervised framework and show how to use the unlabeled data to improve the estimation of the mean of the target variable; for inference on heavy tailed distributions in this framework, see Ahmed and Einmahl (2019).

Extreme value statistics deals with the estimation of parameters or quantities related to the tail of a distribution, making only semi-parametric assumptions on this tail. Consequently, most extreme value methods start with a relatively large number of observations n, but select only k ≪ n extreme observations from the full sample for statistical inference. Two techniques are often used for selecting the extreme observations: the peaks-over-threshold (POT) approach, which selects the highest k observations, and the block maxima (BM) approach, which splits the full sample into k blocks and selects the maximum of each block. Since only k observations are used in estimation, consistent estimators typically have a speed of convergence of $1/\sqrt{k}$. In practice, to obtain accurate estimators for tail parameters/quantities, one needs a sample with a relatively large sample size n to guarantee a sufficient number of extreme observations. In contrast, the semi-supervised model is greatly suitable for statistics of extremes in case data on the target variable are hard to obtain.
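To make the two selection schemes concrete, here is a minimal sketch in Python (NumPy assumed; the simulated sample and the choice of k are purely illustrative and not taken from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.pareto(2.0, size=2000)   # illustrative heavy-tailed sample

k = 100

# Peaks-over-threshold (POT): keep the k largest observations of the full sample.
pot_extremes = np.sort(x)[-k:]

# Block maxima (BM): split the sample into k blocks and keep each block maximum.
blocks = np.array_split(x, k)
block_maxima = np.array([b.max() for b in blocks])

print(pot_extremes.shape, block_maxima.shape)   # (100,) (100,)
```

Both schemes reduce the effective sample size from n to k, which is why the resulting estimators converge at the rate $1/\sqrt{k}$ rather than $1/\sqrt{n}$.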

The main goal of this article is twofold. First, we derive in this semi-supervised setting a new, improved pseudo-maximum likelihood estimator (MLE) for a general extreme value index γ and establish its asymptotic behavior. This extreme value index describes the tail heaviness of a probability distribution. If γ > 0 the distribution is heavy tailed and has an infinite right endpoint, if γ = 0 the distribution is light tailed and may have an infinite or finite endpoint, and if γ < 0 the endpoint is finite; see, for example, Beirlant et al. (2004) or de Haan and Ferreira (2006) for a thorough treatment of extreme value theory and the corresponding statistical inference. Second, we use the adapted estimator of γ, together with an adapted estimator of the so-called scale, to obtain an improved estimator of an extreme quantile in the semi-supervised model. We establish the asymptotic behavior of this adapted extreme quantile estimator. Simulations show that the improvement for the new extreme quantile estimator is actually larger than that for the new estimator of γ.

For ease of explanation of our novel estimator of the extreme value index, let us assume that there is only one covariate. We first estimate γ for the variable of interest (that is, ignoring the covariate) using the pseudo-MLE γ̂, see Smith (1987) and Drees, Ferreira, and de Haan (2004). Then, we choose a number g and transform the labeled covariate data empirically (using all the labeled and unlabeled data) such that they obtain an artificial extreme value index g. Using the transformed covariates of the labeled data, we estimate the known g by the pseudo-MLE ĝ, say, and use the difference ĝ − g to adapt and substantially improve the initial estimator γ̂ of the extreme value index γ of the variable of interest. For this adaptation the tail dependence between the target variable and the covariate is crucial.

The specific contributions of this article are as follows. First, we provide a general result for the relevant, nonstandard tail quantile process (see Lemma 6.5). Based on this tail quantile process result, one could improve other estimators based on the POT approach in the semi-supervised model. Also, we impose no assumptions on the tail of the covariates. When analyzing the tail of the target variable, it is crucial to assume regularity in its tail, such as the max-domain of attraction condition in extreme value analysis. However, requiring such conditions for the covariates can be restrictive in applications. Finally, our main results are valid for a broad class of distributions for the target variable: we deal with a general extreme value index γ, which can be positive, zero, or negative. This is particularly important for applications where the sign of γ is not known beforehand. For example, when analyzing extreme weather, various studies find that the extreme value index is around zero for different meteorological variables: for hourly surge level on the English east coast (Coles and Tawn 1991), for hourly maximum wind speed in Sheffield, UK (Coles and Walshaw 1994), for wave height and still water level on the Dutch coast (de Haan and de Ronde 1998), and for daily rainfall in North Holland, The Netherlands (Buishand, de Haan, and Zhou 2008).

This article is organized as follows. In Section 2, for clarity of exposition, we first introduce our adapted estimator for the extreme value index in the semi-supervised model with one covariate and establish its asymptotic normality. Subsequently we introduce the adapted estimator of an extreme quantile and analyze its asymptotic behavior. In Section 3 we consider the general multivariate semi-supervised setting, present the adapted estimator of the extreme value index, and establish its asymptotic normality. Section 4 is devoted to a simulation study, where the improved performance, in terms of variance, of the adapted estimators compared with the initial estimators is shown. An application to rainfall in France can be found in Section 5, and the detailed proofs are deferred to Section 6 and the supplementary material.

2 Main Results: One Covariate

2.1 Estimation of the Extreme Value Index

Let F be a bivariate distribution with marginals $F_1$ and $F_2$. We assume that $F_1$ is in the max-domain of attraction of an extreme value distribution $G_\gamma$, where γ is the extreme value index, our parameter of interest. Let the pairs $(X_1,Y_1),\ldots,(X_n,Y_n)$ be a random sample from F, and let $(Y_{n+1},\ldots,Y_{n+m})$ be a random sample from $F_2$, independent of the n pairs. This is the semi-supervised model. Assume that the tail copula R of $(X_1,Y_1)$ exists:
$$R(x,y)=\lim_{t\downarrow 0}\frac{1}{t}\,P\bigl(1-F_1(X_1)\le tx,\ 1-F_2(Y_1)\le ty\bigr),\qquad (x,y)\in[0,\infty]^2\setminus\{(\infty,\infty)\}.\tag{1}$$

Denote the order statistics of $X_i,\ i=1,\ldots,n$, by $X_{1:n}\le\cdots\le X_{n:n}$, and similarly for the $Y_i,\ i=1,\ldots,n$. We estimate $\gamma>-\frac12$ with the often used pseudo-MLE γ̂ based on $X_{n-k:n},\ldots,X_{n:n}$, for $k\in\{1,\ldots,n-1\}$; see sec. 3.4 in de Haan and Ferreira (2006).

Define for $i=1,\ldots,n$,
$$\tilde Y_i=\begin{cases}\dfrac{\bigl(1-\bigl(F_{n+m}(Y_i)-\frac{1}{2(n+m)}\bigr)\bigr)^{-g}-1}{g}, & g\ne 0,\\[1.5ex] -\log\Bigl(1-\bigl(F_{n+m}(Y_i)-\tfrac{1}{2(n+m)}\bigr)\Bigr), & g=0,\end{cases}\tag{2}$$
where $F_{n+m}$ is the empirical distribution function based on $Y_l,\ l=1,\ldots,n+m$, and $g>-\frac12$ is a number we may choose that mimics an extreme value index. Let the order statistics of $\tilde Y_i,\ i=1,\ldots,n$, be denoted by $\tilde Y_{1:n}\le\cdots\le\tilde Y_{n:n}$, and let ĝ be the pseudo-MLE of g based on $\tilde Y_{n-k:n},\ldots,\tilde Y_{n:n}$, using the same k as before. Of course, since we choose and hence know g, there is no direct need to estimate it. We will show below, however, that the dependence between the difference ĝ − g and γ̂ helps to improve the estimator of γ in the semi-supervised setting.
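A minimal sketch of the transformation in (2), assuming NumPy; the array names and the helper function are illustrative, not taken from the article's supplementary code:

```python
import numpy as np

def transform_covariate(y_labeled, y_all, g):
    """Empirically transform the n labeled covariate values so that they mimic
    an extreme value index g, as in (2); y_all holds all n + m covariate values."""
    n_plus_m = len(y_all)
    # Empirical distribution function based on all n + m covariate values,
    # evaluated at the labeled observations.
    F = np.searchsorted(np.sort(y_all), y_labeled, side="right") / n_plus_m
    u = 1.0 - (F - 1.0 / (2.0 * n_plus_m))
    if g == 0.0:
        return -np.log(u)
    return (u ** (-g) - 1.0) / g
```

The small shift 1/(2(n + m)) keeps the argument of the power (or logarithm) strictly positive, even for the largest observation.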

For the asymptotic theory, we assume that $m=m(n)$ and
$$k\to\infty,\qquad \frac{k}{n}\to 0,\qquad \frac{n}{n+m}\to\nu^2\in(0,1),\qquad\text{as } n\to\infty.\tag{3}$$

We begin with establishing the joint asymptotic normality of γ̂ and ĝ, a crucial result for deriving and showing asymptotic normality of our semi-supervised estimator (SSE) of γ. For that purpose we need the usual second order condition on the marginal distribution $F_1$. Let $U_1=F_1^{-1}(1-1/\cdot)$ be the tail quantile function corresponding to $F_1$. We assume that there exist a positive scale function a, a positive or negative function A with $\lim_{t\to\infty}A(t)=0$, and $\rho\le 0$, such that for $x>0$,
$$\lim_{t\to\infty}\frac{\dfrac{U_1(tx)-U_1(t)}{a(t)}-\dfrac{x^{\gamma}-1}{\gamma}}{A(t)}=\Psi(x),\qquad\gamma\in\mathbb{R},\tag{4}$$
where
$$\Psi(x)=\begin{cases}\dfrac{x^{\gamma+\rho}-1}{\gamma+\rho}, & \rho<0,\\[1.5ex] \dfrac{1}{\gamma}x^{\gamma}\log x, & \gamma\ne\rho=0,\\[1.5ex] \dfrac12\log^2 x, & \gamma=\rho=0,\end{cases}$$
see de Haan and Ferreira (2006), p. 46.

Proposition 2.1.

Assume $\gamma>-\frac12$ and choose $g>-\frac12$. Assume that $F_2$ is continuous, that (1), (3), and (4) hold, and that $\sqrt{k}\,A(n/k)\to\lambda\in\mathbb{R}$, as $n\to\infty$. Then with probability tending to 1, there exist unique maximizers of the likelihood functions based on $\{X_i\}_{i=1}^n$ and $\{\tilde Y_i\}_{i=1}^n$, denoted as $(\hat\gamma,\hat g)$, such that
$$\bigl(\sqrt{k}(\hat\gamma-\gamma),\ \sqrt{k}(\hat g-g)\bigr)\xrightarrow{d}N\!\left(\left[\frac{\lambda(1+\gamma)}{(1-\rho)(1+\gamma-\rho)},\,0\right],\ \Sigma\right)$$
where
$$\Sigma=\begin{bmatrix}(1+\gamma)^2 & (1-\nu^2)(1+\gamma)(1+g)R_g\\[0.5ex] (1-\nu^2)(1+\gamma)(1+g)R_g & (1-\nu^2)(1+g)^2\end{bmatrix},$$
with
$$R_g=R(1,1)+\frac{g-\gamma}{\gamma+g+1}\left((2\gamma+1)\int_0^1\frac{R(s,1)}{s^{1-\gamma}}\,ds-(2g+1)\int_0^1\frac{R(1,t)}{t^{1-g}}\,dt\right).\tag{5}$$

Based on Proposition 2.1, we derive the SSE of γ. For this derivation only, take λ = 0. Then the approximate bivariate normal distribution of $(\hat\gamma,\hat g-g)$ has mean $[\gamma,0]$ and estimated covariance matrix
$$\frac1k\hat\Sigma=\frac1k\begin{bmatrix}(1+\hat\gamma)^2 & \bigl(1-\frac{n}{n+m}\bigr)(1+\hat\gamma)(1+g)\hat R_g\\[0.5ex] \bigl(1-\frac{n}{n+m}\bigr)(1+\hat\gamma)(1+g)\hat R_g & \bigl(1-\frac{n}{n+m}\bigr)(1+g)^2\end{bmatrix},$$
where $\hat R_g$ is the estimator of $R_g$, obtained by replacing γ with γ̂ and the tail copula R, at 3 places, with its natural estimator
$$\hat R(x,y)=\frac1k\sum_{i=1}^n \mathbf{1}\bigl[X_i\ge X_{n-[kx]+1:n},\ Y_i\ge Y_{n-[ky]+1:n}\bigr],\qquad x,y\ge 0,\tag{6}$$
see, for example, Drees and Huang (1998). Maximizing the thus obtained approximate likelihood function of the single "data point" $(\hat\gamma,\hat g-g)$ with respect to the unknown γ, that is, performing a generalized-least-squares type update of γ̂ with weight $\hat\Sigma_{12}/\hat\Sigma_{22}$, we obtain as SSE for γ:
$$\hat\gamma_g=\hat\gamma-\frac{1+\hat\gamma}{1+g}\,\hat R_g\,(\hat g-g).\tag{7}$$
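As an illustration, a minimal Python sketch of the empirical tail copula (6) evaluated at a point and of the adaptation step (7); the function names are illustrative, and γ̂ and ĝ are assumed to have been obtained beforehand by the pseudo-MLE (for g close to γ, Remark 2.1 below suggests that $\hat R(1,1)$ is already a reasonable stand-in for $\hat R_g$; the full $\hat R_g$ additionally requires the two integrals in (5)):

```python
import numpy as np

def tail_copula_hat(x, y, k, s=1.0, t=1.0):
    """Nonparametric tail copula estimator (6) evaluated at the point (s, t)."""
    n = len(x)
    x_thr = np.sort(x)[n - int(k * s)]   # X_{n-[ks]+1:n}, the [ks]-th largest x
    y_thr = np.sort(y)[n - int(k * t)]   # Y_{n-[kt]+1:n}, the [kt]-th largest y
    return np.sum((x >= x_thr) & (y >= y_thr)) / k

def sse_gamma(gamma_hat, g_hat, g, R_g_hat):
    """Semi-supervised estimator (7) of the extreme value index."""
    return gamma_hat - (1.0 + gamma_hat) / (1.0 + g) * R_g_hat * (g_hat - g)
```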

Now we present the main result of this subsection, the asymptotic normality of the SSE of γ.

Theorem 2.1.

Under the conditions of Proposition 2.1, as $n\to\infty$,
$$\sqrt{k}\,(\hat\gamma_g-\gamma)\xrightarrow{d}N\!\left(\frac{\lambda(1+\gamma)}{(1-\rho)(1+\gamma-\rho)},\ (1+\gamma)^2\bigl[1-(1-\nu^2)R_g^2\bigr]\right).\tag{8}$$

Remark 2.1.

Note that the asymptotic bias of the SSE $\hat\gamma_g$ is the same as that of the pseudo-MLE γ̂ (in Proposition 2.1). Therefore, when comparing both estimators we can and will focus on the (relative) reduction of the asymptotic variance, which is equal to $(1-\nu^2)R_g^2$. The value of the crucial $R_g\in[-1,1]$ depends on the known g and the unknown γ and R. Note that $R_g$ can indeed be positive, zero, or negative, and $R_g$ can exceed R(1, 1) even when R is symmetric in its arguments. Nevertheless it is appealing to consider g = γ, reducing $R_g$ to R(1, 1). Since γ is unknown, this would lead to the choice g = γ̂. We will not pursue this, however, since simulation results show that this random g leads to much smaller variance reductions than when taking a deterministic g not too far away from γ. Also observe that when g is close to γ and R is symmetric, then $R_g-R(1,1)$ is of order $(g-\gamma)^2$, that is, the variance reduction does not change much with the choice of g and is close to $(1-\nu^2)R^2(1,1)$. We will see in Section 4 through simulations that the variance reduction is substantial in "standard" semi-supervised settings.

Finally, note that if the function R is identically 0 (tail independence), $\hat\gamma_g$ has the same asymptotic behavior as γ̂. If R(x, y) > 0 for some x, y > 0 (tail dependence), then R(1, 1) > 0. In that case, for g = γ we obtain $R_g=R(1,1)$: the asymptotic variance of $\hat\gamma_g$ is smaller than that of γ̂ and it decreases with R(1, 1). In case of tail independence, simulations indicate that in finite samples there can be a variance inflation of up to about 10%; see Section D in the supplementary material. Therefore, it is recommended to apply the method in practice only when the data clearly exhibit tail dependence.

2.2 Estimation of an Extreme Quantile

Let p be a very small positive number. For the asymptotic theory this means that $p=p(n)$ with $\lim_{n\to\infty}p(n)=0$. Then the extreme quantile $x_p$ is defined by $x_p=F_1^{-1}(1-p)=U_1(1/p)$. We first estimate γ and the scale $a(n/k)$ with the pseudo-MLEs γ̂ (as above) and σ̂, based on $X_{n-k:n},\ldots,X_{n:n}$, see, for example, sec. 3.4 in de Haan and Ferreira (2006). The natural estimator of the extreme quantile $x_p$ is then given by (see Dekkers, Einmahl, and de Haan 1989; de Haan and Rootzén 1993)
$$\hat x_p=X_{n-k:n}+\hat\sigma\,\frac{\left(\frac{k}{np}\right)^{\hat\gamma}-1}{\hat\gamma}.\tag{9}$$
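A minimal sketch of (9) in Python, assuming the pseudo-MLEs γ̂ and σ̂ have already been computed (the function name is illustrative):

```python
import numpy as np

def quantile_standard(x, k, p, gamma_hat, sigma_hat):
    """Standard extreme quantile estimator (9) based on the top k observations."""
    n = len(x)
    x_nk = np.sort(x)[n - k - 1]              # threshold X_{n-k:n}
    r = k / (n * p)
    if gamma_hat == 0.0:
        return x_nk + sigma_hat * np.log(r)   # limit of (r**g - 1)/g as g -> 0
    return x_nk + sigma_hat * (r ** gamma_hat - 1.0) / gamma_hat
```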

Using (2), we see that g and $(n/k)^g$ are numbers that mimic an extreme value index and a scale. They are estimated with ĝ and $\tilde\sigma_g$, the pseudo-MLEs based on $\tilde Y_{n-k:n},\ldots,\tilde Y_{n:n}$, for the same k as above. Similar to the derivation of $\hat\gamma_g$, we find an adapted estimator of σ based on the joint asymptotic normality of σ̂ and $\tilde\sigma_g$. For this, write for $g\ne 0$
$$\begin{aligned}S_g={}&(\gamma+2)(g+2)R(1,1)-\frac{g(\gamma+1)}{\gamma}\int_0^1\frac{R(s,1)}{s}\,ds-\frac{\gamma(g+1)}{g}\int_0^1\frac{R(1,t)}{t}\,dt\\
&+\frac{(\gamma+1)^2(g+1)(2\gamma+1)}{\gamma g}\left(\frac{1}{g+1}-\frac{1}{\gamma+1}+\frac{(g+1)(g-\gamma)}{\gamma+g+1}\right)\int_0^1\frac{R(s,1)}{s^{1-\gamma}}\,ds\\
&+\frac{(\gamma+1)(g+1)^2(2g+1)}{\gamma g}\left(\frac{1}{\gamma+1}-\frac{1}{g+1}+\frac{(\gamma+1)(\gamma-g)}{\gamma+g+1}\right)\int_0^1\frac{R(1,t)}{t^{1-g}}\,dt,\end{aligned}$$
$$S_0=2(\gamma+2)R(1,1)+(3\gamma-1)\int_0^1\frac{R(1,t)}{t}\,dt+\gamma\int_0^1\frac{R(1,t)}{t}\log t\,dt-(2\gamma+1)^2\int_0^1\frac{R(s,1)}{s^{1-\gamma}}\,ds,$$
and estimate it with $\hat S_g$, obtained by replacing γ with γ̂ and R with R̂. Now the SSE for the scale σ is defined by
$$\hat\sigma_g=\hat\sigma\left(1-\frac{\hat S_g}{1+(1+g)^2}\left(\frac{\tilde\sigma_g}{(n/k)^g}-1\right)\right).$$

The adapted version of the extreme quantile estimator in (9) is now obtained by plugging in the SSEs of the extreme value index and the scale instead of the pseudo-MLEs:
$$\hat x_p^{\,g}=X_{n-k:n}+\hat\sigma_g\,\frac{\left(\frac{k}{np}\right)^{\hat\gamma_g}-1}{\hat\gamma_g}.\tag{10}$$
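Correspondingly, a sketch of (10) simply reuses the same formula with the semi-supervised components plugged in (assuming the quantile_standard helper sketched after (9) and the SSE components $\hat\gamma_g$ and $\hat\sigma_g$ are available):

```python
def quantile_sse(x, k, p, gamma_sse, sigma_sse):
    """Adapted extreme quantile estimator (10): same form as (9), with the
    semi-supervised index and scale estimators plugged in."""
    return quantile_standard(x, k, p, gamma_sse, sigma_sse)
```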

In order to establish its asymptotic normality, we first present a theorem which states the joint asymptotic normality of the three random components involved in this definition. The proofs of the results in this section are deferred to the supplementary material, Section C.

Theorem 2.2.

Assume $\gamma>-\frac12$ and choose $g>-\frac12$. Assume that $F_2$ is continuous, that (1), (3), and (4) hold, and that $\sqrt{k}\,A(n/k)\to\lambda\in\mathbb{R}$, as $n\to\infty$. Then as $n\to\infty$,
$$\sqrt{k}\left(\hat\gamma_g-\gamma,\ \frac{\hat\sigma_g}{a(n/k)}-1,\ \frac{X_{n-k:n}-U_1(n/k)}{a(n/k)}\right)\xrightarrow{d}N\!\left(\left[\frac{\lambda(\gamma+1)}{(1-\rho)(1+\gamma-\rho)},\ \frac{\rho\lambda}{(1-\rho)(1+\gamma-\rho)},\ 0\right],\ K\right),\tag{11}$$
$$K=\begin{bmatrix}(1+\gamma)^2\bigl[1-(1-\nu^2)R_g^2\bigr] & -(1+\gamma)\bigl[1+(1-\nu^2)Q\bigr] & (1-\nu^2)M_1\\[0.7ex] -(1+\gamma)\bigl[1+(1-\nu^2)Q\bigr] & 1+(1+\gamma)^2-\dfrac{(1-\nu^2)S_g^2}{1+(1+g)^2} & \gamma+(1-\nu^2)M_2\\[0.7ex] (1-\nu^2)M_1 & \gamma+(1-\nu^2)M_2 & 1\end{bmatrix},$$
where
$$Q=\frac{R_gS_g}{1+(1+g)^2}+\frac{1+\gamma}{1+g}R_gQ_1+\frac{S_g}{1+(1+g)^2}Q_2,$$
$R_g$ is as in (5),
$$Q_1:=(1+g)^2\Biggl[\frac{\gamma^2}{(g+1)(\gamma+1)}R(1,1)+\frac{\gamma}{\gamma+1}\int_0^1\frac{R(1,t)}{t}\,\frac{1-t^g}{g}\,dt+\left(\frac{2(g-\gamma)\bigl((\gamma+1)(g+1)+g\bigr)}{(\gamma+1)(g+1)(\gamma+g+1)}+\frac{2g+1}{(g+1)(\gamma+g+1)}\right)\int_0^1\frac{R(1,t)}{t^{1-g}}\,dt+\frac{(2\gamma+1)(\gamma-g)}{(\gamma+g+1)(g+1)}\int_0^1\frac{R(s,1)}{s^{1-\gamma}}\,ds\Biggr],$$
$$Q_2:=(1+\gamma)(1+g)\Biggl[\frac{g^2}{(g+1)(\gamma+1)}R(1,1)+\frac{g}{g+1}\int_0^1\frac{R(s,1)}{s}\,\frac{1-s^\gamma}{\gamma}\,ds+\left(\frac{2(\gamma-g)\bigl((\gamma+1)(g+1)+\gamma\bigr)}{(\gamma+1)(g+1)(\gamma+g+1)}+\frac{2\gamma+1}{(\gamma+1)(\gamma+g+1)}\right)\int_0^1\frac{R(s,1)}{s^{1-\gamma}}\,ds+\frac{(2g+1)(g-\gamma)}{(\gamma+g+1)(\gamma+1)}\int_0^1\frac{R(1,t)}{t^{1-g}}\,dt\Biggr],$$
$$M_1:=(1+\gamma)(1+g)R_g\left[2\int_0^1\frac{R(1,s)}{s^{1-g}}\,ds-\int_0^1\frac{R(1,s)}{s}\,\frac{1-s^g}{g}\,ds-\frac{1}{g+1}R(1,1)\right],$$
$$M_2:=\frac{1+g}{1+(1+g)^2}\,S_g\left[(2g+3)\int_0^1\frac{R(1,s)}{s^{1-g}}\,ds-\int_0^1\frac{R(1,s)}{s}\,\frac{1-s^g}{g}\,ds-\frac{g+2}{g+1}R(1,1)\right].$$

In the expressions of $Q_1,Q_2,M_1$, and $M_2$, when considering g = 0 or γ = 0, terms such as $\frac{1-t^g}{g}$ and $\frac{1-s^\gamma}{\gamma}$ should be read as their corresponding limits ($-\log t$ and $-\log s$, respectively).

From Theorem 2.2, we obtain the main result of this section, the asymptotic normality of the adapted extreme quantile estimator. Define $q_\gamma(t)=\int_1^t s^{\gamma-1}\log s\,ds$, for t > 1.

Theorem 2.3.

Assume $\gamma>-\frac12$ and choose $g>-\frac12$. Assume that $F_2$ is continuous, that (1), (3), and (4) hold, that the second-order parameter ρ is negative, or zero with γ negative, and that $\sqrt{k}\,A(n/k)\to\lambda\in\mathbb{R}$, as $n\to\infty$. Further assume $p=o(k/n)$ and $\log(np)=o(\sqrt{k})$, as $n\to\infty$. Then as $n\to\infty$,
$$\sqrt{k}\,\frac{\hat x_p^{\,g}-x_p}{a(n/k)\,q_\gamma\!\left(\frac{k}{np}\right)}\xrightarrow{d}N\!\left(\frac{\lambda(1+\gamma)}{(1-\rho)(\gamma-\rho+1)}\,b,\ v\right)\tag{12}$$
where b = 1 and $v=(1+\gamma)^2\bigl[1-(1-\nu^2)R_g^2\bigr]$, for $\gamma\ge 0>\rho$, and $b=\rho(1+2\gamma)/(\gamma+\rho)$ and
$$v=(1+\gamma)^2(1+2\gamma)-(1-\nu^2)\left[(1+\gamma)^2R_g^2+\frac{\gamma^2S_g^2}{1+(1+g)^2}-2\gamma(1+\gamma)Q-2\gamma^2M_1+2\gamma^3M_2\right],$$
for γ < 0.

Remark 2.2.

Note that the asymptotic bias is the same as that of the standard extreme quantile estimator $\hat x_p$ in (9), whereas the asymptotic variance is different. In case $\gamma\ge 0$, the relative variance reduction when using the adapted estimator is $(1-\nu^2)R_g^2\ge 0$, which is the same as the relative variance reduction of the adapted extreme value index estimator. For γ < 0 it is not clear if the variance reduction is nonnegative, but in the simulation study below we show substantial variance reductions, irrespective of the sign of γ. A joint variance reduction approach, instead of improving γ̂ and σ̂ separately, would lead to a nonnegative variance reduction for all $\gamma>-\frac12$, but this is beyond the scope of this article.

3 Main Results: Multiple Covariates

In this section we consider the more general situation with d − 1 covariates, where d > 2. Consider a d-variate distribution F, with marginals $F_1,\ldots,F_d$. We assume again that (only) $F_1$ is in the max-domain of attraction of an extreme value distribution $G_\gamma$. Let $F^{-}$ be the distribution function of the last d − 1 components of a random vector with distribution function F. Let $(X_1,Y_{1,2},\ldots,Y_{1,d}),\ldots,(X_n,Y_{n,2},\ldots,Y_{n,d})$ be a random sample of size n from F and let $(Y_{n+1,2},\ldots,Y_{n+1,d}),\ldots,(Y_{n+m,2},\ldots,Y_{n+m,d})$ be a random sample of size m from $F^{-}$, independent of the d-variate random sample of size n. This is the multivariate semi-supervised setting.

Then, for fixed $j=2,\ldots,d$, we use all data for the covariates $\{Y_{i,j}\}_{i=1}^{n+m}$ to obtain $\{\tilde Y_{i,j}\}_{i=1}^{n}$ as in (2), where we may choose a number $g>-\frac12$ that mimics an extreme value index, as before. For $k\in\{1,\ldots,n-1\}$, let, similarly as in the previous section, γ̂ and $\hat g_j$, $j=2,\ldots,d$, be the pseudo-MLEs of γ and (d − 1 times) of g, respectively. Assume the existence of the tail copula $R_{ij}$ of the ith and the jth component,
$$R_{ij}(x,y)=\lim_{t\downarrow 0}\frac1t\,P\bigl(1-F_i(Y_{1,i})\le tx,\ 1-F_j(Y_{1,j})\le ty\bigr),\tag{13}$$

where $(x,y)\in[0,\infty]^2\setminus\{(\infty,\infty)\}$, $1\le i,j\le d$. Here $Y_{1,1}$ is understood as $X_1$. Again, we first consider the joint asymptotic normality of γ̂ and $\hat g_j$, $j=2,\ldots,d$.

Proposition 3.1.

Assume $\gamma>-\frac12$ and choose $g>-\frac12$. Assume that $F_j$, $j=2,\ldots,d$, are continuous, that (3), (4), and (13) hold, and that, as $n\to\infty$, $\sqrt{k}\,A(n/k)\to\lambda\in\mathbb{R}$. Then with probability tending to 1, there exist unique maximizers of the likelihood functions based on $\{X_i\}_{i=1}^n$, $\{\tilde Y_{i,2}\}_{i=1}^n,\ldots,\{\tilde Y_{i,d}\}_{i=1}^n$, denoted as $(\hat\gamma,\hat g_2,\ldots,\hat g_d)$, such that
$$\bigl(\sqrt{k}(\hat\gamma-\gamma),\ \sqrt{k}(\hat g_2-g),\ \ldots,\ \sqrt{k}(\hat g_d-g)\bigr)\xrightarrow{d}N\!\left(\left[\frac{\lambda(\gamma+1)}{(1-\rho)(1+\gamma-\rho)},0,\ldots,0\right],\ \Sigma_d\right),$$
with $\Sigma_d=\Gamma\Gamma^T\circ H$ ("∘" is the Hadamard or entrywise product), where
$$\Gamma=\begin{bmatrix}1+\gamma\\ 1+g\\ \vdots\\ 1+g\end{bmatrix},\qquad H=\begin{bmatrix}1 & h_{12} & \cdots & h_{1d}\\ h_{12} & 1-\nu^2 & \cdots & h_{2d}\\ \vdots & \vdots & \ddots & \vdots\\ h_{1d} & h_{2d} & \cdots & 1-\nu^2\end{bmatrix},$$

$$h_{1i}=(1-\nu^2)\left[R_{1i}(1,1)+\frac{g-\gamma}{\gamma+g+1}\left[(2\gamma+1)\int_0^1\frac{R_{1i}(s,1)}{s^{1-\gamma}}\,ds-(2g+1)\int_0^1\frac{R_{1i}(1,t)}{t^{1-g}}\,dt\right]\right],\quad i=2,\ldots,d,$$
and $h_{ij}=(1-\nu^2)R_{ij}(1,1)$, $i=2,\ldots,d$, $j=i+1,\ldots,d$.

Very similarly to the bivariate case, let λ = 0 and derive the SSE of γ by using the approximate multivariate normal distribution of $(\hat\gamma,\hat g_2-g,\ldots,\hat g_d-g)$, with mean $[\gamma,0,\ldots,0]$ and covariance matrix $\frac1k\hat\Sigma_d=\frac1k\hat\Gamma\hat\Gamma^T\circ\hat H$, where for the estimation of $R_{ij}$, $\hat R_{ij}$ is defined as in (6). By maximizing the approximate likelihood function of $(\hat\gamma,\hat g_2-g,\ldots,\hat g_d-g)$ with respect to γ, we obtain the SSE in this multivariate setting:
$$\hat\gamma_g=\hat\gamma+\frac{1+\hat\gamma}{1+g}\sum_{j=2}^d\frac{\hat H^{-1}_{1j}}{\hat H^{-1}_{11}}\,(\hat g_j-g),\tag{14}$$
where $\hat H^{-1}_{ij}$ is the entry in the ith row and jth column of the inverse of the matrix $\hat H$. The following theorem shows the asymptotic behavior of the improved estimator $\hat\gamma_g$.
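A minimal NumPy sketch of the combination step (14), assuming γ̂, the $\hat g_j$, and the estimated matrix $\hat H$ have already been computed (the function name is illustrative):

```python
import numpy as np

def sse_gamma_multi(gamma_hat, g_hats, g, H_hat):
    """Multivariate semi-supervised estimator (14): combine the d-1 auxiliary
    estimators g_hats using the first row of the inverse of H_hat."""
    H_inv = np.linalg.inv(H_hat)
    weights = H_inv[0, 1:] / H_inv[0, 0]   # H^{-1}_{1j} / H^{-1}_{11}, j = 2,...,d
    correction = (1.0 + gamma_hat) / (1.0 + g) * np.sum(weights * (np.asarray(g_hats) - g))
    return gamma_hat + correction
```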

Theorem 3.1.

Assume that H is invertible. Then under the conditions of Proposition 3.1, as $n\to\infty$,
$$\sqrt{k}\,(\hat\gamma_g-\gamma)\xrightarrow{d}N\!\left(\frac{\lambda(1+\gamma)}{(1-\rho)(1+\gamma-\rho)},\ \sigma^2\right),\tag{15}$$
where
$$\sigma^2=(1+\gamma)^2\left(1+\frac{1}{(H^{-1}_{11})^2}\left[2\sum_{i=1}^{d}\sum_{j=i+1}^{d}H^{-1}_{1i}H^{-1}_{1j}h_{ij}+(1-\nu^2)\sum_{j=2}^{d}(H^{-1}_{1j})^2\right]\right).$$

Remark 3.1.

It can be shown that increasing the number of covariates leads to a lower (or the same) asymptotic variance. More specifically, the quality of the estimation improves if we replace the "one-covariate" estimator $\hat\gamma_g$ in (7) by the present one. In particular, if we consider d = 3 and for simplicity g = γ and $R_{12}(1,1)=R_{13}(1,1)$, the asymptotic relative variance reduction in this case is equal to $2(1-\nu^2)R_{12}^2(1,1)/(1+R_{23}(1,1))$, see the formula for the asymptotic relative variance reduction in Ahmed and Einmahl (2019), which applies in this case. Consequently, the maximal variance reduction is achieved when $R_{23}(1,1)$ takes its minimal attainable value. Decreasing the tail dependence of the covariates will make them, as a pair, more relevant and hence will make the variance reduction larger. See also the simulations in the trivariate case at the end of the next section.

4 Simulation Study

Here we perform a detailed simulation study, mainly for the one-covariate setting; at the end of this section we briefly consider two covariates.

First we investigate the finite sample performance of our novel SSE of γ, and then we compare in detail the variances of the SSE with those of the pseudo-MLE based on $X_{n-k:n},\ldots,X_{n:n}$ only. In addition, we compare the variances of the adapted extreme quantile estimator with those of the standard estimator in (9).

We begin with simulating data from the bivariate Cauchy distribution restricted to the first quadrant. This Cauchy density is proportional to $(1+\mathbf{x}S^{-1}\mathbf{x}^T)^{-3/2}$, where S is a 2 × 2 scale matrix with 1 on the diagonal and s off-diagonal. For s we alternately take two values: 0 and 0.8. For the value 0.8 we have rather strong tail dependence, and for s = 0 the tail dependence is somewhat weaker. More precisely, the value 0 corresponds to an R(1, 1)-value of 0.59, whereas for s = 0.8 this value is 0.76. We expect that stronger tail dependence leads to a larger variance reduction. These data are denoted by $(\check X_i,Y_i)$. To obtain our data $(X_i, Y_i)$, where the $X_i$ have extreme value index γ, we transform the $\check X_i$ as follows
$$X_i=\begin{cases}\dfrac{\bigl(1-F_s(\check X_i)\bigr)^{-\gamma}-1}{\gamma}, & \gamma\ne 0,\\[1.5ex] -\log\bigl(1-F_s(\check X_i)\bigr), & \gamma=0,\end{cases}\tag{16}$$
where $F_s$ is the distribution function of $\check X_i$. Simulations are performed for values of γ that are negative, positive, or 0.
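A minimal sketch of this data-generating step in Python (NumPy assumed). The rejection-sampling construction and the approximation of $F_s$ by the empirical distribution function of a large auxiliary sample are illustrative choices, not necessarily the implementation used for the article:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbicauchy_first_quadrant(size, s, rng):
    """Bivariate Cauchy with scale matrix [[1, s], [s, 1]], restricted to the
    first quadrant by rejection sampling."""
    L = np.linalg.cholesky(np.array([[1.0, s], [s, 1.0]]))
    out = np.empty((0, 2))
    while len(out) < size:
        z = rng.standard_normal((4 * size, 2)) @ L.T
        w = rng.chisquare(1, size=4 * size)
        c = z / np.sqrt(w)[:, None]        # multivariate t with 1 df = Cauchy
        out = np.vstack([out, c[(c[:, 0] > 0) & (c[:, 1] > 0)]])
    return out[:size]

def transform_to_gamma(x_check, gamma, x_ref):
    """Transformation (16); here F_s is approximated by the empirical
    distribution function of a large reference sample x_ref."""
    F = np.searchsorted(np.sort(x_ref), x_check, side="right") / (len(x_ref) + 1)
    if gamma == 0.0:
        return -np.log(1.0 - F)
    return ((1.0 - F) ** (-gamma) - 1.0) / gamma
```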

First, we generate 500 samples of sizes n = 500, m = 1000, for s = 0.8, and estimate γ using both the SSE and the pseudo-MLE. We depict the root mean squared error (RMSE) based on these 500 samples as a function of k. We consider γ = −0.25, 0, and 0.25, and take g = 0. The RMSE of the SSE (indicated by AMLE in Figure 1) is indeed substantially lower than that of the pseudo-MLE for the different values of γ.


Fig. 1 RMSE using the pseudo-MLE and the SSE-MLE. From left to right: γ=−0.25,0,0.25.

Next, we focus on the (relative) variance reduction of the SSE in comparison to the pseudo-MLE. We use the following values of n and m (and k):

  • n=1000, m = 500 (less unlabeled than labeled data) and k = 250,

  • n=1000, m = 1000 (equal number of unlabeled and labeled data) and k = 250,

  • n=500, m = 1000 (more unlabeled than labeled data) and k = 125.

Table 1 shows the empirical percentages of variance reduction for different values of γ and g. The results are based on 10,000 replications. We observe that the variance reduction ranges from 10% to more than 30%; hence, the SSE indeed has a substantially smaller variance than the pseudo-MLE. By comparing the three panels, we observe that the variance reduction increases substantially with the ratio of the number of unlabeled data m to the number of labeled data n, which is in line with the asymptotic theory. Observe that the actual choice of g does not have a large influence as long as it is somewhat close to γ, a choice that is in practice often feasible.
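The reported percentages are empirical variance reductions over the Monte Carlo replications; a minimal helper of the kind one might use to tabulate them is sketched below (the two arrays of replicated estimates are assumed to come from the pseudo-MLE and the SSE, respectively):

```python
import numpy as np

def variance_reduction_pct(estimates_mle, estimates_sse):
    """Empirical percentage of variance reduction of the SSE relative to the
    pseudo-MLE over a set of Monte Carlo replications."""
    v_mle = np.var(estimates_mle, ddof=1)
    v_sse = np.var(estimates_sse, ddof=1)
    return 100.0 * (1.0 - v_sse / v_mle)
```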

Table 1 Variance reduction for different extreme value indices.

Next we investigate in more detail the sensitivity of the variance reduction to the choice of g, using a wider range of values of g, including cases where γ and g have opposite signs. In these simulations, we take s = 0 for the bivariate Cauchy distribution. The results, based on 10,000 replications, for the aforementioned values of n, m, and k are presented in Figure 2. Generally, there is always variance reduction, but for |g − γ| relatively large the reduction is lower than when g is closer to γ. Observe that for the present range of γ (−0.3 to 0.3) the choice g = 0 yields an almost maximal variance reduction.

Fig. 2 Variance reduction for various combinations of γ and g.


Finally, we study in more detail the effect of the size of m, the number of unlabeled data, on the variance reduction; again we take s = 0. We consider the case where n = 500 and let m vary; we choose g = 0. The results are based on 500 replications. Table 2 shows that the variance reduction approximately doubles when m ranges from 500 to 10,000.

Table 2 Variance reduction for different numbers of unlabeled data m.

The last part of the one-covariate simulation study is devoted to extreme quantile estimation. We take again s = 0 for the Cauchy distribution and use 10,000 replications. For n = 500, m = 1000, and k = 125, we estimate the extreme quantile $x_p$ for p = 1/n = 0.002. Table 3 shows the variance reductions when using the adapted extreme quantile estimator in (10) relative to the standard estimator in (9), for various values of γ and g. The variance reductions are substantial, also for γ < 0, and range from 17% to 33%.

Table 3 Variance reduction for the extreme quantile estimator.

We also study the effect of using the SSEs for both the extreme value index and the scale leading to the extreme quantile estimator in (10), compared to using a hybrid approach that only uses the SSE γ̂g for the extreme value index but keeps the classical σ̂ when estimating the scale.

We take g = 0 and estimate the extreme quantile $x_p$ for p = 1/n. Figure 3 shows the remarkable effect on the variance reduction of using the adapted estimator in (10) instead of the hybrid approach. The variance reduction is doubled in most of the cases, meaning that replacing σ̂ by the SSE $\hat\sigma_g$ leads to a considerably improved performance.

Fig. 3 Relative variance reduction for the extreme quantile estimators.


To conclude the bivariate setting, we briefly compare the sizable variance reductions obtained here with the reductions obtained from the asymptotic theory. It turns out that for the present sample sizes, when estimating γ, the asymptotic theory yields even higher reductions than the ones obtained here, partly due to the variability in the necessary estimation of the tail copulas, which is not reflected in the asymptotic variance. Based on limited comparisons it seems that for positive γ, asymptotic theory and simulations match somewhat better than in case γ is negative. Theoretically, for γ ≥ 0 the asymptotic variance reduction of the extreme quantile estimator $\hat x_p^{\,g}$ is the same as that of $\hat\gamma_g$. In contrast, the simulation results for the extreme quantile estimator (Table 3), in conjunction with those for the extreme value index (Table 1), indicate that for the practically relevant extreme quantile estimation, the variance reductions obtained in simulations can be even higher than those inferred from the asymptotic theory.

At the end of this simulation section we briefly consider the two-covariates setting. We begin with simulating data from the trivariate Cauchy distribution restricted to the first octant. The density is proportional to $(1+\mathbf{x}S^{-1}\mathbf{x}^T)^{-2}$, where S is now a 3 × 3 scale matrix with 1 on the diagonal and s off-diagonal, but $S_{23}=S_{32}=r$. We take respectively the values s = r = 0, s = r = 0.8, and s = 0.8, r = 0.3. When s = r = 0.8, we have rather strong tail dependence, with R(1, 1)-values of 0.77, whereas for s = r = 0 the R(1, 1)-values are 0.59. The case s = 0.8, r = 0.3 yields R(1, 1)-values of 0.81, except the one between the covariates, which is 0.63. Then, similarly as in (16) in the bivariate case, we transform the first coordinate to obtain our final data $(X_i,Y_{i,2},Y_{i,3})$; we will use γ = −0.2, 0, and 0.2. We take n = 500, m = 1000, k = 125, and 10,000 replications. Table 4 shows that the variance reduction can be as high as 50%. We observe that the case s = r = 0.8 indeed yields larger improvements than the case s = r = 0 and the one-covariate setting in Table 1. Interestingly, the case s = 0.8, r = 0.3 yields even larger variance reductions than the case s = r = 0.8. The smaller tail dependence between the covariates makes them as a pair more informative and leads to the much improved performance of $\hat\gamma_g$.

Table 4 Variance reduction with two covariates.

5 Application

In this section, we demonstrate an application using the SSE for analyzing forecasted precipitation data.

The national French weather service, Météo France, produces daily forecasted precipitation (in mm) at very high resolution (0.1°×0.1°) covering the mainland of France, between 2012 and 2017. Although usually an ensemble is forecasted, we use one single forecast (an ensemble member, not a forecasted mean or median) at every grid point and every day. To improve the forecasting model, meteorologists want to check if the forecasted precipitation shares the same distribution as the observed precipitation at the same location, particularly in the right tail. Consequently, the goal of this study is to estimate quantities such as the extreme value index and extreme quantiles of the forecasted precipitation distribution. We focus on forecasting at grid points that are close to an actual weather station.

Besides the forecasted precipitation, Météo France records the actual daily precipitation at 123 weather stations, between 1980 and 2017. We pair each weather station with the forecasting grid point that is closest to the station, and regard the two as the same location. When focusing on the fall seasons (91 days per year), at the 123 locations, we have 38 years of actual precipitation data (3458 observations), with the last 6 years paired with forecasting data (546 observations). For the last 6 years, the paired data are dependent since the forecasting data are made to forecast the precipitation at the same location on the same day. Part of this dataset has been employed in a study comparing the spatial dependence structure of extreme forecasted precipitation and extreme observed precipitation in southern France; see Oesting and Naveau (2020).¹

Since it is challenging to conduct extreme value analysis, such as extreme quantile estimation, for the forecasted precipitation based on only 546 observations, we make use of the available information in the actual precipitation to improve the estimation accuracy, exploiting the semi-supervised setting. Observe that the SSE uses the tail dependence between the two datasets, but not the (marginal) tail heaviness of the observed precipitation. Therefore, the SSE of the extreme value index of the forecasted precipitation does not incorporate properties of the distribution of the observed precipitation. This justifies comparing this SSE with the MLE of the extreme value index of the observed precipitation.

To validate that our proposed methodology can be applied to this dataset, we perform two pre-tests on the actual precipitation data (3458 observations) at each station. First, we test whether the actual precipitation at each station possesses the same distribution across time, using the test statistic $T_2$ proposed in Einmahl, de Haan, and Zhou (2016) (with k = 200). Second, we test whether the extreme precipitation at each station can be regarded as independent over time, by testing whether the extremal index is significantly different from 1, using the sliding block estimator proposed in Berghaus and Bücher (2018) (with b = 80). To achieve the 5% significance level for the joint test, we exclude all stations for which either of the two tests rejects the null at the 2.5% significance level. Eventually, for 100 stations both null hypotheses are not rejected, and we apply our proposed method to these 100 stations.

We use the SSE $\hat\gamma_g$ with g = 0 to estimate the extreme value index, and compare it with the pseudo-MLE γ̂. For both estimators, we take k = 136. In particular, we estimate both the tail dependence coefficient R(1, 1) and the variance reduction factor $(1-\nu^2)R_g^2$ to demonstrate the presence of tail dependence and to evaluate the improvement when using the SSE. In addition, we estimate the "once per 10 year" extreme rainfall, that is, the quantile at the probability level 1 − 1/910, by (9) and (10), to compare the impact of using the SSE on practically relevant quantities. Last but not least, we estimate the extreme value index and the same high quantile based on the actual precipitation, using the pseudo-MLE. In that estimation, we use the top $k_1=600$ observations, which in line with (3) corresponds to a lower fraction $k_1/(n+m)$ of POT compared to the fraction k/n used for the forecasted precipitation.
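For concreteness, the probability level and the two POT fractions just mentioned work out as follows, using the sample sizes reported above (91 fall days per year, n = 546 labeled and n + m = 3458 actual observations):
$$1-\frac{1}{91\times 10}=1-\frac{1}{910}\approx 0.9989,\qquad \frac{k_1}{n+m}=\frac{600}{3458}\approx 0.17\ <\ \frac{k}{n}=\frac{136}{546}\approx 0.25.$$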

Table 5 shows the results for three selected stations. We select the three stations from very distant areas: one from the south, one from the northwest, and one from the southwest. For the station from the south, Nîmes, the estimated extreme value index is positive, indicating a heavy-tailed distribution. Taking $\nu^2=n/(n+m)$, the reduction in variance is estimated at 16%. The difference between the two estimates of the extreme value index leads to a substantial difference in the quantile estimates: the quantile estimated using the SSE exceeds the usual quantile estimate by roughly 50%. In contrast, for the station in the northwest, Dieppe, the estimated extreme value index is around zero. The difference between the two point estimates is small, with the SSE having a 17.5% variance reduction. The two quantile estimates are about the same. Finally, for the station in the southwest, Ciboure, both estimators lead to negative estimates, although not significantly different from zero. The variance reduction is at a pronounced level: 28.4%. The quantile estimate using the SSE is somewhat lower than the usual one. Observe that the quantile estimates based on the forecasting data are overall lower than those based on the observed precipitation, also when using the SSE. This suggests that the forecasting model produces forecasts with a lighter tail than that of the actual observations. We added 90% confidence intervals for the extreme value indices and extreme quantiles. Here, for the SSE approach, the "common" quantities in the asymptotic variance that have to be estimated consistently (in particular $(1+\gamma)^2$ and $a(n/k)\,q_\gamma(k/(np))$) are estimated with the minimum of the estimates obtained from the MLE and SSE procedures.

Table 5 Estimation results for three stations.

Finally, we present a heatmap of the estimated variance reduction factors across all 100 locations in Figure 4. The variance reduction using the SSE compared to the pseudo-MLE ranges from 9% to 34.8%, and is on average 18.5%.

Fig. 4 Heatmap of the variance reduction factor across 100 stations.


6 Proofs

We first present proofs for the one-covariate (bivariate) case and then extend the proofs to the multivariate case. The asymptotic normality of $\sqrt{k}(\hat\gamma-\gamma)$, the first component of the pair in Proposition 2.1, is established in Drees, Ferreira, and de Haan (2004), see also the supplementary material, Section A. However, we cannot directly use that proof, since we have to keep track of the joint behavior of γ̂ and ĝ. Nevertheless, we mimic that proof for both γ̂ and ĝ, observing that ĝ is based on dependent observations. In this respect the proof has to be adapted substantially. We begin with various lemmas which are needed for the main proofs.

Let C be a copula corresponding to the distribution function of $(X_1,Y_1)$. Let $(V_{1,1},V_{1,2}),(V_{2,1},V_{2,2}),\ldots,(V_{n,1},V_{n,2})$ be a random sample of size n from C and $V_{n+1,2},\ldots,V_{n+m,2}$ be a random sample of size m from the uniform-(0, 1) distribution, independent of the random sample from C. Clearly all the $V_{i,j},\ i=1,\ldots,n,\ j=1,2$, also have a uniform-(0, 1) distribution. Write $X_i=F_1^{-1}(1-V_{i,1}),\ i=1,\ldots,n$, and $Y_l=F_2^{-1}(1-V_{l,2}),\ l=1,\ldots,n+m$. Then $(X_i,Y_i),\ i=1,\ldots,n$, and $Y_{n+1},\ldots,Y_{n+m}$ have the distributions as specified in the beginning of Section 2.

Consider the following uniform empirical distribution functions:
$$\Gamma_{n,j}(s)=\frac1n\sum_{i=1}^n \mathbf{1}_{[0,s]}(V_{i,j}),\quad 0\le s\le 1,\ j=1,2,\qquad \Gamma_{n+m}(t)=\frac{1}{n+m}\sum_{l=1}^{n+m}\mathbf{1}_{[0,t]}(V_{l,2}),\quad 0\le t\le 1.$$

The corresponding uniform tail empirical processes are
$$w_{n,j}(s)=\sqrt{k}\left[\frac nk\Gamma_{n,j}\!\left(\frac kn s\right)-s\right],\quad j=1,2,\qquad w_{n+m}(t)=\sqrt{\frac{(n+m)k}{n}}\left[\frac nk\Gamma_{n+m}\!\left(\frac kn t\right)-t\right].$$

Define the Gaussian vector of processes $(W_1,W_2,W_3)$, where $W_j,\ j=1,2,3$, is a standard Wiener process on $[0,T],\ T>0$, with covariances:
$$\operatorname{cov}(W_1(s),W_2(t))=R(s,t),\quad \operatorname{cov}(W_1(s),W_3(t))=\nu R(s,t),\quad \operatorname{cov}(W_2(s),W_3(t))=\nu(s\wedge t),\quad 0\le s,t\le T.$$

Let I denote the identity function. Then we have on $(D[0,T])^3$, for all $0\le\delta<\frac12$,
$$\left(\frac{w_{n,1}}{I^\delta},\frac{w_{n,2}}{I^\delta},\frac{w_{n+m}}{I^\delta}\right)\xrightarrow{d}\left(\frac{W_1}{I^\delta},\frac{W_2}{I^\delta},\frac{W_3}{I^\delta}\right),\quad\text{as }n\to\infty.\tag{17}$$

The proof of (17) is given in Ahmed and Einmahl (2019); note that in there T = 1, but the proof for arbitrary T > 0 follows similarly. Now using a Skorohod construction we obtain from (17) that
$$\sup_{0<s\le T}\frac{|w_{n,j}(s)-W_j(s)|}{s^\delta}\xrightarrow{a.s.}0,\ j=1,2,\qquad\text{and}\qquad \sup_{0<s\le T}\frac{|w_{n+m}(s)-W_3(s)|}{s^\delta}\xrightarrow{a.s.}0.\tag{18}$$

The processes in (18) are different from those in (17) but we keep the same notation, since the new vector (wn,1,wn,2,wn+m) has the same distribution as the old vector and also the new vector (W1,W2,W3) has the same distribution as the old vector. In the sequel the Xi and Yi are transformations as above of the uniform-(0,1) random variables on which the wn,j are based. We continue with the processes satisfying (18).

For convenience we introduce the following notation. Let $f_n,h_n$ be positive functions on $[l_n,u_n]$. Then we write, as $n\to\infty$,
$$f_n\overset{P}{\asymp}h_n\ \Big|_{l_n}^{u_n},$$
if both $f_n(s)=O_P(h_n(s))$ and $h_n(s)=O_P(f_n(s))$ hold uniformly for $s\in[l_n,u_n]$. This notation is useful for the following lemma, see Shorack and Wellner (2009, p. 419).

Lemma 6.1.

Let $\Gamma_{n,j}^{-1},\ j=1,2$, be the empirical quantile functions corresponding to $\Gamma_{n,j},\ j=1,2$, respectively. Then, as $n\to\infty$,
$$\Gamma_{n,j}\overset{P}{\asymp}I\ \Big|_{\Gamma_{n,j}^{-1}(1/(2n))}^{1}\qquad\text{and}\qquad \Gamma_{n,j}^{-1}\overset{P}{\asymp}I\ \Big|_{1/(2n)}^{1},\quad j=1,2.$$

The following lemma states the weighted convergence of the tail quantile processes corresponding to $\Gamma_{n,j}^{-1},\ j=1,2$, to the processes $W_1$ and $W_2$ in (18).

Lemma 6.2.

Let $\Gamma_{n,j}^{-1}$ be the empirical quantile functions corresponding to $w_{n,j}$, j = 1, 2, in (18) and let $W_j$ be as in (18), j = 1, 2. Then for any $\delta<\frac12$, as $n\to\infty$,
$$\sup_{\frac{1}{2k}\le s\le 1}\frac{\left|\sqrt{k}\left(\frac nk\Gamma_{n,j}^{-1}\!\left(\frac kn s\right)-s\right)+W_j(s)\right|}{s^\delta}\xrightarrow{P}0.\tag{19}$$

Proof.

Write $v_{n,j}(s)=\sqrt{k}\left(\frac nk\Gamma_{n,j}^{-1}\!\left(\frac kn s\right)-s\right),\ j=1,2$. Theorem 2.3 in Einmahl (1992) yields, as $n\to\infty$,
$$\sup_{\frac{1}{2k}\le s\le 1}\frac{|v_{n,j}(s)-W_{n,j}(s)|}{s^\delta}\xrightarrow{P}0,\tag{20}$$
where $W_{n,j}$ is an appropriate sequence of standard Wiener processes. Let W be a standard Wiener process and let ε > 0. It is well known that there exists an η > 0 such that
$$P\left(\sup_{0<s\le\eta}\frac{|W(s)|}{s^\delta}\ge\frac{\varepsilon}{2}\right)\le\frac{\varepsilon}{2}.\tag{21}$$

Combining (20) and (21) yields, for large n,
$$P\left(\sup_{\frac{1}{2k}\le s\le\eta}\frac{|v_{n,j}(s)|}{s^\delta}\ge\varepsilon\right)\le\varepsilon.\tag{22}$$

Combining (18), (21), (22), and Lemma 1 in Vervaat (1972) yields (19). □

The next lemma is very similar to Lemma 3.1 in Drees, Ferreira, and de Haan (2004), but the lemma therein cannot be used here because we need specifically the approximation with the present $W_1$ in order to obtain the joint behavior of γ̂ and ĝ.

Lemma 6.3.

Let ε > 0. Assume that (3) and (4) hold and $\sqrt{k}\,A(n/k)=O(1)$, as $n\to\infty$. Then for suitably chosen functions a and A in (4), as $n\to\infty$,
$$\sup_{\frac{1}{2k}\le s\le 1}s^{\gamma+1/2+\varepsilon}\left|\sqrt{k}\left(\frac{X_{n-[ks]:n}-U_1(\frac nk)}{a(\frac nk)}-\frac{s^{-\gamma}-1}{\gamma}\right)-s^{-\gamma-1}W_1(s)-\sqrt{k}\,A\!\left(\frac nk\right)\Psi(s^{-1})\right|\xrightarrow{P}0.$$

Proof.

From (4) we obtain inequality (2.3.17) in de Haan and Ferreira (2006): for any θ, δ > 0 to be specified later, there exists $t_0=t_0(\theta,\delta)$ such that for all $t,tx\ge t_0$,
$$\left|\frac{\frac{U_1(tx)-U_1(t)}{a(t)}-\frac{x^\gamma-1}{\gamma}}{A(t)}-\Psi(x)\right|\le\theta\, x^{\gamma+\rho}\max(x^\delta,x^{-\delta}).$$

We replace tx by $1/\Gamma_{n,1}^{-1}(\frac kn s)$ and t by $\frac nk$. Then we have, writing $\check s=\frac nk\Gamma_{n,1}^{-1}(\frac kn s)$, with probability tending to 1, as $n\to\infty$,
$$\left|\frac{X_{n-[ks]:n}-U_1(\frac nk)}{a(\frac nk)}-\frac{\check s^{-\gamma}-1}{\gamma}-A\!\left(\frac nk\right)\Psi(\check s^{-1})\right|\le\left|A\!\left(\frac nk\right)\right|\theta\,\check s^{-\gamma-\rho}\max(\check s^{\delta},\check s^{-\delta}).\tag{23}$$

Define $f(s)=\frac{s^{-\gamma}-1}{\gamma}$. Then by a Taylor expansion, for some $\check\Theta_n(s)$ between $\check s$ and s we have $f(\check s)-f(s)=f'(s)(\check s-s)+\frac12 f''(\check\Theta_n(s))(\check s-s)^2$. Lemma 6.1 implies $\check\Theta_n\overset{P}{\asymp}I\,\big|_{\frac{1}{2k}}^{1}$ and thus $f''(\check\Theta_n)\overset{P}{\asymp}I^{-\gamma-2}\,\big|_{\frac{1}{2k}}^{1}$. Next, by Lemma 6.2 and the fact that for all $\delta_1<\frac12$, $\sup_{0\le s\le1}|W_1(s)|/s^{\delta_1}=O_P(1)$, we have that $\sup_{\frac{1}{2k}\le s\le1}(\check s-s)^2/s^{2\delta_1}=O_P(1/k)$, as $n\to\infty$. This and again Lemma 6.2 with $\delta=\delta_1$ yield, as $n\to\infty$, uniformly for all $\frac{1}{2k}\le s\le1$,
$$f(\check s)-f(s)=s^{-\gamma-1}\frac{1}{\sqrt k}\bigl(W_1(s)+s^{\delta_1}o_P(1)\bigr)+s^{-\gamma-2+2\delta_1}O_P\!\left(\frac1k\right).$$

Choose $\delta_1$ such that $\frac{1-\varepsilon}{2}<\delta_1<\frac12$. Then $\delta_1>\frac12-\varepsilon$ and $2\delta_1+\varepsilon>1$. Hence, as $n\to\infty$,
$$\sup_{\frac{1}{2k}\le s\le1}s^{-\frac32+\varepsilon+2\delta_1}=\max\Bigl(1,(2k)^{\frac32-\varepsilon-2\delta_1}\Bigr)=o(\sqrt k).$$

Therefore, as $n\to\infty$, uniformly for all $\frac{1}{2k}\le s\le1$,
$$\begin{aligned}f(\check s)&=f(s)+\frac{1}{\sqrt k}s^{-\gamma-1}\left(W_1(s)+s^{\delta_1}o_P(1)+s^{-1+2\delta_1}O_P\!\left(\frac{1}{\sqrt k}\right)\right)\\
&=f(s)+\frac{1}{\sqrt k}s^{-\gamma-1}\left(W_1(s)+s^{1/2-\varepsilon}\Bigl(s^{\delta_1-1/2+\varepsilon}o_P(1)+s^{-3/2+\varepsilon+2\delta_1}O_P\!\left(\frac{1}{\sqrt k}\right)\Bigr)\right)\\
&=\frac{s^{-\gamma}-1}{\gamma}+\frac{1}{\sqrt k}s^{-\gamma-1}\bigl(W_1(s)+s^{1/2-\varepsilon}o_P(1)\bigr).\end{aligned}\tag{24}$$

From the mean value theorem, for some $\Theta_n(s)$ between $\check s$ and s,
$$\Psi(\check s^{-1})=\Psi(s^{-1})-\frac{\Psi'(1/\Theta_n(s))}{(\Theta_n(s))^2}(\check s-s).$$

As above, $\Theta_n\overset{P}{\asymp}I\,\big|_{\frac{1}{2k}}^{1}$, which implies that as $n\to\infty$, uniformly for $\frac{1}{2k}\le s\le1$,
$$\left|\frac{\Psi'(1/\Theta_n(s))}{(\Theta_n(s))^2}\right|=s^{-\gamma-\rho-1}(1+|\log s|)\,O_P(1).$$

Hence, using Lemma 6.2 with $\delta=\delta_1$ (as above), we have uniformly for $\frac{1}{2k}\le s\le1$,
$$A\!\left(\frac nk\right)\bigl(\Psi(\check s^{-1})-\Psi(s^{-1})\bigr)=\frac{1}{\sqrt k}A\!\left(\frac nk\right)s^{-\gamma-\rho-1+\delta_1}(1+|\log s|)\,O_P(1).$$

With $\delta_1$ chosen as above, we have that as $n\to\infty$, uniformly for $\frac{1}{2k}\le s\le1$,
$$A\!\left(\frac nk\right)\Psi(\check s^{-1})=A\!\left(\frac nk\right)\Psi(s^{-1})+\frac{1}{\sqrt k}s^{-\gamma-\varepsilon-\frac12}o_P(1).\tag{25}$$

Next consider the right-hand side of (23), where we take δ < 1/2. Using Lemma 6.1, it can be bounded, uniformly for $\frac{1}{2k}\le s\le1$, by
$$\theta\left|A\!\left(\frac nk\right)\right|s^{-\gamma-\rho-\delta}O_P(1)=\theta\sqrt k\left|A\!\left(\frac nk\right)\right|\frac{1}{\sqrt k}s^{-\gamma-\delta}O_P(1)=\theta\frac{1}{\sqrt k}s^{-\gamma-\varepsilon-1/2}s^{\varepsilon+1/2-\delta}O_P(1)=\theta\frac{1}{\sqrt k}s^{-\gamma-\varepsilon-1/2}O_P(1).\tag{26}$$

Now, plugging (24)–(26) into inequality (23) and noting that θ>0 can be chosen arbitrarily small, we obtain the statement in the lemma. □

Define
$$Z_n(s)=\sqrt{k}\left(\frac{X_{n-[ks]:n}-X_{n-k:n}}{a(\frac nk)}-\frac{s^{-\gamma}-1}{\gamma}\right).$$

Then for functions a and A as in Lemma 6.3, for any ε > 0, uniformly for $\frac{1}{2k}\le s\le 1$,
$$Z_n(s)=s^{-\gamma-1}W_1(s)-W_1(1)+\sqrt{k}\,A\!\left(\frac nk\right)\Psi(s^{-1})+o_P(1)\,s^{-\gamma-1/2-\varepsilon}.$$

Hence, for $\gamma>-\frac12$,
$$\sup_{\frac{1}{2k}\le s\le 1}s^{\gamma+1/2+\varepsilon}|Z_n(s)|=O_P(1).\tag{27}$$

Proposition 6.1.

Under the conditions of Lemma 6.3, for $\gamma>-\frac12$ and $\gamma\ne 0$, with probability tending to 1, there exists a unique maximizer of the likelihood function based on $\{X_i\}_{i=1}^n$, denoted as γ̂, such that as $n\to\infty$,
$$\sqrt{k}\,(\hat\gamma-\gamma)-\frac{(\gamma+1)^2}{\gamma}\int_0^1\bigl(s^\gamma-(2\gamma+1)s^{2\gamma}\bigr)Z_n(s)\,ds=o_P(1),$$
and, for γ = 0,
$$\sqrt{k}\,\hat\gamma+\int_0^1(2+\log s)Z_n(s)\,ds=o_P(1).$$

Proof.

The existence of γ̂ follows from Theorem 4.1 in Zhou (2009). Then, using Lemma 6.3 and (27) above in conjunction with Lemma 3.2 in Drees, Ferreira, and de Haan (2004), the result is obtained following the same steps as in the proof of Proposition 3.1 in Drees, Ferreira, and de Haan (2004). A sketch of the main steps of the proof can be found in the supplementary material, Section A. □

To study the asymptotic behavior of ĝ we need the following result. Define
$$\tilde w_n(s)=\frac{n}{\sqrt k}\left(\Gamma_{n+m}\Bigl(\Gamma_{n,2}^{-1}\Bigl(\frac kn s\Bigr)\Bigr)-\frac kn s\right)\qquad\text{and}\qquad \tilde W(s)=\nu W_3(s)-W_2(s).$$

Lemma 6.4.

Assume that $F_2$ is continuous and k satisfies (3). Then for any $0\le\delta<\frac12$, as $n\to\infty$,
$$\sup_{\frac{1}{2k}\le s\le 1}\frac{|\tilde w_n(s)-\tilde W(s)|}{s^\delta}\xrightarrow{P}0.$$

Proof.

We have
$$\tilde w_n(s)=\sqrt{\frac{n}{n+m}}\,w_{n+m}\!\left(\frac nk\Gamma_{n,2}^{-1}\Bigl(\frac kn s\Bigr)\right)+\frac{n}{\sqrt k}\left(\Gamma_{n,2}^{-1}\Bigl(\frac kn s\Bigr)-\frac kn s\right).$$

Define $\hat s=\frac nk\Gamma_{n,2}^{-1}(\frac kn s)$. From Lemma 6.2 with j = 2, (3), and (21), we see that it suffices to show that, as $n\to\infty$,
$$\sup_{\frac{1}{2k}\le s\le 1}\frac{|w_{n+m}(\hat s)-W_3(s)|}{s^\delta}\xrightarrow{P}0.\tag{28}$$

Let $s_0\in(0,1)$. We first handle the region $s\ge s_0$. Obviously we have $1/s^\delta\le 1/s_0^\delta$. By Lemma 6.2, as $n\to\infty$,
$$\sup_{\frac{1}{2k}\le s\le 1}|\hat s-s|\xrightarrow{P}0.\tag{29}$$

Using this, (18), and the uniform continuity of $W_3$ we obtain, as $n\to\infty$,
$$\sup_{s_0\le s\le 1}\frac{|w_{n+m}(\hat s)-W_3(s)|}{s^\delta}\xrightarrow{P}0.$$

It remains to show that for ε > 0 there exists $s_0\in(0,1)$ such that for large n,
$$P\left(\sup_{\frac{1}{2k}\le s\le s_0}\frac{|w_{n+m}(\hat s)-W_3(s)|}{s^\delta}\ge 3\varepsilon\right)\le 3\varepsilon.$$

Using again (21), for this it suffices to show that $P\bigl(\sup_{\frac{1}{2k}\le s\le s_0}|w_{n+m}(\hat s)|/s^\delta\ge 2\varepsilon\bigr)\le 2\varepsilon$. Using Lemma 6.1, the proof is complete if we show that for all ε > 0, κ > 0 there exists $s_0\in(0,1)$ such that for large n, $P\bigl(\sup_{\frac{1}{2k}\le s\le s_0}|w_{n+m}(\hat s)|/\hat s^\delta\ge 2\kappa\bigr)\le\varepsilon$. We have
$$P\left(\sup_{\frac{1}{2k}\le s\le s_0}\frac{|w_{n+m}(\hat s)|}{\hat s^\delta}\ge 2\kappa\right)\le P\left(\sup_{0<t\le 2s_0}\frac{|w_{n+m}(t)|}{t^\delta}\ge\kappa\right)+P\left(\sup_{\frac{1}{2k}\le s\le s_0}\hat s>2s_0\right).$$

From (18) and (21), we have that for small enough $s_0\in(0,1)$ the first term on the right is bounded by ε/2 for large n, and using (29) we obtain that the second term on the right also does not exceed ε/2 for large n. □

In the following we prove a result for the tail quantile process based on $\{\tilde Y_i\}_{i=1}^n$ instead of $\{X_i\}_{i=1}^n$. The proof of the next lemma uses Lemma 6.4 and is very similar to, but easier than, that of Lemma 6.3, and hence will be omitted.

Lemma 6.5.

Let ε > 0. Assume that $F_2$ is continuous and that (3) holds. Then, as $n\to\infty$,
$$\sup_{\frac{1}{2k}\le s\le 1}s^{g+1/2+\varepsilon}\left|\sqrt{k}\left(\frac{\tilde Y_{n-[ks]:n}-\frac{(n/k)^g-1}{g}}{(n/k)^g}-\frac{s^{-g}-1}{g}\right)+s^{-g-1}\tilde W(s)\right|\xrightarrow{P}0.$$

Define
$$H_n(s):=\sqrt{k}\left(\frac{\tilde Y_{n-[ks]:n}-\tilde Y_{n-k:n}}{(n/k)^g}-\frac{s^{-g}-1}{g}\right).$$

Then for any ε > 0, uniformly for $s\in[\frac{1}{2k},1]$,
$$H_n(s)=\tilde W(1)-s^{-g-1}\tilde W(s)+o_P(1)\,s^{-g-1/2-\varepsilon}.$$

Hence, for $g>-\frac12$,
$$\sup_{\frac{1}{2k}\le s\le 1}s^{g+1/2+\varepsilon}|H_n(s)|=O_P(1).\tag{30}$$

Next we show a version of Lemma 3.2 in Drees, Ferreira, and de Haan (2004) based on $\{\tilde Y_i\}_{i=1}^n$.

Lemma 6.6.

Assume that $F_2$ is continuous and k satisfies (3). Let $g_n$ be a sequence of random variables such that $g_n=g+O_P(k^{-1/2})$. Then, if $-\frac12<g<0$ or g > 0, as $n\to\infty$,
$$P\left(1+g_n\,\frac{\tilde Y_{n-[ks]:n}-\tilde Y_{n-k:n}}{(n/k)^g}\ge C_n s^{-g},\ \text{for all }s\in\Bigl[\frac{1}{2k},1\Bigr]\right)\to 1,\tag{31}$$
for some random variables $C_n>0$ such that $1/C_n=O_P(1)$.

If g = 0, as $n\to\infty$,
$$P\left(1+g_n\bigl(\tilde Y_{n-[ks]:n}-\tilde Y_{n-k:n}\bigr)\ge\frac12,\ \text{for all }s\in\Bigl[\frac{1}{2k},1\Bigr]\right)\to 1,\tag{32}$$
and
$$\sup_{s\in[0,1]}\bigl(\tilde Y_{n-[ks]:n}-\tilde Y_{n-k:n}\bigr)=O_P(\log k).\tag{33}$$

Proof.

Consider first $-\frac12<g<0$ or g > 0. Applying Lemma 6.1 to $\Gamma_{n+m}$ and $\Gamma_{n,2}^{-1}$ yields, as $n\to\infty$,
$$\Gamma_{n+m}\Bigl(\Gamma_{n,2}^{-1}\Bigl(\frac kn I\Bigr)\Bigr)\overset{P}{\asymp}\frac kn I\ \Big|_{\frac{1}{2k}}^{1}.$$

Define $G_n(s)=\Gamma_{n+m}\bigl(\Gamma_{n,2}^{-1}(\frac kn s)\bigr)+\frac{1}{2(n+m)}$, $s\in(0,1]$. Hence, as $n\to\infty$,
$$G_n\overset{P}{\asymp}\frac kn I\ \Big|_{\frac{1}{2k}}^{1}.\tag{34}$$

Observe that for $g\ne 0$
$$s^g\left(1+g_n\,\frac{\tilde Y_{n-[ks]:n}-\tilde Y_{n-k:n}}{(n/k)^g}\right)=s^g\left(1+\frac{g_n}{g}\,\frac{(G_n(s))^{-g}-(G_n(1))^{-g}}{(n/k)^g}\right)=\frac{g_n}{g}\left(\frac{G_n(s)}{(ks)/n}\right)^{-g}+s^g\left[\left(1-\left(\frac{G_n(1)}{k/n}\right)^{-g}\right)-\left(\frac{g_n}{g}-1\right)\left(\frac{G_n(1)}{k/n}\right)^{-g}\right]=:T_1(s)+s^g\,[T_2-T_3].$$

From (34) and $g_n/g\xrightarrow{P}1$, we have that $1/\inf_{s\in[1/(2k),1]}T_1(s)=O_P(1)$, as $n\to\infty$. Lemma 6.4 for s = 1 yields that $T_2=O_P(1/\sqrt k)$ and hence, since $g>-\frac12$, $\sup_{s\in[1/(2k),1]}s^g\,T_2\xrightarrow{P}0$. By the assumption on $g_n$ and again (34) we obtain similarly $\sup_{s\in[1/(2k),1]}s^g\,T_3\xrightarrow{P}0$. This yields (31).

In case g = 0, for $\frac{1}{2k}\le s\le 1$,
$$\tilde Y_{n-[ks]:n}-\tilde Y_{n-k:n}=-\log G_n(s)+\log G_n(1)\le 2\log A_n-\log s,\tag{35}$$
with
$$A_n=\max\left(\sup_{s\in[\frac{1}{2k},1]}\frac{G_n(s)}{\frac kn s},\ \sup_{s\in[\frac{1}{2k},1]}\frac{\frac kn s}{G_n(s)}\right).$$

If $g_n\ge 0$, then $1+g_n(\tilde Y_{n-[ks]:n}-\tilde Y_{n-k:n})\ge 1$. If $g_n<0$, then for $\frac{1}{2k}\le s\le 1$,
$$1+g_n\bigl(\tilde Y_{n-[ks]:n}-\tilde Y_{n-k:n}\bigr)\ge 1+g_n\bigl(2\log A_n+\log 2+\log k\bigr).$$

Since, as $n\to\infty$, $A_n=O_P(1)$ and $g_n=O_P(k^{-1/2})$, we obtain (32). Finally, the sup in (33) is attained at $s=1/(2k)$. Hence, (35) yields (33). □

Finally, the following proposition provides the asymptotic behavior of the pseudo-MLE based on $\{\tilde Y_i\}_{i=1}^n$.

Proposition 6.2.

Assume that $F_2$ is continuous and k satisfies (3). For $g>-\frac12$ and $g\ne 0$, with probability tending to 1, there exists a unique maximizer of the likelihood function based on $\{\tilde Y_i\}_{i=1}^n$, denoted as ĝ, such that, as $n\to\infty$,
$$\sqrt{k}\,(\hat g-g)-\frac{(g+1)^2}{g}\int_0^1\bigl(s^g-(2g+1)s^{2g}\bigr)H_n(s)\,ds=o_P(1),$$
and, for g = 0,
$$\sqrt{k}\,\hat g+\int_0^1(2+\log s)H_n(s)\,ds=o_P(1).$$

Proof.

The proof follows the same steps as that of Proposition 6.1; see the supplementary material, Section A. The main difference is that $\{\tilde Y_i\}_{i=1}^n$ are not iid observations. Nevertheless, Lemma 6.5 guarantees that statistics based on the tail quantile process of $\{\tilde Y_i\}_{i=1}^n$, for example the Hill estimator for g > 0, possess similar asymptotic behavior as in the iid case, with the only difference that the random limit is driven by a proper functional of $\tilde W$ instead. Such asymptotic expansions are sufficient to ensure that the proof can still be realized.

Then, following the steps explained in the proof of Proposition 6.1, by using Lemma 6.5, (30), and Lemma 6.6, we get the analogous result as in Proposition 6.1. Note that Lemma 6.6 plays an analogous role as Lemma 3.2 in Drees, Ferreira, and de Haan (2004). □

Proof of Proposition 2.1.

Combining (18) and Propositions 6.1 and 6.2 we obtain, as $n\to\infty$,
$$\bigl(\sqrt{k}(\hat\gamma-\gamma),\ \sqrt{k}(\hat g-g)\bigr)\xrightarrow{d}(\Omega,\tilde\Omega),$$
where
$$\Omega=\frac{(\gamma+1)^2}{\gamma}\int_0^1\bigl(s^\gamma-(2\gamma+1)s^{2\gamma}\bigr)\bigl(s^{-\gamma-1}W_1(s)-W_1(1)\bigr)\,ds+\frac{\lambda(\gamma+1)}{(1-\rho)(1+\gamma-\rho)}$$

and
$$\tilde\Omega=\frac{(g+1)^2}{g}\int_0^1\bigl(t^g-(2g+1)t^{2g}\bigr)\bigl(\tilde W(1)-t^{-g-1}\tilde W(t)\bigr)\,dt.$$

Since the Wiener processes involved have mean zero, we immediately obtain the mean of the limiting pair. Also the individual variances of the limiting pair follow readily, see Drees, Ferreira, and de Haan (2004). The proof is completed by calculating the covariance of the two terms, which is deferred to the supplementary material, Section B. □

Proof of Theorem 2.1.

From the uniform consistency of R̂ on $[0,1]^2$, it can be shown that $\hat R_g\xrightarrow{P}R_g$. Using the latter convergence in combination with Proposition 2.1 we obtain that, as $n\to\infty$,
$$\sqrt{k}\,(\hat\gamma_g-\gamma)=\sqrt{k}\,(\hat\gamma-\gamma)-\frac{1+\gamma}{1+g}\,R_g\,\sqrt{k}\,(\hat g-g)+o_P(1).\tag{36}$$

Now Proposition 2.1 in conjunction with the continuous mapping theorem yields (8). □

The proof of Proposition 3.1 can be given along the same lines as that of Proposition 2.1 and will be omitted. Note that the lemmas and propositions needed for the proof of Proposition 2.1 are of a univariate nature and that hence very similar lemmas can immediately be stated (and proved) in the more-covariates case. Once these results are given, only a straightforward covariance calculation remains; see Ahmed and Einmahl (2019) for the joint weak convergence of all the tail empirical processes involved.

Proof of Theorem 3.1.

From the uniform consistency of the tail copula estimators we obtain $\hat H^{-1}_{1j}\xrightarrow{P}H^{-1}_{1j}$, $j=1,\ldots,d$, which in combination with Proposition 3.1 yields that, as $n\to\infty$,
$$\sqrt{k}\,(\hat\gamma_g-\gamma)=\sqrt{k}\,(\hat\gamma-\gamma)+\frac{1+\gamma}{1+g}\sum_{j=2}^d\frac{H^{-1}_{1j}}{H^{-1}_{11}}\,\sqrt{k}\,(\hat g_j-g)+o_P(1).$$

Now Proposition 3.1 and the continuous mapping theorem yield (15). □

Supplementary Materials

The supplementary material provides additional proofs needed, additional simulation results, as well as all codes used for the simulations.


Acknowledgments

We thank the Editor, Associate Editor, and three Referees for a careful reading of the article and various helpful comments that greatly improved the article. John Einmahl holds the Arie Kapteyn Chair 2019–2022 and gratefully acknowledges the corresponding research support.

Disclosure Statement

The authors report there are no competing interests to declare.

Notes

1 We thank Météo France for providing the dataset. The data are available upon request, with the consent of Météo France.

References

  • Ahmed, H., and Einmahl, J. H. J. (2019), “Improved Estimation of the Extreme Value Index Using Related Variables,” Extremes, 22, 553–569. DOI: 10.1007/s10687-019-00358-y.
  • Azriel, D., Brown, L. D., Sklar, M., Berk, R., Buja, A., and Zhao, L. (2022), “Semi-Supervised Linear Regression,” Journal of the American Statistical Association, 177, 2238–2251. DOI: 10.1080/01621459.2021.1915320.
  • Beirlant, J., Goegebeur, Y., Segers, J., and Teugels, J. (2004), Statistics of Extremes: Theory and Applications, Chichester: Wiley.
  • Berghaus, B., and Bücher, A. (2018), “Weak Convergence of a Pseudo Maximum Likelihood Estimator for the Extremal Index,” The Annals of Statistics, 46, 2307–2335. DOI: 10.1214/17-AOS1621.
  • Buishand, T., de Haan, L., and Zhou, C. (2008), “On Spatial Extremes: With Application to a Rainfall Problem,” The Annals of Applied Statistics, 2, 624–642. DOI: 10.1214/08-AOAS159.
  • Chakrabortty, A., and Cai, T. (2018), “Efficient and Adaptive Linear Regression in Semi-Supervised Settings,” The Annals of Statistics, 46, 1541–1572. DOI: 10.1214/17-AOS1594.
  • Coles, S. G., and Tawn, J. A. (1991), “Modelling Extreme Multivariate Events,” Journal of the Royal Statistical Society, Series B, 53, 377–392. DOI: 10.1111/j.2517-6161.1991.tb01830.x.
  • Coles, S. G., and Walshaw, D. (1994), “Directional Modelling of Extreme Wind Speeds,” Journal of the Royal Statistical Society, Series C, 43, 139–157.
  • de Haan, L., and de Ronde, J. (1998), “Sea and Wind: Multivariate Extremes at Work,” Extremes, 1, 7–45. DOI: 10.1023/A:1009909800311.
  • de Haan, L., and Ferreira, A. (2006), Extreme Value Theory: An Introduction, New York: Springer.
  • de Haan, L., and Rootzén, H. (1993), “On the Estimation of High Quantiles,” Journal of Statistical Planning and Inference, 35, 1–13. DOI: 10.1016/0378-3758(93)90063-C.
  • Dekkers, A. L., Einmahl, J. H. J., and de Haan, L. (1989), “A Moment Estimator for the Index of an Extreme-Value Distribution,” The Annals of Statistics, 17, 1833–1855. DOI: 10.1214/aos/1176347397.
  • Drees, H., Ferreira, A., and de Haan, L. (2004), “On Maximum Likelihood Estimation of the Extreme Value Index,” The Annals of Applied Probability, 14, 1179–1201. DOI: 10.1214/105051604000000279.
  • Drees, H., and Huang, X. (1998), “Best Attainable Rates of Convergence for Estimators of the Stable Tail Dependence Function,” Journal of Multivariate Analysis, 64, 25–46. DOI: 10.1006/jmva.1997.1708.
  • Einmahl, J. H. J. (1992), “Limit Theorems for Tail Processes with Application to Intermediate Quantile Estimation,” Journal of Statistical Planning and Inference, 32, 137–145. DOI: 10.1016/0378-3758(92)90156-M.
  • Einmahl, J. H. J., de Haan, L., and Zhou, C. (2016), “Statistics of Heteroscedastic Extremes,” Journal of the Royal Statistical Society, Series B, 78, 31–51. DOI: 10.1111/rssb.12099.
  • Oesting, M., and Naveau, P. (2020), “Spatial Modeling of Heavy Precipitation by Coupling Weather Station Recordings and Ensemble Forecasts with Max-Stable Processes,” arXiv preprint arXiv:2003.05854.
  • Shorack, G. R., and Wellner, J. A. (2009), Empirical Processes with Applications to Statistics, Philadelphia: SIAM.
  • Smith, R. L. (1987), “Estimating Tails of Probability Distributions,” The Annals of Statistics, 15, 1174–1207. DOI: 10.1214/aos/1176350499.
  • Vapnik, V. (2013), The Nature of Statistical Learning Theory, New York: Springer.
  • Vervaat, W. (1972), “Functional Central Limit Theorems for Processes with Positive Drift and their Inverses,” Probability Theory and Related Fields, 23, 245–253.
  • Wasserman, L., and Lafferty, J. D. (2008), “Statistical Analysis of Semi-Supervised Regression,” in Advances in Neural Information Processing Systems, pp. 801–808.
  • Zhang, A., Brown, L. D., and Cai, T. T. (2019), “Semi-Supervised Inference: General Theory and Estimation of Means,” The Annals of Statistics, 47, 2538–2566. DOI: 10.1214/18-AOS1756.
  • Zhou, C. (2009), “Existence and Consistency of the Maximum Likelihood Estimator for the Extreme Value Index,” Journal of Multivariate Analysis, 100, 794–815. DOI: 10.1016/j.jmva.2008.08.009.
  • Zhu, X., and Goldberg, A. B. (2009), “Introduction to Semi-Supervised Learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, 3, 1–130. DOI: 10.2200/S00196ED1V01Y200906AIM006.