Abstract
The Tukey (or halfspace) depth extends nonparametric methods toward multivariate data. The multivariate analogues of the quantiles are the central regions of the Tukey depth, defined as sets of points in the d-dimensional space whose Tukey depth exceeds given thresholds k. We address the problem of fast and exact computation of those central regions. First, we analyze an efficient Algorithm (A) from Liu, Mosler, and Mozharovskyi, and prove that it yields exact results in dimension d = 2, or for a low threshold k in arbitrary dimension. We provide examples where Algorithm (A) fails to recover the exact Tukey depth region for d > 2, and propose a modification that is guaranteed to be exact. We express the problem of computing the exact central region in its dual formulation, and use that viewpoint to demonstrate that further substantial improvements to our algorithm are unlikely. An efficient C++ implementation of our exact algorithm is freely available in the R package TukeyRegion.
Supplementary Materials
An updated R package TukeyRegion, version 0.1.6.3, in which the novel exact Algorithm (B) is implemented.
A pdf file with an additional Algorithm (A3), motivated by an extension of Algorithm (A) with k = 2. Using the dual graph, we present a dataset for which this possible simplification of Algorithms (B) and (C) also fails to recover the central region. Further, we propose using the dual graph for a heuristic assessment of the quality of approximation achieved by non-exact algorithms such as Algorithms (A), (A2), or (A3). This file also contains detailed results of the complete simulation study.
A Mathematica notebook with functions for computing the dual graph of X, containing also interactive visualizations of all the examples provided in this article.
Complete R source code for the simulation studies performed in Section 4 and in the supplementary material.
Disclosure Statement
The authors report there are no competing interests to declare.
Notes
1 We consider only the depth for datasets, that is, the sample depth. For general measures the depth is typically scaled into the interval [0, 1], which is obtained by dividing our expression for hD by n. For our purposes, the integer-valued version of the depth is more convenient to work with, and this minor difference entails no loss of generality.
2 By the barycenter we mean the expected value of the uniform distribution on this convex compact set.
3 A set of points in ℝ^d is in general position if no d + 1 of these points lie in a hyperplane.
4 The convex hull of A is defined as the intersection of all convex sets that contain A; its affine hull is the intersection of all translations of vector subspaces (that is, affine subspaces of ℝ^d) that contain A.
5 At this step we slightly simplify Algorithm 2 from [LMM]. In the original version, only two relevant halfspaces of this type are found in Step 2(d) [LMM, p. 686]. Of course, our inclusion of (possibly) more than two relevant halfspaces in (A2) makes Algorithm (A) search through more ridges. Thus, if Algorithm 2 from [LMM] is exact, then so must be our Algorithm (A). This difference is of no importance for our exposition, and does not alter any of our conclusions.
6 The extreme cases are not interesting, because clearly if (see, e.g., Liu, Luo, and Zuo 2020, Theorem 1). Furthermore, for n even and , if the set is non-empty, then X is a halfspace symmetric (Zuo and Serfling 2000b) configuration of points. By Liu, Luo, and Zuo (2020, Proposition 1) for d > 2 this is impossible. For d = 2 and the nontrivial case n > 2 this is possible only for a single point set (Zuo and Serfling 2000b, Theorem 3.1), a situation which is not covered by RidgeSearch. In fact, it can be shown that for X sampled from an absolutely continuous distribution in dimension d = 2, is either empty or a sample point from X, with probability one (Pokorný, Laketa, and Nagy 2023).
7 It must be, however, noted that the time complexity of Step (C3) might exceed that of Steps (C1) and (C2). For numerical evidence, see Tables 3 and 4 in [LMM] where computation times for Steps (C1) and (C2) are reported (times without brackets) together with times for Step (C3) (times in brackets).
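As an illustration of Note 1, the integer-valued sample depth and its scaled counterpart can be sketched for d = 2 as follows. This is a minimal brute-force sketch in Python, not the algorithm of this article and not the interface of the TukeyRegion package. The function names tukey_depth_2d and scaled_depth_2d and the four-point dataset are hypothetical, and integer coordinates are assumed so that all sign tests are exact.

```python
def tukey_depth_2d(z, points):
    """Integer-valued halfspace depth of z with respect to `points` (d = 2).

    The depth is the minimum number of data points contained in a closed
    halfplane whose boundary line passes through z. Runs in O(n^2) time.
    """
    diffs = [(px - z[0], py - z[1]) for px, py in points]
    at_z = sum(1 for d in diffs if d == (0, 0))   # points equal to z always count
    others = [d for d in diffs if d != (0, 0)]
    best = len(others)
    # The count changes only when the boundary line sweeps over a data point,
    # so it suffices to inspect the lines through z and each data point and
    # to rotate them infinitesimally in both directions.
    for ax, ay in others:
        pos = neg = same = opp = 0
        for bx, by in others:
            cross = ax * by - ay * bx
            if cross > 0:
                pos += 1                          # strictly on one side of the line
            elif cross < 0:
                neg += 1                          # strictly on the other side
            elif ax * bx + ay * by > 0:
                same += 1                         # on the line, same ray as (ax, ay)
            else:
                opp += 1                          # on the line, opposite ray
        best = min(best, pos + same, pos + opp, neg + same, neg + opp)
    return at_z + best


def scaled_depth_2d(z, points):
    """Depth scaled into [0, 1], obtained by dividing the integer depth by n."""
    return tukey_depth_2d(z, points) / len(points)


# Hypothetical dataset in general position: a triangle with one interior point.
X = [(0, 0), (4, 0), (2, 3), (2, 1)]
print(tukey_depth_2d((0, 0), X))   # vertex of the convex hull: 1
print(tukey_depth_2d((2, 1), X))   # interior data point: 2
print(tukey_depth_2d((5, 5), X))   # point outside the convex hull: 0
print(scaled_depth_2d((2, 1), X))  # 0.5
```

Under the usual convention, which is assumed here, a point z belongs to the central region at threshold k when its integer-valued depth is at least k; with the data above, (2, 1) lies in the region for k = 2 while the three hull vertices do not.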