Abstract
This article defines a new measure of central tendency (called the “iteratile”) by iterating a function which maps a triple of data into a triple of the median, mean, and trimean (sorted). An explicit formula is given with proof, along with brief discussion of potential applications.
Public Interest Statement
Often, small amounts of data and outliers make it difficult to do statistics. To measure the “typical” behavior, we usually use means or medians, but sometimes even those have problems. We offer an alternative way of measuring the center here.
The ideas in DiMarco and Savitz (2012, 2013) take $N$ values of data and study all their $m$-tiles, which separate the data into $m$ sections (where the 2-tile is the median, the 4-tiles are the quartiles, and a suitable choice of $m$ gives the sample mean). Observing that $N$ values of data have $N$-many $m$-tiles, the author wondered if mapping all the data to their $m$-tiles and iterating would have a limit, and if so, what that limit would be.
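In the case $N = 3$, we can get a formula. More precisely, given data $(a, b, c)$, for $a \le b \le c$, define the triple $\left(b, \frac{a+b+c}{3}, \frac{a+2b+c}{4}\right)$ of median, mean, and trimean, and then call $f(a, b, c)$ the sorted version of this triple, where the components are in increasing order. Denote $f$ composed with itself $n$ times by $f^n$.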
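The map described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-point trimean is taken to be $(a + 2b + c)/4$, and the names `f` and `iterate` are hypothetical helpers, not notation from the cited papers. Exact rational arithmetic is used so that rounding does not blur the iterates.

```python
from fractions import Fraction


def f(triple):
    """One step: map (a, b, c) to the sorted (median, mean, trimean)."""
    a, b, c = sorted(Fraction(x) for x in triple)
    median = b
    mean = (a + b + c) / 3
    trimean = (a + 2 * b + c) / 4  # assumed three-point trimean
    return tuple(sorted((median, mean, trimean)))


def iterate(triple, n):
    """Apply f to the triple n times, i.e. compute f^n."""
    for _ in range(n):
        triple = f(triple)
    return tuple(Fraction(x) for x in triple)


if __name__ == "__main__":
    # Print the first few iterates of a small data set with an outlier.
    for n in range(6):
        print(n, [float(x) for x in iterate((0, 0, 24), n)])
```

Printing a handful of iterates for a triple with one large value shows the three components closing in on a common value quickly.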
Theorem 1
$\lim_{n \to \infty} f^n(a, b, c)$ exists and is of the form $(I, I, I)$, where $I = \frac{3a + 8b + 3c}{14}$. (We call the limit the “iteratile” in what follows.)
Proof
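If $a = c$, we are done, so assume $a < c$.
Note first that the limit exists; $f$ is a contraction mapping since each of $b$, $\frac{a+b+c}{3}$, and $\frac{a+2b+c}{4}$ must lie between $a$ and $c$, the latter two strictly. In fact, the pairwise differences of these three values are $\frac{|a-2b+c|}{3}$, $\frac{|a-2b+c|}{4}$, and $\frac{|a-2b+c|}{12}$, each at most $\frac{c-a}{3}$, so the spread of the triple shrinks by a factor of at least 3 at every step.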
We find the general formula with the aid of two lemmas. The first allows us to focus only on the behavior of the middle term, and the second shows by induction how the middle term exhibits geometric series behavior.
Lemma 1
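$f(a, b, c)$ takes either the form $\left(b, \frac{a+2b+c}{4}, \frac{a+b+c}{3}\right)$ or $\left(\frac{a+b+c}{3}, \frac{a+2b+c}{4}, b\right)$.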
Proof
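By cases: Case 1: $b = \frac{a+c}{2}$.
The assumption is equivalent to $a + c = 2b$, so $\frac{a+b+c}{3} = b$, and a similar argument yields $\frac{a+2b+c}{4} = b$. Thus, $b = \frac{a+b+c}{3} = \frac{a+2b+c}{4}$, so $f(a, b, c) = (b, b, b)$.
Case 2: $b < \frac{a+c}{2}$.
The assumption is equivalent to $2b < a + c$. Thus, $b < \frac{a+b+c}{3}$ and $b < \frac{a+2b+c}{4}$. Further,
$$\frac{a+2b+c}{4} - \frac{a+b+c}{3} = \frac{2b - a - c}{12},$$
which is negative by assumption. Thus,
$$f(a, b, c) = \left(b, \frac{a+2b+c}{4}, \frac{a+b+c}{3}\right).$$
Case 3: $b > \frac{a+c}{2}$. Similar arguments to Case 2 yield:
$$f(a, b, c) = \left(\frac{a+b+c}{3}, \frac{a+2b+c}{4}, b\right).$$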
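The content of Lemma 1 is that the trimean always lies between the median and the mean, which follows from the identity 4 · trimean = 3 · mean + median. A small randomized check of that ordering is sketched below; it assumes the three-point trimean $(a + 2b + c)/4$, and the function names are hypothetical. It is a sanity check, not a substitute for the case analysis.

```python
import random
from fractions import Fraction


def middle_is_trimean(a, b, c):
    """Check that the trimean lies between the median and the mean."""
    a, b, c = sorted(Fraction(x) for x in (a, b, c))
    mean = (a + b + c) / 3
    trimean = (a + 2 * b + c) / 4  # assumed three-point trimean
    # Identity behind the lemma: the trimean is a weighted average
    # of the mean (weight 3/4) and the median (weight 1/4).
    assert 4 * trimean == 3 * mean + b
    return min(b, mean) <= trimean <= max(b, mean)


def check_random_triples(trials=1000, seed=0):
    """Run the ordering check on random integer triples."""
    rng = random.Random(seed)
    return all(
        middle_is_trimean(*(rng.randint(-50, 50) for _ in range(3)))
        for _ in range(trials)
    )


if __name__ == "__main__":
    print(check_random_triples())
```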
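Thus, it remains to study the difference from one iteration to the next in the middle term. Now, call $f^n(a, b, c) = (a_n, b_n, c_n)$ for convenience (so $(a_0, b_0, c_0) = (a, b, c)$, $b_1 = \frac{a+2b+c}{4}$, etc.). We now give a formula for $b_{n+1} - b_n$ which will quickly complete the proof.
Lemma 2
For any $n \ge 0$,
$$b_{n+1} - b_n = \left(-\frac{1}{6}\right)^n \frac{a - 2b + c}{4}.$$
Proof
By induction, we see that for $n = 0$:
$$b_1 - b_0 = \frac{a+2b+c}{4} - b = \frac{a-2b+c}{4}.$$
For $n = 1$, first observe by Lemma 1 that
$$b_2 = \frac{a_1 + 2b_1 + c_1}{4} = \frac{1}{4}\left(b + \frac{a+2b+c}{2} + \frac{a+b+c}{3}\right) = \frac{5a+14b+5c}{24} \tag{1}$$
since (without loss of generality) by Lemma 1,
$$(a_1, b_1, c_1) = \left(b, \frac{a+2b+c}{4}, \frac{a+b+c}{3}\right).$$
So,
$$b_2 - b_1 = \frac{5a+14b+5c}{24} - \frac{a+2b+c}{4} = -\frac{1}{6} \cdot \frac{a-2b+c}{4}. \tag{2}$$
Now, suppose that the claim holds for $n - 1$ with an arbitrary $n \ge 1$, that is, $b_n - b_{n-1} = \left(-\frac{1}{6}\right)^{n-1} \frac{a-2b+c}{4}$. We want to show that it holds for $n$, so we study $b_{n+1} - b_n$.
An argument similar to (1) gives the relation:
$$b_{n+1} - b_n = \frac{a_n + 2b_n + c_n}{4} - b_n = \frac{a_n - 2b_n + c_n}{4}. \tag{3}$$
So, Lemma 1 again gives
$$b_n = \frac{a_{n-1} + 2b_{n-1} + c_{n-1}}{4}, \tag{4}$$
and it seems we need to study $a_n$ and $c_n$ to continue.
Again, Lemma 1 tells us that $a_n$ will either be of the form $b_{n-1}$ or $\frac{a_{n-1}+b_{n-1}+c_{n-1}}{3}$, and similarly for $c_n$. Luckily, we need to only consider the sum of $a_n$ and $c_n$, so without loss of generality we know,
$$a_n + c_n = b_{n-1} + \frac{a_{n-1}+b_{n-1}+c_{n-1}}{3}. \tag{5}$$
Now we have what we need to prove the claim. By (4),
$$a_{n-1} + c_{n-1} = 4b_n - 2b_{n-1},$$
and by (5),
$$a_n + c_n = b_{n-1} + \frac{b_{n-1} + 4b_n - 2b_{n-1}}{3} = \frac{4b_n + 2b_{n-1}}{3}.$$
Using (3), we get the equation
$$b_{n+1} - b_n = \frac{1}{4}\left(\frac{4b_n + 2b_{n-1}}{3} - 2b_n\right).$$
We can write this as:
$$b_{n+1} - b_n = -\frac{1}{6}\left(b_n - b_{n-1}\right),$$
which, by the induction hypothesis, reduces to
$$b_{n+1} - b_n = \left(-\frac{1}{6}\right)^n \frac{a-2b+c}{4},$$
and this proves the claim.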
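The upshot of Lemma 2 is that each subsequent iteration “steals from a (and from c) and gives it to b,” which means we can set up the equation
$$\sum_{n=0}^{\infty} \left(b_{n+1} - b_n\right) = \frac{a-2b+c}{4} \sum_{n=0}^{\infty} \left(-\frac{1}{6}\right)^n.$$
The left-hand side is a telescoping series in the middle terms $b_n$ of $f^n(a, b, c)$, and we know that $\lim_{n \to \infty} b_n = I$. Also, the right-hand side is a simple geometric series, thus the above becomes
$$I - b = \frac{a-2b+c}{4} \cdot \frac{6}{7} = \frac{3(a-2b+c)}{14},$$
or
$$I = \frac{3a+8b+3c}{14},$$
as claimed.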
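The geometric behavior of the middle term can be checked exactly with rational arithmetic. The sketch below assumes the three-point trimean $(a + 2b + c)/4$ and the closed form $I = (3a + 8b + 3c)/14$ (the symmetric weights that match the 3/14 rate quoted in the discussion); `step`, `middle_terms`, and `iteratile` are hypothetical helper names. It lists the ratios of successive differences of the middle term, which should all equal $-1/6$ whenever $a - 2b + c \neq 0$.

```python
from fractions import Fraction


def step(triple):
    """One application of the sorted (median, mean, trimean) map."""
    a, b, c = sorted(Fraction(x) for x in triple)
    return tuple(sorted((b, (a + b + c) / 3, (a + 2 * b + c) / 4)))


def middle_terms(triple, n):
    """Return the middle terms b_0, ..., b_n of the first n iterates."""
    t = tuple(sorted(Fraction(x) for x in triple))
    out = [t[1]]
    for _ in range(n):
        t = step(t)
        out.append(t[1])
    return out


def iteratile(a, b, c):
    """Assumed closed form of the limit, (3a + 8b + 3c) / 14."""
    a, b, c = sorted(Fraction(x) for x in (a, b, c))
    return (3 * a + 8 * b + 3 * c) / 14


if __name__ == "__main__":
    bs = middle_terms((1, 2, 10), 8)
    diffs = [y - x for x, y in zip(bs, bs[1:])]
    # Ratios of successive differences of the middle term.
    print([d2 / d1 for d1, d2 in zip(diffs, diffs[1:])])
    print(float(bs[-1]), float(iteratile(1, 2, 10)))
```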
Can this be extended to higher $N$? Unfortunately, there does not exist a universally agreed-upon notion of “percentile”; different definitions will yield different iteratiles. Thus, without specifying exactly which notion is desired, one cannot expect a general result. It seems that one gets a “symmetric convex combination” in any case.
How “good” or “useful” is the iteratile? One potential use is when there are errors in the data, so the data-set changes. What measure of central tendency minimizes the difference? Results along these lines are discussed in DiMarco, Hollingsworth, and Savitz (2015) in the context of z-scores and normal distributions.
For example, if the data-set (0,0,24) were changed to (0,3,16), then the difference in the iteratiles would be zero. But of course, one is not allowed to see the data change before selecting the appropriate measure. How does one pick a central tendency measure a priori to minimize the difference in measure for an arbitrary data change? There is a general trend: when data in the middle are (expected to be) altered, use the mean, and when data at the extremes are altered, use the median. Is there an “in-between” choice? At least relative to the trimean, the iteratile should be used if values toward the extremes are altered more, and the trimean when values toward the middle are altered more. Also, it seems that skewed alterations favor the iteratile.
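The data-change example can be worked through numerically. The sketch below compares how much each measure moves when (0,0,24) becomes (0,3,16); it assumes the three-point trimean $(a + 2b + c)/4$ and the closed form $(3a + 8b + 3c)/14$ for the iteratile, and the names `measures` and `shifts` are hypothetical.

```python
from fractions import Fraction


def measures(a, b, c):
    """Median, mean, trimean, and iteratile of three values."""
    a, b, c = sorted(Fraction(x) for x in (a, b, c))
    return {
        "median": b,
        "mean": (a + b + c) / 3,
        "trimean": (a + 2 * b + c) / 4,  # assumed three-point trimean
        "iteratile": (3 * a + 8 * b + 3 * c) / 14,  # assumed closed form
    }


def shifts(before, after):
    """Absolute change in each measure when the data set changes."""
    m0, m1 = measures(*before), measures(*after)
    return {name: abs(m1[name] - m0[name]) for name in m0}


if __name__ == "__main__":
    for name, delta in shifts((0, 0, 24), (0, 3, 16)).items():
        print(f"{name}: {delta}")
```

For this particular change the iteratile does not move at all, while the median, mean, and trimean each shift; other data changes will rank the measures differently.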
In general, it is quite crude to have to resort to the median, as so much information is unaccounted for. The iteratile gives a bit more of that information and reduces the effects of outliers. As one piece of data gets arbitrarily large, the non-median measures will go to infinity, but at different speeds (mean: 1/3, trimean: 1/4, iteratile: 3/14). As a result, perhaps using the iteratile is a decent alternative to the median. But still, is there anything special about the iteratile? One could, for example, merely make a measure of central tendency that puts even less weight on the extreme values, which would do even better ... but would it be optimal? Is there really something coming from the limit? These are complicated, longstanding, and potentially unanswerable questions, and even if the iteratile itself turns out to be inferior, perhaps the idea of iteratile may lead to other more useful alternatives and generalizations. At the very least, those who appreciate calculus may simply enjoy the result at a purely esthetic level.
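The 3/14 rate can be observed directly from the iteration, without using any closed form. The sketch below increases the largest value by a fixed amount and measures how far the numerically iterated limit moves per unit of increase. It assumes the three-point trimean $(a + 2b + c)/4$, and the helper names and the choice of 40 iterations are illustrative assumptions.

```python
from fractions import Fraction


def step(triple):
    """One application of the sorted (median, mean, trimean) map."""
    a, b, c = sorted(triple)
    return tuple(sorted((b, (a + b + c) / 3, (a + 2 * b + c) / 4)))


def limit_value(triple, iterations=40):
    """Approximate the limit by the middle term after many iterations."""
    t = tuple(Fraction(x) for x in triple)
    for _ in range(iterations):
        t = step(t)
    return t[1]


def rate_in_largest(a, b, c, h=1):
    """Change in the iterated limit per unit increase of the largest value."""
    return (limit_value((a, b, c + h)) - limit_value((a, b, c))) / h


if __name__ == "__main__":
    print(float(rate_in_largest(0, 0, 24)), 3 / 14)
```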
Additional information
Notes on contributors
Blane Hollingsworth
Blane Hollingsworth received his PhD from Auburn University in stochastic differential equations, and is currently a visiting scholar at Indiana University. He is studying alternatives to means and medians as measures of central tendency, especially for small sets of data with outliers/errors.
References
- DiMarco, D., & Savitz, R. (2012). The M-tile means, A new class of measures of central tendency: Theory and applications. JIMS, 12, 48–56.
- DiMarco, D., & Savitz, R. (2013). The M-tile deviation: A new class of measures of dispersion. International Journal of Business Research, 13, 117–124.
- DiMarco, D., Hollingsworth, B., & Savitz, R. (2015). On resistant versions of the standard score. European Journal of Marketing, 15, 7–16.