Theory and Methods

On Asymptotic Distributions and Confidence Intervals for LIFT Measures in Data Mining

Pages 1717-1725 | Received 01 Jul 2013, Published online: 15 Jan 2016
 

Abstract

A LIFT measure, such as the response rate, lift, or the percentage of captured response, is a fundamental measure of effectiveness for a scoring rule obtained from data mining, and it is estimated from a set of validation data. In this article, we study how to construct confidence intervals for the LIFT measures. We point out the subtlety of this task and explain how simple binomial confidence intervals can have incorrect coverage probabilities because they omit the variation from the sample percentile of the scoring rule. We derive the asymptotic distribution in the Appendix, using empirical process theory and the functional delta method. The additional variation is shown to be related to a conditional mean response, which can be estimated by a local averaging of the responses over the scores from the validation data. Alternatively, a subsampling method is shown to provide a valid confidence interval without needing to estimate the conditional mean response. Numerical experiments are conducted to compare these methods with respect to the coverage probabilities and the lengths of the resulting confidence intervals.
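As a rough illustration of the quantities named in the abstract, the sketch below computes the three LIFT measures at a top fraction r of the validation scores, together with the simple binomial interval that the abstract warns about. The definitions used here are the usual data-mining ones and are an assumption, not taken from the article; the function names `lift_measures` and `naive_binomial_ci` and the sample data are hypothetical. The binomial interval treats the score cutoff as fixed, which is exactly the simplification the abstract says can give incorrect coverage.

```python
import math

def lift_measures(scores, responses, r):
    """Return (response_rate, lift, captured) for the top fraction r of scores.

    Assumed definitions (not from the article):
      response_rate = responders in the top group / size of the top group
      lift          = response_rate / overall response rate
      captured      = responders in the top group / all responders
    """
    n = len(scores)
    k = max(1, math.floor(r * n))  # number of cases in the top-r group
    # Rank cases from highest to lowest score; ties keep their input order.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits_top = sum(responses[i] for i in order[:k])
    hits_all = sum(responses)
    response_rate = hits_top / k
    overall_rate = hits_all / n
    return response_rate, response_rate / overall_rate, hits_top / hits_all

def naive_binomial_ci(p_hat, k, z=1.96):
    """Wald binomial interval that ignores the variation of the sample percentile."""
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / k)
    return p_hat - half, p_hat + half

# Hypothetical validation data: ten scored cases with binary responses.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responses = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
rate, lift, captured = lift_measures(scores, responses, r=0.5)
low, high = naive_binomial_ci(rate, k=5)
```

With these definitions, the captured response equals r times the response rate divided by the overall response rate whenever r·n is a whole number.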

Notes

The two sides of this equation can actually be slightly different if r is not divisible by m. However, we will ignore this difference in the notation for simplicity, since the size of the difference is at most 1/m and does not change the asymptotics in the leading order O_p(1/m).

This is established by expanding the ratio for small r and noticing that, for smooth Λ(r), we have Λ − π = O(r) for small r, and that κ = rπ/π_0.

This suggests that we can construct a 100(1 − α)% asymptotic confidence interval for θ by θ ± t_{q−1,α} s/√q. Alternatively, one can center the interval at the original whole-sample estimator θ̃ to obtain θ̃ ± t_{q−1,α} s/√q, as described by Jiang and Zhao (Citation2014, sec. 8).
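The interval in this note can be sketched as follows. The sketch assumes that q estimates of θ have already been computed, one per subsample, that s is their sample standard deviation, and that the half-width is the t critical value times s/√q. How the subsamples are formed is not specified here and is left to the caller. The function name `subsample_t_interval` and the numbers are hypothetical, and the critical value is passed in by the caller so that the article's convention for t_{q−1,α} is not guessed at.

```python
import math
import statistics

def subsample_t_interval(estimates, t_crit, center=None):
    """t-type interval from q subsample estimates of the same quantity.

    estimates: one estimate per subsample (q of them)
    t_crit:    critical value of the t distribution with q - 1 degrees of
               freedom, supplied by the caller
    center:    optional center, e.g. the whole-sample estimate; defaults to
               the mean of the subsample estimates (an assumption)
    """
    q = len(estimates)
    if q < 2:
        raise ValueError("need at least two subsample estimates")
    s = statistics.stdev(estimates)  # sample standard deviation, divisor q - 1
    if center is None:
        center = statistics.mean(estimates)
    half_width = t_crit * s / math.sqrt(q)
    return center - half_width, center + half_width

# Hypothetical estimates from q = 5 subsamples; 2.776 is the upper 2.5% point
# of the t distribution with 4 degrees of freedom.
estimates = [0.58, 0.62, 0.60, 0.61, 0.59]
interval = subsample_t_interval(estimates, t_crit=2.776)
recentered = subsample_t_interval(estimates, t_crit=2.776, center=0.605)
```

Passing `center` reproduces the alternative mentioned in the note, where the same half-width is placed around the whole-sample estimate.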

For reference when interpreting the timings, the results were obtained on the following system. CPU: Intel® Core i5-3210M CPU 2.50GHz; RAM: 8.00GB; OS: Windows® 7, 64-bit; pseudo-random number generator: R®; programming language and major software component: R®.

The bootstrap method may still be very useful for analyzing real data, since it is easy to program, and being relatively slow is not a big problem when one is not repeatedly analyzing many data sets as in a simulation study.

Additional information

Notes on contributors

Wenxin Jiang

Wenxin Jiang is Taishan Scholar Overseas Distinguished Specialist Adjunct Professor, Shandong University in China, and Professor of Statistics, Northwestern University, 2006 Sheridan Rd, Evanston, IL 60208 (E-mail: [email protected]).

Yu Zhao

Yu Zhao is Statistician at Amazon (E-mail: [email protected]).

This article is partially based on the PhD thesis of the second author. The first author was partially supported by the “111” project, grant No. B12023, at Qilu Securities Institute for Financial Studies, Shandong University in China. The authors thank Professors Tom Severini and Hongmei Jiang and the Associate Editor and the referees for helpful comments.
