Statistics in Health

Scaling of True and Apparent ROC AUC with Number of Observations and Number of Variables

Pages 771-781 | Received 06 Aug 2004, Accepted 11 Mar 2005, Published online: 15 Feb 2007
 

ABSTRACT

New technologies have recently emerged that enable simultaneous evaluation of large numbers of biological markers. The resulting marker data are often used to build predictive models that are claimed to distinguish between two or more classes of subjects. However, when there are many variables and few observations, the problem of overfitting arises: the model parameters are optimized for the observed data but may fit independent data poorly. Here we illustrate how various quantities related to true and apparent predictive ability scale with the number of markers and the number of observations (subjects). Specifically, we use a model that takes the form of a linear combination of a subset of marker variables; the model produces a propensity score, which generates an ROC curve and a corresponding area under the ROC curve (AUC), a measure of predictive ability. Given the true marker distributions, there is a parameter value for which the resulting predictive model gives the optimal true AUC. In practice, the true distributions are unknown, so experimental data are used to derive a parameter value that produces the optimal apparent AUC, where the “apparent” AUC is based on the observed rather than the true distributions. If the above model with the estimated optimal parameter is then applied to an independent data set, it has an actual AUC determined by the estimated optimal parameter and the true marker distributions. The difference between the apparent AUC and the actual AUC is the total error in estimating predictive ability. This total error can be additively decomposed into the “overfitting error”, namely the apparent AUC minus the optimal AUC, and the “mis-specification error”, namely the optimal AUC minus the actual AUC.
We focus here on how these errors scale with the number of observations and the number of markers, where the latter are divided into “null” markers which contain no information as to class status and “associated” markers which are related to class status.
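The decomposition described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it assumes independent normal markers with unit within-class variance, a score whose weights are the observed class-mean differences on the k markers with the largest observed differences, and observed differences drawn with variance 1/n around the true ones. The function names (`linear_score_auc`, `top_k_weights`, `auc_errors`) and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import numpy as np


def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear_score_auc(w, delta):
    """AUC of the score w.x when class means differ by `delta`.

    Assumes independent normal markers with unit within-class variance,
    so the case-minus-control score difference is N(w.delta, 2*||w||^2).
    """
    return norm_cdf(float(w @ delta) / (math.sqrt(2.0) * float(np.linalg.norm(w))))


def top_k_weights(diff, k):
    """Weights equal to `diff` on the k markers with largest |diff|, else 0."""
    w = np.zeros_like(diff)
    keep = np.argsort(np.abs(diff))[-k:]
    w[keep] = diff[keep]
    return w


def auc_errors(n, n_null, n_assoc, effect, k, rng):
    """One simulated data set: optimal, apparent and actual AUC plus errors."""
    # True mean differences: `effect` for associated markers, 0 for null ones.
    delta = np.concatenate([np.full(n_assoc, effect), np.zeros(n_null)])
    # Assumption: observed differences have variance 1/n around the truth.
    delta_hat = delta + rng.normal(scale=1.0 / math.sqrt(n), size=delta.size)

    w_opt = top_k_weights(delta, k)      # parameter chosen with true distributions
    w_hat = top_k_weights(delta_hat, k)  # parameter chosen with observed data

    optimal = linear_score_auc(w_opt, delta)       # best achievable true AUC
    apparent = linear_score_auc(w_hat, delta_hat)  # judged on observed differences
    actual = linear_score_auc(w_hat, delta)        # estimated model, true distributions
    return {
        "optimal": optimal,
        "apparent": apparent,
        "actual": actual,
        "overfitting": apparent - optimal,
        "misspecification": optimal - actual,
        "total": apparent - actual,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (10, 40, 160):
        runs = [auc_errors(n, n_null=500, n_assoc=10, effect=0.5, k=10, rng=rng)
                for _ in range(200)]
        mean = {key: np.mean([r[key] for r in runs]) for key in runs[0]}
        print(n, {key: round(float(val), 3) for key, val in mean.items()})
```

Under these assumptions the mis-specification error is never negative, because the optimal weights maximize the true AUC over all weight vectors supported on k markers, while the overfitting error can take either sign in a single data set. Varying `n`, `n_null`, and `n_assoc` gives a rough picture of how the two error components change with the number of observations and the numbers of null and associated markers.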

Notes

Note: f and F are the standard normal density and CDF, respectively. R is a subset of markers of fixed size, with I R  = 1 if the ith marker is in R and 0 otherwise. U iΔ and U iΔ are the true and observed differences in mean value between classes for the ith marker, respectively, and is the (known) variance within each class of the value of the ith marker.

a M1 < 5 and M1 > 0.05*M are excluded.

b Observed difference in class means has variance 1/n.
