Abstract
The technique introduced in this paper is a means for estimating and discovering underlying patterns in a large number of curves observed with heteroscedastic errors. Both the mean and the variance function of each curve are assumed unknown and varying over time. The method consists of a series of steps. We first transform the observed curves using an orthonormal basis of functions in L². In the transform domain, the non-parametric regression problem reduces to a means model. To estimate the means in the transform domain, we consider the class of linear or modulation estimators and proceed as in Beran and Dümbgen (R. Beran and L. Dümbgen, Modulation of estimators and confidence sets, Ann. Stat. 26(5) (1998), pp. 1826–1856) by minimising Stein's unbiased risk estimate. By minimising the risk over the nested subset selection class of modulators, we reduce the dimensionality of the means space. We show that in the transform space, the risk estimate is asymptotically optimal in Pinsker's minimax sense over Sobolev ellipsoids under heteroscedastic errors. Coefficient estimation and dimensionality reduction via optimal risk estimation are essential for accurate estimation of cluster membership. We illustrate our technique by estimating and clustering a large number of curves in both a synthetic example and a specific application. In this application, we analyse the research and development expenditure of a subset of companies in the Compustat Global database. We show that our method compares favourably to two alternative approaches.
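The transform and nested subset selection steps can be illustrated with a minimal sketch. It is not the paper's implementation: the discrete cosine basis, the function names, and the assumption that the error variances are known (the paper treats them as unknown) are all illustrative choices. For a modulator that keeps the first J coefficients and discards the rest, the unbiased risk estimate is the sum of the variances of the kept coefficients plus the sum of (z_k² − var_k) over the discarded ones.

```python
import math

def cosine_basis(n, k):
    """k-th orthonormal discrete cosine (DCT-II) vector on n grid points (assumed basis)."""
    scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    return [scale * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n)]

def transform(y):
    """Coefficients of the observed curve y in the cosine basis."""
    n = len(y)
    return [sum(b * v for b, v in zip(cosine_basis(n, k), y)) for k in range(n)]

def coefficient_variances(noise_var):
    """Variance of each coefficient given time-varying error variances
    (assumed known here, and errors assumed independent over time)."""
    n = len(noise_var)
    return [sum(b * b * s for b, s in zip(cosine_basis(n, k), noise_var))
            for k in range(n)]

def nested_subset_risk(z, var):
    """Unbiased risk estimates for keeping the first J coefficients, J = 0..n.

    Returns (best_J, risks), where risks[J] is the estimate for cutoff J.
    """
    # J = 0: every coefficient is discarded.
    risk = sum(zk * zk - vk for zk, vk in zip(z, var))
    risks = [risk]
    for zk, vk in zip(z, var):
        # Moving one coefficient from "discarded" to "kept":
        # drop its (z^2 - var) term and add its variance term.
        risk += vk - (zk * zk - vk)
        risks.append(risk)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

def reconstruct(z, cutoff):
    """Curve estimate from the first `cutoff` coefficients."""
    n = len(z)
    fit = [0.0] * n
    for k in range(cutoff):
        for t, b in enumerate(cosine_basis(n, k)):
            fit[t] += z[k] * b
    return fit
```

The retained coefficients `z[:best]` give a low-dimensional summary of each curve, which is the kind of representation a clustering step would then operate on.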
Acknowledgements
The author is grateful to Larry Wasserman for his input in some of the proofs in this paper.