Abstract
Successful analysis of superlarge data sets requires statistical procedures that automatically clean the data and uncover simple structure. The protocol described applies to multivariate industrial data from continuous manufacturing processes with feedback and feedforward control. Our methods form a twelve-step sequence that edits and re-lags the time series and applies diagnostics to look for subtle data flaws. At different stages, the protocol will reject data, impute data, re-lag the time series, flag categories of suspicious data, and divide the data set into more homogeneous subsets. The result is a clean data set ready for analysis using standard statistical packages or software tools. Although there is no guarantee that every corruption has been caught and corrected, the output data set is more thoroughly examined than traditional human-intensive methods can achieve. To assist in this preliminary analysis, four graphical methods are described, which were developed in studies of glass manufacture at PPG Industries' production plants and of sheet aluminum production at Alcoa.
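The cleaning operations the abstract names (rejecting gross outliers, imputing the resulting gaps, and re-lagging one series against another) might be sketched as below. This is an illustrative sketch only, not the paper's twelve-step protocol: the function names, the robust-SD threshold, and the cross-correlation search are assumptions introduced here for concreteness.

```python
import numpy as np

def reject_outliers(x, k=4.0):
    """Replace values more than k robust SDs (1.4826 * MAD) from the
    median with NaN. Threshold k=4.0 is illustrative."""
    x = np.asarray(x, dtype=float).copy()
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    scale = 1.4826 * mad if mad > 0 else np.nanstd(x)
    x[np.abs(x - med) > k * scale] = np.nan
    return x

def impute(x):
    """Fill NaNs by linear interpolation over the sample index."""
    x = np.asarray(x, dtype=float).copy()
    idx = np.arange(len(x))
    ok = ~np.isnan(x)
    x[~ok] = np.interp(idx[~ok], idx[ok], x[ok])
    return x

def best_lag(x, y, max_lag=10):
    """Estimated delay (in samples) of y relative to x: the shift of y
    that maximizes its correlation with x, searched over +/- max_lag."""
    n = len(x)
    best, best_corr = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: n - lag], y[lag:]
        else:
            a, b = x[-lag:], y[: n + lag]
        c = np.corrcoef(a, b)[0, 1]
        if c > best_corr:
            best, best_corr = lag, c
    return best

# Synthetic demonstration: a downstream sensor lagging an upstream one
# by 3 samples, with one gross spike in the upstream record.
t = np.linspace(0.0, 10.0, 200)
dt = t[1] - t[0]
upstream = np.sin(t)
downstream = np.sin(t - 3 * dt)   # downstream lags upstream by 3 samples

dirty = upstream.copy()
dirty[50] = 100.0                 # gross sensor spike
cleaned = impute(reject_outliers(dirty))
lag = best_lag(upstream, downstream, max_lag=10)
```

In a real application, an estimated lag like this would be used to shift the downstream series before any cross-variable analysis; flagging and subset division would follow from similar screening rules applied per variable.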
Notes on contributors
David L. Banks
Dr. Banks is an Assistant Professor in the Department of Statistics. He is a Member of ASQC.
Giovanni Parmigiani
Dr. Parmigiani is an Assistant Professor in the Institute of Statistics and Decision Sciences.