ISSUE FORUM: ADVANCES IN QUANTITATIVE ANALYSIS

The FAQs on Data Transformation

Pages 379-397 | Published online: 01 Dec 2009
 

Acknowledgements

The author acknowledges the initial work on this topic coauthored with Professor C. L. Bauer of Marquette University and the kind insistence by Professor D. J. Hample that this topic be included in Communication Monographs' Issues Forum. Special thanks go to Professor D. A. Cai for her accurate challenges to what I thought was clear writing. In addition, I wish to thank Professor M. R. Allen and Ms. I. A. Cionea for their many helpful comments, and Professor A. F. Hayes for his thoughtful and thorough reading and suggestions. Unfortunately, any remaining errors are the author's.

Notes

1. Technically, residual refers to sample data whereas error refers to population data and, presumably, the true model. Assumptions apply to errors, but assumptions are evaluated by examining residuals. This distinction is important when tests of assumptions are discussed.

2. The idea that a variable is “well-behaved” may refer to one or several aspects of a variable. One meaning, which is emphasized here, is that the variable of interest has a relatively symmetric distribution. In other cases the term can refer to a dependent variable that has homoscedastic residuals in a theoretically sensible linear regression.

3. Indeed, a variable cannot actually be distributed normally: The tails cannot go to ±∞ and actual data cannot be absolutely continuous. However, a variable may approximate a normal distribution.

4. The transformation here is both more general and more limited than the standard Box-Cox power transformation (see Box & Cox, 1964; Whistler, White, Wong, & Bates, 2004, p. 155), because we have added a constant but have not used a function that incorporates λ and the geometric mean to keep the units of measurement constant. In addition, if one wishes to have the transformation correlate positively with the original scores, one can divide the transformation by λ or multiply the transformation by −1 when λ is negative. See also Bauer and Fink (1983) and Fox (1997, p. 322) regarding this matter. In SHAZAM (Whistler et al., 2004), one may transform (1) the dependent variable only, which is referred to as the classical Box-Cox model; (2) the dependent variable and the independent variables to the same value of λ, which is referred to as the extended Box-Cox model; (3) the independent variables only, each to its own value of λ, which is referred to as the Box-Tidwell model; and (4) all variables, independent and dependent, each to its own value of λ, which is referred to as the combined Box-Cox and Box-Tidwell model.

5. The constant k serves two purposes. First, some values of λ will result in Y* being undefined: For example, the logarithm of Y, if Y ≤ 0, is undefined, as is the square root of a negative number. Thus, if the transformation requires that all Ys be nonnegative or positive, and some values violate this condition, then a k can be selected that corrects this problem. Hamblin (1971a, 1971b) associates this constant with correcting for the origin in ratio scales. Mosteller and Tukey (1977) call transformations that employ an additive constant “started” transformations, as in “started logs” and “started roots.” Second, in addition to varying λ, k can be varied to search for the optimal single-bend transformation.
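
A minimal sketch of the ideas in Notes 4 and 5, under stated assumptions: the transformation is taken to have the form Y* = (Y + k)^λ, with the natural logarithm used when λ = 0 (the usual Box-Cox convention, assumed here rather than taken from these notes). The function name `started_power` and the `keep_order` flag are hypothetical, introduced only for illustration.

```python
import math


def started_power(y, lam, k=0.0, keep_order=True):
    """Hypothetical helper: "started" power transformation (y + k) ** lam.

    Assumptions (not stated in the notes): lam == 0 is treated as the
    natural log, and the sign flip for negative lam is the "multiply
    by -1" option mentioned in Note 4.
    """
    shifted = y + k
    # Logs and negative powers need a strictly positive argument;
    # positive powers (e.g., roots) need a nonnegative one.
    if shifted < 0 or (shifted == 0 and lam <= 0):
        raise ValueError("choose a larger k: y + k is out of range for this lam")
    if lam == 0:
        return math.log(shifted)
    value = shifted ** lam
    # A negative power reverses the order of the scores; multiplying
    # by -1 restores a positive association with the original scores.
    if lam < 0 and keep_order:
        return -value
    return value


# A "started root": k = 4 moves the score -3 into the defined range.
started_root = started_power(-3.0, 0.5, k=4.0)
# A "started log": k = 1 makes log defined at a score of 0.
started_log = started_power(0.0, 0.0, k=1.0)
```

Searching for a suitable single-bend transformation would then amount to evaluating this function over a grid of λ and k values and comparing the resulting distributions or residuals; the criterion for "optimal" is left open here, as it is in the note.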

6. Some nonlinear relations may not be able to be converted to linearity by transforming the original data. Such nonlinear relations are referred to as intractable.

7. A distinction needs to be made between linear in parameters and linear in variables. For example, a regression equation is of the form $\hat{Y} = b_0 + b_1 X_1 + \cdots + b_k X_k$, where $\hat{Y}$ is the predicted value of the dependent variable, $b_0$ is the intercept, and $b_1, \ldots, b_k$ are the coefficients of $X_1, \ldots, X_k$, respectively. This equation is linear in parameters (the set of coefficients to be estimated). Note, however, that any given independent variable could be a variable that is raised to a power (e.g., $X^2$), that is an argument of an arithmetic or statistical function (e.g., $\log[X]$), that is a nonlinear combination of variables (e.g., $X_q \times X_p$), or of a form other than a variable to the first power. A regression is linear in variables if the variables in the regression are variables to the first power. The general linear model is appropriate for equations that are linear in parameters regardless of whether they are linear in variables.
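
The distinction in Note 7 can be illustrated with a small sketch. The data below are made up for illustration and follow Y = 2 + 3 log(X) exactly, a relation that is nonlinear in the variable X but linear in the parameters. The helper `ols_simple` is a hypothetical name; it implements ordinary least squares for a single predictor.

```python
import math


def ols_simple(x, y):
    """Least-squares intercept and slope for one predictor (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Illustrative data only: Y is an exact function of log(X).
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [2.0 + 3.0 * math.log(xi) for xi in x]

# Regressing Y on log(X) is still a linear model, because the
# equation is linear in the coefficients b0 and b1.
b0, b1 = ols_simple([math.log(xi) for xi in x], y)
```

Because the model is linear in b0 and b1, the ordinary least-squares formulas apply unchanged once log(X) is treated as the predictor; regressing Y on the untransformed X would not recover the generating coefficients.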

8. There are methods to analyze bounded or truncated variables, such as Poisson or negative binomial regression, tobit regression, probit regression, and ordinal logit regression. They may be statistically appropriate alternatives to data transformation (Aldrich & Nelson, 1984; Long, 1997). However, the analyst also needs to consider whether these methods elucidate the interplay of theory and measurement that is fundamental to the discussion in this paper.

9. Some of this discussion is taken with little change from Fink et al. (2006).

10. Many authors of the literature on data transformation pose this same question for their readers, reflecting in part the social scientist's lack of familiarity and practice with data transformation.

Additional information

Notes on contributors

Edward L. Fink

Edward L. Fink is a professor in the Department of Communication at the University of Maryland.
