SPECIAL SECTION: TEST VALIDATION

Is My Test Valid? Guidelines for the Practicing Psychologist for Evaluating the Psychometric Properties of Measures

Pages 261-283 | Published online: 07 Dec 2011
 

Abstract

A general logic for data-based test evaluation based on Slaney and Maraun's (2008) framework is described. On the basis of this framework and other well-known test theoretic results, a set of guidelines is proposed to aid researchers in the assessment of the psychometric properties of the measures they use in their research. The guidelines are organized into eight areas and range from general recommendations, pertaining to understanding different psychometric properties of quantitative measures and at what point in a test evaluation their respective assessments should occur, to clarifications of core psychometric concepts such as factor structure, reliability, coefficient alpha, and dimensionality. Finally, an illustrative example is provided with a data-based test evaluation of the Hare Psychopathy Checklist-Revised (Hare, 1991) as a measure of psychopathic personality disorder in a sample of 384 male offenders serving sentences in a Canadian correctional facility.

Acknowledgments

We thank M. Maraun for providing comments and suggestions on an earlier draft of the manuscript.

Notes

By this we mean a function of the item responses that would result when a population of individuals under study is (hypothetically) exposed to the measurement operation (i.e., in which responses to each item—composed of an item stem and a set of response options—is provided by or for each participant).

In the interests of providing a general treatment, we shall henceforth use the term attribute to designate whatever property, process, characteristic, theoretical entity, or general phenomenon that a given test is presumed to measure. Moreover, since the aim of data-based test evaluation is to form “reliable” and “valid” composites of items which will be entered into analyses aimed at testing substantive hypotheses of interest, for expository clarity, we shall presume that a test or measure consists of a set of items designed to measure a single common attribute. The basic logic we present can be easily extended to tests that have been designed to measure multiple attributes.

See Slaney and Maraun (2008) for a more detailed description of the framework.

Elsewhere (e.g., Slaney et al., 2009; Slaney et al., 2010) these have been called internal score validity, score precision, and external score validity in order to emphasize that the psychometric features with which our framework is concerned are properties of scores (i.e., composites) and not of tests, per se.

Often this is couched in terms of the multidimensionality versus unidimensionality of a set of items.

One exception might be if the aim is merely to create a composite that predicts some criterion or that sorts individuals in a particular way without the assumption that the items measure a common property. In such cases, a composite would be viewed as more of a heuristic than an indicator of an attribute. Bollen and Lennox (1991) and Edwards and Bagozzi (2000) describe the distinction between these two different views of measurement in the context of their discussion of reflective versus formative models. We also elaborate briefly on this distinction below.

The reason for this is that the form of logic on which statistical modeling is based is modus tollens. Thus, without considering inferential error, if the material implications of a given model hold, this means only that the particular model is one of potentially many equally feasible models, each of which implies the same material implication. However, if the material implication is not realized, this means that the model in question can be ruled out with certainty (within inferential tolerances) because p→q is equivalent to ∼q→∼p (see Slaney and Maraun, 2008, for an explication of this point).
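The contrapositive equivalence underlying this logic can be checked mechanically. A minimal sketch (Python is used here purely for illustration; none of these names come from the article):

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material implication: p -> q is false only when p is true and q is false."""
    return (not p) or q

# p -> q and its contrapositive ~q -> ~p agree on every truth assignment,
# which is what licenses ruling a model out when its implication fails.
equivalent = all(
    implies(p, q) == implies(not q, not p)
    for p, q in product([True, False], repeat=2)
)
```

Confirming the implication (q), by contrast, leaves every model with the same implication standing, which is the asymmetry the footnote describes.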

Also, it should be noted that although most of the measurement models used in psychology are latent variable models, they need not be. It just so happens that latency has in this context been taken to represent a received view of psychological measurement according to which constructs are unobservable and may only be measured indirectly (see Maraun & Peters, 2005; Maraun, Slaney, & Gabriel, 2009; Maraun, 2010; Slaney, in press). Below, we address the confusions that occur over the use of latent and other psychometric concepts.

However, this is not to say that any technique will necessarily be appropriate or useful in a given situation in which the researcher's aim is either exploratory or confirmatory, just that there are no logical grounds for deeming techniques appropriate or not on the basis of the technique itself.

Because EFA and CFA appear so much in the applied test analytic literature, we would like to point out that although EFA may have legitimate uses in other areas of test analysis, it should not, generally speaking, be advocated for use in data-based test evaluation, primarily because anything that could be done with an EFA (i.e., testing the fit of a particular m-dimensional solution) can be handled much better within a CFA framework.

Parallelism is a testable property of a set of item responses and should not merely be assumed to hold.

Guttman (1945) would note that producing parallel measures such that reliability could be estimated by the correlation between the two would be plagued with ineluctable practical difficulties, and that the best one could hope for from correlations between tests was an estimate of lower bounds to reliability. He would produce results on six such lower bounds in the 1945 paper, one of which was a generalization of Kuder and Richardson's (1937) KR20 that would be further elaborated by Cronbach in 1951 and which is known today variously as coefficient alpha, Cronbach's alpha, and the Guttman-Cronbach lower bound.
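As a concrete illustration, coefficient alpha can be computed directly from an n-persons by k-items score matrix; the function and the small data set below are invented for illustration only and do not come from the article:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an n x k matrix of item scores:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the composite
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Six hypothetical respondents answering three positively correlated items
scores = np.array([
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
    [3, 3, 4],
    [2, 2, 3],
], dtype=float)
alpha = cronbach_alpha(scores)
```

Consistent with the lower-bound interpretation the footnote describes, the resulting value is an estimate of a lower bound to the composite's reliability, not of its dimensionality.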

In fact, test-retest estimates conflate consistency of measurement and stability of attribute; however, these properties can, at least in theory, be de-conflated and estimated separately with the use of models of measurement that are more sophisticated than the comparably simpler classical true score model.

And, furthermore, will, for any value greater than zero, become larger as k increases.
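Assuming the increase-with-length claim refers to the standard Spearman-Brown lengthening result (an assumption; the formula below is the textbook version, not quoted from the article), the behavior can be sketched as:

```python
def spearman_brown(rho: float, k: float) -> float:
    """Reliability of a composite lengthened by a factor of k,
    given reliability rho of the original composite."""
    return k * rho / (1 + (k - 1) * rho)

# Reliability of a hypothetical test (rho = .5) as its length is doubled repeatedly
lengthened = [spearman_brown(0.5, k) for k in (1, 2, 4, 8)]
```

For any starting reliability above zero, each doubling of k yields a strictly larger value, approaching 1 in the limit.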

A common strategy used to overcome this limitation has been to estimate a nonlinear correlation matrix and an associated asymptotic covariance matrix to use as inputs for a generalized least squares estimation of the linear factor model parameters and chi-square fit statistic. Although this strategy has been advocated through a number of methods (e.g., Jöreskog, 1993; Jöreskog & Sörbom, 1988a, 1988b; Muthén, 1984, 1987), Steiger (1994) has questioned the feasibility of such methods and whether their employment does in fact offer the fix they were intended to provide for the nonlinearity of the item or attribute regressions.

The reader is referred to Allison (2002) and McKnight, McKnight, Sidani, and Figueredo (2007) for two non-technical treatments.

However, it is true that there has been equivocation in the uses of internal consistency, item homogeneity, and unidimensionality (see Hattie, 1984; McDonald, 1981). The important point here is to distinguish clearly between a class of internal consistency reliability estimates and the properties of item homogeneity and unidimensionality, respectively, and to keep clear that reliability estimates generally speaking should not be used as indices of either homogeneity or unidimensionality.

Note that Cooke and Michie (2001) further organized each of the facet scales into testlets. For methodological reasons, we will not combine items into testlets for our analysis.

We thank these authors for very kindly providing us with a copy of the LDIP program for this analysis.

Note that in choosing a principal component model as the QC of this formative FS of the PCL-R, we are effectively treating the linear composite that we will form on the basis of the model as an error-free measure of the attribute that it defines. We recognize that in practice it would rarely be satisfactory to constrain the general attribute of psychopathy to be defined exhaustively by the twenty PCL-R items, and thus a formative model that includes an error component (such as is described in Bollen & Lennox, 1991, and Edwards & Bagozzi, 2000) would be desired. We opt for the principal component model here for reasons of parsimony and ease of illustration.
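The composite-forming step can be sketched as follows. The data below are synthetic stand-ins for an n x 20 matrix of 0/1/2 item scores; this is not the PCL-R sample analyzed in the article, only an illustration of the first-principal-component computation:

```python
import numpy as np

# Synthetic stand-in for an n x 20 matrix of 0/1/2 item scores (NOT the
# article's PCL-R data); 200 hypothetical respondents.
rng = np.random.default_rng(0)
items = rng.integers(0, 3, size=(200, 20)).astype(float)

# Standardize the items, then take the first principal component of their
# correlation matrix; the resulting linear composite is treated as an
# error-free formative measure of the attribute it defines.
z = (items - items.mean(axis=0)) / items.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
weights = eigvecs[:, np.argmax(eigvals)]   # loadings of the first component
composite = z @ weights                    # one composite score per respondent
```

Because the component is an exact linear function of the items, no error term appears, which is precisely the limitation the footnote acknowledges relative to formative models with an error component.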

As noted above, most indices of precision are not well suited for assessing the precision of a formative composite. One exception is test-retest coefficients. However, since the data illustrated here were used in a study that did not contain a retest component, we cannot provide such an estimate.

Ideally, we would have employed two separate ordinal logistic regression analyses, given that the criterion variables in both cases (i.e., public safety and criminal risk ratings, respectively) are each measured on a 3-point graded scale and the predictor (i.e., PCL-R total score) is graded with many categories. However, because an ordinal regression requires that no or very few cells be empty, given the sparseness of the current data in that regard, we opted to conduct an ANOVA and flip the model around in our tests of main effects.
