
Mathematical expression and sampling issues of treatment contrasts: Beyond significance testing and meta-analysis to clinically useful research synthesis

Pages 58-75 | Received 31 Jan 2016, Accepted 01 Aug 2016, Published online: 01 Sep 2016
 

Abstract

The more two treatments’ outcome distributions overlap, the more ambiguity there is about which would be better for some clients. Effect sizes and t-statistics ignore this ambiguity by indicating nothing about the contrasted treatments’ outcome ranges, although the wider these ranges are, the smaller these statistics become and the more influences other than the given treatments matter for outcomes. Treatment contrast data analysis logically requires valid measurement of all the influences on outcomes. Each influence, measured or not, is somehow sampled in every treatment contrast, and the nature of this sampling affects the contrast’s two outcome distributions. Sampling also affects replications of a treatment contrast, and proper meta-analysis logically presupposes sampling that produces the same statistically expected outcome distributions for each replicate. Because scientific human psychology is most fundamentally about individual persons and cases, rather than aggregations of persons or cases, contrasted treatments’ outcome distributions ought eventually to be disaggregated to whatever input dimension gradation configurations collapse their ranges to zero by jointly taking account of every influence on outcomes. Only then are the data about individual persons or cases, and so relevant to psychotherapy theory.


Notes

1 For more on the logic and mathematics of NHST see, e.g., Abelson (1997), Chow (1996, pp. 13–44), Hacking (1965, pp. 89–114), Kanji (1993), Kline (2013, pp. 67–93), Mayo (1996), Mulaik, Raju, and Harshman (1997), Nickerson (2000), Rodgers (2010), Schneider (2013), Seidenfeld (1979, pp. 28–102), and Ziliak and McCloskey (2008). The confidence interval is often proposed as a replacement for or supplement to the NHST (e.g., Balluerka et al., 2005; Nickerson, 2000), but it is such a close relative of NHST that it shares some of the same problems and unjustifiable assumptions, such as that chance does not operate after and independently of random assignment and that outcome unreliability is random error. Offering a third option of “keep trying” or “suspend judgment” in addition to “acceptance” and “rejection” of H0 (e.g., Mayo, 1996, pp. 361–411) still involves NHST and so does not go far enough. The issue of statistical power does not need to be and so is not dealt with here (see Hunter & Schmidt, 2004, pp. 8–15, and Schmidt & Hunter, 2005, on why it is irrelevant).
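
As a concrete illustration of the close kinship between the confidence interval and NHST noted above, the following sketch (Python with NumPy and SciPy; the simulated outcome data are purely hypothetical) shows that the 95% CI for a two-group mean difference excludes zero exactly when the two-tailed pooled-variance t-test rejects H0 at α = .05, so the CI inherits the same assumptions.

    # Minimal sketch, assuming NumPy and SciPy; data are simulated stand-ins
    # for the outcomes of a Tjk contrast.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    yj = rng.normal(loc=0.4, scale=1.0, size=40)   # hypothetical Tj outcomes
    yk = rng.normal(loc=0.0, scale=1.0, size=40)   # hypothetical Tk outcomes

    nj, nk = len(yj), len(yk)
    t, p = stats.ttest_ind(yj, yk)                 # pooled-variance two-sample t
    sp2 = ((nj - 1) * yj.var(ddof=1) + (nk - 1) * yk.var(ddof=1)) / (nj + nk - 2)
    se = np.sqrt(sp2 * (1 / nj + 1 / nk))          # pooled standard error
    diff = yj.mean() - yk.mean()
    half = stats.t.ppf(0.975, nj + nk - 2) * se    # CI half-width
    print(f"t = {t:.3f}, p = {p:.3f}, "
          f"95% CI = [{diff - half:.3f}, {diff + half:.3f}]")
    # The CI excludes 0 exactly when p < .05: the interval restates the test.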

Kline (2013, pp. 95–106) shows how the NHST is misunderstood by many who employ it or who rely on the conclusions of studies in which it was employed. This creates misunderstanding of what a statistically insignificant t means: that a t at least as extreme as the one obtained has a greater-than-α probability (generally 5% or 1%, and often two-tailed, so in effect α/2 per tail) of occurring if H0 is true, which must itself be uncertain or no RCT would have been undertaken. Failing to reject H0 is simply not enough, on clinical grounds, to consign a Tj for some serious malady to oblivion if any client may have benefitted from it (Krause, 2011); it properly serves instead merely to retain whatever Tk is used as the control or baseline condition in the Tjk contrast (on which see Krause & Lutz, 2009a).
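
The definition just given can be made concrete with a short sketch (Python with SciPy; the obtained t and n are hypothetical) that computes the two-tailed probability, under H0, of a t at least as extreme as the one obtained, referred to the t distribution with n − 2 degrees of freedom.

    # Minimal sketch, assuming SciPy; t_obtained and n are hypothetical values.
    from scipy import stats

    t_obtained = 1.40        # hypothetical t from an RCT's Tjk contrast
    n = 60                   # hypothetical total number of cases
    df = n - 2

    p_two_tail = 2 * stats.t.sf(abs(t_obtained), df)
    print(f"P(|t| >= {t_obtained} given H0) = {p_two_tail:.3f}")
    # A p above alpha only retains H0 at that alpha; it does not show that
    # no client could have benefitted from Tj.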

Because the statistical insignificance of a t depends in part upon n, it always leaves open the possibility that this insignificance may be overcome by increasing n and thereby the RCT’s power. However, increasing n enough becomes more costly or less feasible the smaller t is (see Schmidt & Hunter, 2005). Beyth-Marom, Fidler, and Cumming (2008) provide a helpful context for thinking about all this.
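
To give a rough sense of the cost involved, the following sketch (using the TTestIndPower class from statsmodels; the effect sizes are hypothetical) reports the approximate n per group needed for 80% power at α = .05, which grows rapidly as the standardized effect behind a nonsignificant t shrinks.

    # Minimal sketch, assuming statsmodels; effect sizes are hypothetical.
    from statsmodels.stats.power import TTestIndPower

    power_calc = TTestIndPower()
    for d in (0.5, 0.3, 0.1):
        n_per_group = power_calc.solve_power(effect_size=d, alpha=0.05, power=0.80)
        print(f"d = {d}: about {n_per_group:.0f} cases per group")
    # Roughly 64, 176, and 1571 cases per group, respectively, which is why
    # overcoming insignificance by enlarging n quickly becomes costly.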

Using Bayesian notions (for an introduction see, e.g., Kline, 2013, pp. 289–307) to explicate the NHST (as, e.g., Trafimow, 2003, does) needs mentioning, but how can the Bayesian priors P(H0) and P(tn−2 | ∼H0), and the working assumption that there are no options other than a mutually contradictory H0 and HA, be logically or empirically justified? Such priors are irrelevant to whether or not a t should be taken seriously, because if it was produced by a competent RCT (and its obtained results properly reported: Bakker et al., 2012) it obviously should be taken seriously by those who find it usefully informative, no matter its size (a point made with scholarly vigor by Ziliak & McCloskey, 2008). Furthermore, any obtained t faces the sampling problems of the RCT regardless of whether it is adjusted by some prior (as, e.g., Kruschke, 2013, proposes).
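
The dependence on priors objected to here can be made explicit with a small sketch (Python with SciPy; the obtained t, the assumed noncentrality of the alternative, and the priors are all hypothetical) applying Bayes’ rule to the posterior probability of H0.

    # Minimal sketch, assuming SciPy; all numerical inputs are hypothetical.
    from scipy import stats

    t_obs, df = 1.40, 58
    lik_h0 = stats.t.pdf(t_obs, df)             # density of t_obs under H0
    lik_alt = stats.nct.pdf(t_obs, df, nc=2.0)  # under one assumed alternative

    for prior_h0 in (0.2, 0.5, 0.8):            # hypothetical priors P(H0)
        post_h0 = (lik_h0 * prior_h0) / (
            lik_h0 * prior_h0 + lik_alt * (1 - prior_h0))
        print(f"P(H0) = {prior_h0}: P(H0 | t) = {post_h0:.2f}")
    # The same obtained t yields very different posteriors as the prior and
    # the assumed alternative change, which is the note's point.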

2 Group data allow clinical prediction only by interpreting group relative frequencies as individuals’ probabilities, which illogically reverses the proper direction of inference, because the latter influence the former but not vice versa (see, e.g., Krause et al., 2011). This common misinterpretation needs to be routinely checked by examining the distribution of outcomes for all the individual cases subsequently assigned a treatment on the basis of a prior RCT’s or effectiveness study’s t-test, or of the ES of several such studies.
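
A minimal sketch of the recommended check (Python with NumPy; the RCT rate and the follow-up outcomes are hypothetical) simply tabulates how the individual cases subsequently assigned the treatment actually fared, rather than reading the group relative frequency as each case’s probability.

    # Minimal sketch, assuming NumPy; all figures are hypothetical.
    import numpy as np

    rct_improvement_rate = 0.65   # relative frequency from the prior RCT
    later_case_outcomes = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])  # 1 = improved

    print(f"RCT group rate: {rct_improvement_rate:.2f}")
    print(f"Rate among subsequently treated cases: {later_case_outcomes.mean():.2f}")
    # Only this follow-up distribution describes how those individual cases
    # fared; the group rate does not fix any single case's probability.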

3 Obviously, more than client populations alone are sampled from to constitute the individual cases in an RCTjk, but such aspects of these cases as therapists and therapy settings are generally not randomly sampled or assigned. The causal influences from such sources are therefore considered here to be included in Population D.

4 Or else that each supposed replicate has representatively or randomly sampled these populations in some representatively or randomly different portion of their distributions, as in first cross-stratifying and then randomly sampling a variety of treatment units that collect comparable effectiveness data, and then randomly sampling cases from therapists within these units who practice, e.g., cognitive-behavioral therapy and from those who practice, e.g., client-centered therapy for the treatment contrast (e.g., Kish, 1965, pp. 148–216; McClintock, Brannon, & Maynard-Moody, 1979). This represents conceiving of moderators as stratifiers of populations: “If there is variation across settings [i.e., here Tjk contrasts] that is large enough to be theoretically important, then we must identify the moderator variables that produce this variation” (Hunter & Schmidt, 2004, p. 513) and use them “ … to split the studies into subsets and apply meta-analysis … to each subset separately” (Hunter & Schmidt, 2004, p. 515).
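
A minimal sketch of this subset-by-subset use of a moderator (Python with NumPy; the effect sizes, variances, and moderator labels are hypothetical, and a simple inverse-variance mean is used per subset purely for brevity) is:

    # Minimal sketch, assuming NumPy; studies and their values are hypothetical.
    import numpy as np

    studies = [  # (moderator level, effect size d, sampling variance of d)
        ("cognitive-behavioral", 0.55, 0.04),
        ("cognitive-behavioral", 0.40, 0.06),
        ("client-centered",      0.20, 0.05),
        ("client-centered",      0.10, 0.07),
    ]

    for level in ("cognitive-behavioral", "client-centered"):
        d = np.array([s[1] for s in studies if s[0] == level])
        v = np.array([s[2] for s in studies if s[0] == level])
        w = 1.0 / v
        print(f"{level}: pooled d = {np.sum(w * d) / np.sum(w):.2f} (k = {len(d)})")
    # Each moderator level gets its own synthesis, as in the quotation above.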

Hunter and Schmidt (2004) present the purpose of meta-analysis as representing population parameters, rather than simply summarizing some available set of RCTs: “Conclusions limited to a specific subset of studies are not scientifically informative” (Hunter & Schmidt, 2004, p. 397) and “ … our purpose should … be to use the flawed study findings to estimate the underlying unobserved relationships among constructs” (Hunter & Schmidt, 2004, p. 513). To do so they interpret the variation in the ES included in a meta-analysis, insofar as it is not due to moderators (Hunter & Schmidt, 2004, pp. 401–406), as random sampling error, and so approve only random-effects models of meta-analysis (Hunter & Schmidt, 2004, p. 515 and pp. 393–399; also see Raudenbush, 2009): “The a priori assumption that there are no [inter-RCT] differences –made by all fixed effects meta-analysis methods– is virtually never justifiable” (Hunter & Schmidt, 2004, p. 515). They take measurement error as well as sampling error to be random although neither is demonstrably random (i.e., neither has a somehow determinable statistically expected population true, or parameter, value upon which it tends stochastically to converge as sample size increases), and so both are merely presumed to be random. However, what could possibly be the population of replicates of a Tjk contrast, or of errors of measurement, from which a random sample could conceivably be drawn? Although such assumptions of randomness are certainly proper for pure mathematics, they are logically absurd for empirical science despite their convenience.
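
For concreteness, the random-effects treatment of inter-RCT variation questioned here can be sketched with a DerSimonian-Laird estimate of the between-study variance (Python with NumPy; the effect sizes and variances are hypothetical), in which whatever variation the model cannot attribute to within-study sampling is simply labelled random.

    # Minimal sketch, assuming NumPy; effect sizes and variances are hypothetical.
    import numpy as np

    d = np.array([0.55, 0.40, 0.20, 0.10])   # observed effect sizes
    v = np.array([0.04, 0.06, 0.05, 0.07])   # their sampling variances
    w = 1.0 / v

    d_fixed = np.sum(w * d) / np.sum(w)               # fixed-effect mean
    q = np.sum(w * (d - d_fixed) ** 2)                # heterogeneity Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)           # DerSimonian-Laird tau^2

    w_re = 1.0 / (v + tau2)                           # random-effects weights
    d_random = np.sum(w_re * d) / np.sum(w_re)
    print(f"fixed mean = {d_fixed:.2f}, tau^2 = {tau2:.3f}, "
          f"random-effects mean = {d_random:.2f}")
    # The leftover between-study variation is presumed random, which is
    # exactly the presumption argued above to be undemonstrable.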

5 Covariance adjustment of a Djk/sD or Djk/sk is simply too crude a way of taking the influence of covariates into account. This is because (a) it requires the covariates’ orthogonality with the Tjk contrast (finessed perhaps by entering them before this contrast in a step-up multiple regression, but at the cost of arbitrarily reducing its t and ES); (b) correlation does not per se entail causal influence, so a covariance-adjusted Djk/sD or Djk/sk may be misleadingly over-adjusted; and (c) the measured covariates’ correlations with dependent variables are vulnerable to bias due to these covariates’ (like the Tjk contrast’s) correlation with still-unmeasured Population D variables (including inter-covariate interactions of all orders) that causally influence these dependent variables (for sophisticated discussions of ANCOVA see, e.g., Aiken & West, 1991; Bauer & Curran, 2005; Miller & Chapman, 2001). Matching all cases within each micro-contrast subgroup on the same configuration of covariate levels avoids the problem of (a); it does not adjust but rather partitions the treatment contrast, and so avoids (b); and as to (c), this maximum analysis at least reveals the association of the highest-order covariate interactions with dependent-variable distributions, whereas covariance adjustment per se does not.
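
A minimal sketch of the partitioning alternative described in the last sentence (Python with pandas; the column names, covariate levels, and outcomes are all hypothetical) groups cases by identical configurations of covariate levels and examines each micro-contrast’s own Tj minus Tk difference.

    # Minimal sketch, assuming pandas; the data frame is hypothetical.
    import pandas as pd

    cases = pd.DataFrame({
        "treatment": ["Tj", "Tk", "Tj", "Tk", "Tj", "Tk", "Tj", "Tk"],
        "severity":  ["high", "high", "high", "high", "low", "low", "low", "low"],
        "comorbid":  [True, True, False, False, True, True, False, False],
        "outcome":   [4.0, 6.0, 3.0, 3.5, 2.0, 2.2, 1.0, 1.1],
    })

    micro = (cases
             .groupby(["severity", "comorbid", "treatment"])["outcome"].mean()
             .unstack("treatment"))
    micro["Tj_minus_Tk"] = micro["Tj"] - micro["Tk"]
    print(micro)
    # Each row is one covariate-level configuration: the treatment contrast
    # is partitioned across configurations rather than covariance-adjusted.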
