
Critical review of the use of the Rorschach in European courts

Igor Areh, Fanny Verkampt & Alfred Allan

Abstract

The question of the admissibility of evidence obtained using projective personality tests arose in F v. Bevándorlási és Állampolgársági Hivatal (2018). The Court of Justice of the European Union has held that an expert’s report can only be accepted if it is based on the international scientific community’s standards, but has refrained from stipulating what these standards are. It appears timely for European psychologists to decide what standards should be applied to determine whether or not a test is appropriate for psycholegal use. We propose standards and then apply them to the Rorschach because it was used in this case and is an exemplar of projective tests. We conclude that the Rorschach does not meet the proposed standards and that psychologists should abstain from using it in legal proceedings even in the absence of a clear judicial prohibition.

The applicant in F v. Bevándorlási és Állampolgársági Hivatal (hereafter ‘F v. Hungary’, Citation2018) was a Nigerian male who applied for asylum in Hungary, claiming that he feared persecution in Nigeria because of his homosexuality. The Hungarian government (Government) asked a panel of psychologists to assess the veracity of the applicant’s claim that he was homosexual before granting him asylum. The panel could neither confirm nor deny the applicant’s sexual orientation, and the Government therefore denied his application for asylum on the basis that it could not establish his general credibility. The applicant took the matter to the Hungarian Court of Appeal, which asked the Court of Justice of the European Union (Court) to give a preliminary ruling regarding the use of an expert report by psychologists who had used three projective tests (Draw-A-Person-In-The-Rain, Rorschach and Szondi). The Court in its decision made two comments and one ruling that are of importance to this article. The first comment is that in circumstances such as in this case ‘it must be considered that consent of the person concerned […] is not freely given’ (F v. Hungary, Citation2018, para. 53) and that, under these circumstances, tests might be used ‘only if they are necessary and genuinely meet the objectives of […] the European Union or the need to protect the rights and freedoms of others’ (F v. Hungary, Citation2018, para. 55). It further commented that the use of projective tests (i.e. tests that use non-structured, unclear stimuli such as ink blots to induce responses; see Erickson et al., Citation2007) had been vigorously contested during the case. The Court made no decision regarding these points, saying that these matters would have to be decided by the relevant government and national court in each case. The Court did, however, rule that an expert’s report is only acceptable ‘if it is based on sufficiently reliable methods and principles in the light of the standards recognised by the international scientific community’ (F v. Hungary, Citation2018, para. 58). However, it did not state what these standards are or decide whether or not it was acceptable to use the three projective tests, saying that this was a decision that the national court had to consider on the facts.

Lawyers criticised the judges in F v. Hungary (Citation2018) for failing to provide guidance as to how authorities can decide whether or not a psychological test is a method that meets international standards, and they made negative comments about psychological tests in general (e.g. Ferreira & Venturi, Citation2018; A. Gould, Citation2019). These negative comments were mostly aimed at projective tests, but they reflect on all the tests that psychologists use – and indeed on the profession itself. We therefore believe that the decision of the court and the lawyers’ comments require European psychologists to publicly declare what standards they and lawyers should use to determine whether or not tests are ‘sufficiently reliable methods […] in the light of the standards recognised by the international scientific community’ (F v. Hungary, Citation2018, para. 58).

We will first discuss the relevant international standards and propose a set of contemporary standards that psychologists and lawyers can use to determine whether or not a test is sufficiently reliable for use in psycholegal work. We will then use these standards to critically consider the use of the Rorschach in court because (a) it is frequently used in some European countries (see Areh, Citation2020) and (b) it is an exemplar of projective tests, as well as being the best known, most thoroughly researched and arguably most frequently used of the three projective tests that were used in F v. Hungary (Citation2018).

International standards

The legal question that the Court faced in F v. Hungary (Citation2018) is whether or not the expert witnesses based their testimony on theories and techniques that fall within an area of expertise. Courts must generally answer this question when they decide whether or not to admit testimony based on emerging theories and techniques, and the different approaches used to do this are beyond the ambit of this article (for a discussion, see Freckelton, Citation2019). It is enough to say here that courts in many countries share the Court’s reluctance to set out the specific standards that should be used to determine whether or not a specific theory or technique can be accepted as an area of expertise (Freckelton, Citation2019). The United States (US) Supreme Court is, however, an exception to this general rule. In Daubert v. Merrell Dow Pharmaceuticals (Citation1993), it set out four requirements (the Daubert standards) that have been very influential – even in jurisdictions where courts have not explicitly adopted them (Freckelton, Citation2019).

The Daubert case

The plaintiffs in this case alleged that a drug called Bendectin had caused the birth defects of their children. The defendant called an expert who, after reviewing the extensive published scientific literature, concluded ‘that the maternal use of Bendectin has not been shown to be a risk factor for human birth defects’ (Daubert v. Merrell Dow, Citation1993, p. 582). The plaintiffs in response called eight expert witnesses who, based on novel animal studies and chemical analyses of the drug, concluded that Bendectin can cause birth defects. The Federal Court could not consider the opinion of these eight witnesses because it was bound by the rule laid down in Frye v. United States (Citation1923) ‘that expert opinion based on a scientific technique is inadmissible unless the technique is “generally accepted” as reliable in the relevant scientific community’ (Daubert v. Merrell Dow, Citation1993, p. 584). The Federal Court therefore declared that expert opinion based on a methodology that diverges ‘significantly from the procedures accepted by recognized authorities in the field […] cannot be shown to be generally accepted as a reliable technique’ (Daubert v. Merrell Dow, Citation1993, p. 584). The Supreme Court, however, held that the rule in the Frye case no longer applied and that judges must in each case assess the validity of scientific testimony (see Faigman, Citation1995). The Supreme Court, with reference to the work of Popper (Citation1989), asserted:

Many considerations will bear on the inquiry, including whether the theory or technique in question can be (and has been) tested, whether it has been subjected to peer review and publication, its known or potential error rate and the existence and maintenance of standards controlling its operation, and whether it has attracted widespread acceptance within a relevant scientific community. The inquiry is a flexible one, and its focus must be solely on principles and methodology, not on the conclusions that they generate. (Daubert v. Merrell Dow, Citation1993, p. 580)

In Kumho Tire Company v. Carmichael (Citation1999), the Supreme Court clarified that this so-called Daubert test applies to all expert testimony (see Beecher-Monas, Citation1998), and the standards of this test have been applied to psychology in the US (see United States v. Hall, Citation1996). Both lawyers (e.g. Beecher-Monas, Citation1998) and psychologists (e.g. Melton et al., Citation2018) have accepted that a psychological test – defined as ‘a set of items that has accepted levels of reliability and validity and allows measurement of some attribute of an individual’ (Australian Psychological Society, Citation2014, para. 7.1) – is a technique as envisaged in the Daubert case. The Daubert test is, however, very broad, and each discipline and profession should therefore consider how the test applies to it.

Testability

Psychologists generally interpret testability to mean that the theory upon which the psychological test is based must have been empirically examined and challenged (J. W. Gould et al., Citation2013). Lawyers such as Beecher-Monas (Citation1998) have observed that this standard might create a dilemma for those parts of psychology that rely mainly on retrospective observational studies rather than controlled experimentation. Nevertheless, as with all the other Daubert standards, testability is a necessary standard that is relevant to psychological testimony (see Bow et al., Citation2006).

Peer review and publication

Psychologists have pointed out that the second standard, namely whether or not the test ‘has been subjected to peer review and publication’ (Daubert v. Merrell Dow, Citation1993, p. 593), should be interpreted with caution given publication biases (see Rothstein et al., Citation2005). Scholars have pointed out that authors prefer to submit – and editors prefer to publish – positive results, thereby favouring the alternative hypothesis over the null hypothesis and causing a positive-results bias (Nosek et al., Citation2012). Consequently, the results of meta-analyses – which psychologists consider the most reliable method of assessing acceptance (Marlowe, Citation1995) – can be distorted, contributing to the survival of poor-quality theories (Ferguson & Heene, Citation2012).
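To make this mechanism concrete, the following minimal Python sketch (our illustration, not drawn from the cited studies; the true effect size, sample size and publication threshold are assumptions) simulates a literature in which only studies whose observed effect clears a significance-like threshold are published:

```python
import random
import statistics

random.seed(42)

TRUE_EFFECT = 0.10   # assumed small true standardised effect (Cohen's d)
N_PER_GROUP = 30     # assumed participants per group in each study
N_STUDIES = 2000     # number of simulated studies

def simulate_study_effect():
    """Return the observed standardised effect (d) of one two-group study."""
    treatment = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N_PER_GROUP)]
    control = [random.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    pooled_sd = statistics.pstdev(treatment + control)
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

effects = [simulate_study_effect() for _ in range(N_STUDIES)]

# Crude publication filter: keep only 'positive' studies whose observed d
# clears roughly the two-tailed p < .05 threshold for n = 30 per group.
published = [d for d in effects if d > 0.51]

print(f"Mean effect, all studies:      {statistics.mean(effects):.2f}")
print(f"Mean effect, published subset: {statistics.mean(published):.2f}")
```

Even though the assumed true effect is negligible, the pooled effect in the ‘published’ subset comes out several times larger, which is exactly the distortion that meta-analyses inherit when unpublished null results are missing.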

Error rate

Lawyers (e.g. Beecher-Monas, Citation1998) and psychologists (e.g. J. W. Gould et al., Citation2013) interpret the ‘known or potential rate of error’ (Daubert v. Merrell Dow, Citation1993, p. 594) as a reference to the psychometric properties of tests, such as their validity and reliability. Validity can take the form of construct validity (i.e. does the test measure the construct of interest, such as psychopathy?) or predictive or discriminative validity (i.e. how accurately does it predict the behaviour of interest, such as whether an offender will reoffend?). Reliability refers to measurement consistency, precision and repeatability, and the relevant coefficients indicate the degree to which test scores are free from error (Groth-Marnat & Wright, Citation2016).

General acceptance

The general acceptance of a psychological test can be determined by examining either the frequency of its use or its general acceptance as a credible test by the scientific community, as recorded in notable scientific publications (e.g. McCann & Evans, Citation2008). The test must therefore meet the profession’s requirements for valid and reliable psychological tests, which we discuss next.

Professional requirements

Professional bodies such as the American Educational Research Association (Citation2014), the International Test Commission (Citation2013) and the United States National Council on Measurement in Education (Citation2014) have developed standards for psychological tests in general. The uniqueness of assessments for courts (Melton et al., Citation2018), however, has prompted professional bodies representing psychologists who provide psycholegal services – like the American Psychological Association (APA, Citation2013) and the Australian Psychological Society (Citation2013) – to stipulate the standards that their members should adhere to when conducting assessments for the courts. The APA stresses that tests should have an ‘adequate scientific foundation’ (APA, Citation2013, Guideline 2.05) and measure ‘response style, voluntariness, and situational stress associated with the involvement in forensic or legal matters’ (APA, Citation2013, Guideline 10.02). Authors also contribute to the literature with papers exploring the importance of issues such as the standardisation of test administration (Lee et al., Citation2003).

Even before the Daubert case, psycholegal scholars considered what standards psychologists could use to justify their choice of tests for forensic assessment (e.g. Heilbrun, Citation1992). Many of their standards are similar to what the profession generally requires of psychological tests. Heilbrun (Citation1992) for instance required that tests should be commercially available and have manuals that provide standardised administration procedures, scoring instructions, population norms to assist with interpretation and finally information which demonstrates that they have acceptable levels of validity and reliability. Psychologists like Melton et al. (Citation2018) have, since the Daubert case, specifically considered its impact on psychologists’ choices of tests in psycholegal contexts.

Ethical requirements

Psychologists who do psycholegal work should further adhere to the ethical principles and standards of their profession (see Allan, Citation2013, Citation2018). The Meta-Code of Ethics of the European Federation of Psychologists’ Associations (EFPA, Citation2005) provides a useful guide in this regard. Paragraph 3.1.2 requires psychologists to collect only the information that they need for the professional tasks they are undertaking. Paragraph 3.2.3 reminds psychologists to consider the limits of the procedures that they use and the limits of the conclusions that can be drawn from data collected using such procedures, further requiring them ‘to practise within, and to be aware of the psychological community’s critical development of theories and methods’. Paragraph 3.3.1 reflects on psychologists’ obligation under the Responsibility principle not to bring the profession into disrepute (EFPA, Citation2005), which requires them to avoid behaviour that might weaken courts’, lawyers’ and the public’s perception of the profession as trustworthy (see Allan, Citation2013, Citation2018). Psychologists should also promote and maintain high standards of scientific and professional activity (EFPA, Citation2005, para. 3.3.2) while minimising reasonably foreseeable and preventable harm (EFPA, Citation2005, para. 3.3.3).

Possible standards

The four Daubert standards are so well established and recognised that it makes sense to use them as the basis of any proposed European standards for the use of psychological tests in psycholegal assessments. They do, however, need further explanation to make them useful to lawyers and psychologists – and they do not cover all the basic requirements found in the ethical codes and guidelines for psychologists. We believe that the standards identified can be consolidated under five headings as follows.

Theoretical basis with peer-reviewed support

Beecher-Monas (Citation1998) stated that the most important Daubert standard is that the test must have a sound and tested theoretical basis, as well as supporting data accessible in peer-reviewed journals. The requirement that any supporting research should have been published in reputable peer-reviewed journals serves to verify that the research has been scrutinised by objective and competent peers and that the data has been made available to those who need it, such as forensic practitioners and lawyers (Allan, Citation2020). Psychologists should control for publication bias when they evaluate the level of support in the literature for the theoretical basis of an instrument.

Validity

Psychologists use tests in the psycholegal context to measure constructs that are legally relevant in order to obtain information that can be used when preparing a psycholegal report and testimony (Grisso, Citation1986). Tests must therefore be valid in the sense that they measure the construct that the psychologist, and ultimately the court, wants to know about. Psychologists should, when possible, use tests that are standardised for the population to which the examinee belongs (Allan, Citation2013, Citation2018, Citation2020; Allan et al., Citation2019). The exact form of validity that is most important will depend on the legal issue in question (Heilbrun, Citation1992).

Construct validity, which subsumes all other types of validity (see Messick, Citation1995), is relevant if psychologists want to measure the presence of a specific construct, such as a mental disorder. Criterion (predictive) validity coefficients of psychological tests typically vary from .30 to .40, with values above .60 rare (Kaplan & Saccuzzo, Citation2017). This is poor when compared with medicine, where convergent validity values below .70 are generally considered indications of validity problems (see Post, Citation2016). Given the impact that legal decisions can have on people’s rights and interests (see Allan, Citation2018), psychologists working in the psycholegal context should therefore ideally use tests that have validity coefficients well above .60.
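As a simple arithmetical illustration of why coefficients in this range are weak (our worked example, using the standard coefficient-of-determination interpretation), squaring a validity coefficient gives the proportion of criterion variance the test actually explains:

```latex
\[
r = .30 \;\Rightarrow\; r^{2} = .09 \quad \text{(9\% of criterion variance explained)}
\]
\[
r = .60 \;\Rightarrow\; r^{2} = .36 \quad \text{(36\% of criterion variance explained)}
\]
```

On this reading, even the rare coefficients above .60 leave nearly two-thirds of the criterion variance unexplained.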

Discriminant and/or predictive validity is important when psychologists want to identify people who are likely to reoffend. Predictive validity is typically determined by calculating the area under the curve (AUC), a statistical index that represents the average difference in true positive and false positive rates across all possible cut-offs (Allan et al., Citation2006). The AUC can range from 0 to 1, with 1 suggesting perfect performance and .50 suggesting prediction no better than chance. Sjöstedt and Grann (Citation2002) considered AUC values of ≥ .60 but < .70 as marginal, ≥ .70 but < .80 as modest, ≥ .80 but < .90 as moderate and ≥ .90 as high. Forensic practitioners do not stipulate what they consider an acceptable AUC, but in the scientific literature .70 is generally considered minimally acceptable (Rice & Harris, Citation2005; Steyerberg, Citation2009; Szmukler et al., Citation2012). It is not possible to recommend a generalised cut-off score for all kinds of validity measures; psychologists should instead thoroughly consider which type of validity is relevant in each specific case and choose only tests that have the highest validity values.
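As an illustration (our sketch, with invented scores, using an equivalent rank-based formulation of the AUC), the index can be computed as the probability that a randomly chosen reoffender scores higher on the instrument than a randomly chosen non-reoffender, with ties counted as half:

```python
from itertools import product

# Hypothetical risk-tool scores (invented for illustration)
reoffenders = [7, 5, 6, 9]
non_reoffenders = [3, 5, 2, 4]

def auc(positives, negatives):
    """Probability that a random positive outscores a random negative."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(positives, negatives))
    return wins / (len(positives) * len(negatives))

print(f"AUC = {auc(reoffenders, non_reoffenders):.2f}")  # 0.97 for these scores
```

Under the thresholds above, a value like this would count as high, whereas an instrument whose score distributions for reoffenders and non-reoffenders largely overlap would hover near the chance level of .50.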

Psychologists should only collect data that are relevant to the professional tasks that they are undertaking (EFPA, Citation2005, para. 3.1.2); the incremental validity of tests is therefore important. This form of validity requires that a test produce accurate information beyond what can be obtained more easily and cheaply with other methods (see Haynes & Lench, Citation2003; Hunsley & Meyer, Citation2003). Compared to other contexts, incremental validity is arguably more important in psycholegal contexts, where legal and practical constraints usually compel psychologists to collect as little data as possible in the most cost-efficient manner.

It is inevitable that many people who are tested in high-stakes situations will try to manipulate the outcome of an assessment (see Anglim et al., Citation2018). Examinees might therefore exaggerate or fabricate psychological symptoms for external gain, such as to evade criminal responsibility or attain financial compensation. The exact degree of malingering is unknown, but Mittenberg et al. (Citation2002) have estimated that 19% of examinees in criminal cases exaggerate their symptoms. Cartwright and Roach (Citation2016) furthermore found that 25.4% of their participants admitted having done so after a road traffic accident. Tests used in a psycholegal context should therefore ideally provide a method of checking response style to detect manipulation (see Paulson et al., Citation2019).

Reliability

Reliability in law generally refers to a legal threshold for the admissibility of evidence (see Edmond, Citation2012; Freckelton, Citation2019), but we follow the Daubert standards by using the word in the scientific sense (see Popper, Citation1989). The indices of reliability that are relevant in a specific case will depend upon the test and the circumstances. In the psycholegal context, it will generally be test-retest reliability for measuring trait variables and interrater reliability for tests in which professional judgement plays a significant part in combining and interpreting data. Historically, psychologists have considered reliability coefficients above .60 as acceptable in clinical work (see Fleiss et al., Citation2003), but contemporary authors suggest that reliability should be ‘around .90 for clinical decision-making and around .70 for research purposes’ (Groth-Marnat & Wright, Citation2016, p. 10). Court decisions, however, have major consequences for litigants and therefore it is important to minimise the subjective biases of individual forensic assessors, which suggests a minimum interrater coefficient above .80 (see Heilbrun, Citation1992; Lilienfeld et al., Citation2000) and ideally above .90, with Nunnally and Bernstein (Citation1994) recommending .95.
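For the interrater case, a minimal sketch (our illustration; the ratings are invented, and Cohen’s kappa is only one of several interrater indices in use) shows how chance-corrected agreement between two scorers can be computed:

```python
from collections import Counter

# Two raters' codings of the same ten protocols (invented data)
rater_a = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
rater_b = ["pos", "pos", "neg", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters on categorical codes."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected chance agreement: sum over categories of the raters'
    # marginal proportions multiplied together
    expected = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / n ** 2
    return (observed - expected) / (1 - expected)

print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # 0.60 for these ratings
```

A value of .60 would fall well short of the .80 minimum suggested above for forensic decision-making.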

Availability of test and support documentation

Heilbrun (Citation1992) requires that tests should be freely available to other psychologists and lawyers involved in the case. There are two reasons for this requirement. First, all expert witnesses must have access to the relevant test and supporting material to prepare to testify (see Pownall v. Conlan Management, Citation1995; R v. Turner, Citation1975). Second, lawyers need access to these materials in order to prepare to cross-examine and evaluate the testimony of expert witnesses. Today this could mean that tests and their supporting documentation should be available commercially or online in official repositories. The supporting information could be in a manual, book or secure website and must inform readers about the test’s theoretical basis and development, references to published peer-reviewed research supporting it and its psychometric properties. The manual should also provide information regarding the standardised administration and scoring of the test, and how to use the available information and population norms to assist with interpretation (Australian Psychological Society, Citation2014, para. 7.1).

Ethical considerations

The ethical issues that might arise will depend on the test and the circumstances in which it is used, and we therefore add other relevant ethical concerns to our standards. There are some requirements that will be relevant in almost all cases, and the first of these is that there must be evidence that contemporary psychologists accept the test as representing best practice (also see the general acceptance standard in the Daubert case). This requirement has two tiers in that the test must be accepted not only by the general profession for general practice but also by psychologists who carry out forensic work. It is important to consider the views of those who undertake psycholegal work because they have the best understanding of the demands of courts and the legal rules that regulate the admission of psychological evidence.

Psychologists who carry out psycholegal work should give priority to tests that provide relevant and reliable data, and consider whether or not using the tests poses a risk of harm to litigants’ rights and interests (see Allan, Citation2013, Citation2018). Psychologists should also consider the developers’ and users’ governance of the tests (EFPA, Citation2005, para. 3.2.2; Mustac v. Medical Board of Western Australia, Citation2007) and of important data related to them – for instance, whether or not potential examinees can easily access information that will assist them in manipulating test results. This is becoming increasingly important as it becomes easier to place such information on websites (Cartwright et al., Citation2019). Finally, psychologists should avoid any practice that might bring the profession into disrepute, a risk that is much higher when using tests in psycholegal work because such work generally takes place in the public eye and can attract publicity.

Assessment of the Rorschach

Since Hermann Rorschach published the test named after him in his book Psychodiagnostik (Rorschach, Citation1921/1951), many authors have written about its strange origins and development (e.g. Krishnamurthy et al., Citation2011; Meyer, Citation2017; Schott, Citation2014). Many believe that this projective personality test is more effective than its structured self-report counterparts (e.g. Finn, Citation1996) because it goes beyond conscious and behavioural functioning (e.g. Stenius et al., Citation2018). Rorschach supporters believe that it circumvents examinees’ conscious defences because they respond to ambiguous stimuli with a minimum of instructions (see Siipola & Taylor, Citation1952). They therefore believe that it provides them information about examinees’ automatic processes and unconscious, structural and longitudinal functioning (see Weiner et al., Citation1996). Furthermore, they claim it can detect (see Cerney, Citation1990) and assess the authenticity of (see Leavitt & Labott, Citation1996) recovered repressed traumatic memories about abuse (however, for a critique see Otgaar et al., Citation2019).

If the assumption about the Rorschach’s ability to circumvent examinees’ conscious defences is correct then the Court’s comment in the case of F v. Hungary (Citation2018), as mentioned in the introduction, needs to be considered in every forensic assessment when projective tests are used. Namely, the Court stated, ‘it must be considered that consent of the person concerned […] is not freely given’ (F v. Hungary, Citation2018, para. 53) and that tests might be used ‘only if they are necessary and genuinely meet the objectives of […] the European Union or the need to protect the rights and freedoms of others’ (F v. Hungary, Citation2018, para. 55). Furthermore, the assumption also raises the question as to whether or not the use of the Rorschach leads to violations of the right to avoid self-incrimination and the right to remain silent when questioned, either prior to or during legal proceedings in a court of law.

The Rorschach is not a unitary test (Exner, Citation1969) because there are at least eight well-known systems that differ notably due to being developed by people with different theoretical and professional backgrounds (Groth-Marnat & Wright, Citation2016). Examiners can therefore administer, code and interpret the Rorschach in different ways (Bornstein & Masling, Citation2005). Exner (Citation1974, Citation2000, Citation2008) tried to overcome this confusion by developing the Rorschach Comprehensive System (hereafter Comprehensive System). This is the most frequently used version of the Rorschach and it provides normative information for non-patient adults and children, coupled with statistical tables for some clinical groups (Exner, Citation2008). Meyer et al. (Citation2007) further published international norms for the Rorschach known as the Composite International Reference Values (CIRV). The Rorschach has always been problematic (e.g. S. J. Beck, Citation1937), controversial (Dawes, Citation1996) and the subject of many professional debates (e.g. Hibbard, Citation2003; Mihura et al., Citation2015; Wood et al., Citation2015). The development of the Rorschach continues (e.g. Gurley, Citation2017), but despite a century’s extensive research, development and use, it remains divisive (Groth-Marnat & Wright, Citation2016).

Test must have a theoretical basis and supporting empirical data that have been peer reviewed and published in scientific journals

The projective hypothesis was the original theoretical assumption behind the Rorschach (Rorschach, Citation1921/1951). At first, it was based on Freud’s (Citation1911) theory that people unconsciously assign their characteristics and impulses to others as defence mechanisms. Nowadays the theoretical underpinning for projective tests such as the Rorschach is the assimilative projective assumption (see Sundberg, Citation1977). The premise here is that people’s understanding, interpretation and explanation of vague and unstructured test stimuli reflect their internal constructs, such as their feelings, needs, personal experiences, thought processes, conflicts and impulses. However, empirical evidence shows that implicit and explicit processes might interact to determine task performance on implicit tasks (Dunn & Kirsner, Citation1989). People’s expectations (see Sherman et al., Citation2016) and the social context within which the testing takes place (see Otten et al., Citation2017) might therefore influence examinees’ responses.

There is also little theoretical support for the inferences that examiners draw from various aspects of the cards, such as the link between colour and personality. For instance, psychologists associate attention to red stimuli with anger, impaired performance on cognitive tasks, sexualised behaviour, increased aggressiveness, dominance, caution and avoidance behaviour (see Barchard et al., Citation2017; Tham et al., Citation2020). It is, however, possible that examinees’ responses reflect learned associations rather than their personalities, because it is known that emotion-colour associations vary across cultures (e.g. Hupka et al., Citation1997).

The Rorschach meets the peer-review requirement because it has attracted much scholarly interest since its publication. A search for the term ‘Rorschach’ on the PsycINFO© database reveals 6679 peer-reviewed publications between 1921 and 2019, with about 52% of them published from 1980 onwards – an average of 86.02 peer-reviewed manuscripts per year from 1980 to 2019. This amount of research has led to several meta-analyses, which mostly confirmed the effectiveness of the Rorschach (e.g. Parker et al., Citation1988). More recently, however, the results of these meta-analyses appear to have been affected by publication bias and serious methodological flaws, leading to incorrect and inflated effect sizes (see Erickson et al., Citation2007; Hunsley & Bailey, Citation1999; Lilienfeld et al., Citation2000; Wood et al., Citation2015). Researchers who want to examine the published studies complain that they frequently cannot obtain the relevant quantitative material except by purchasing raw data from the original researchers (Wood et al., Citation1996), and in at least one case a researcher’s request to examine data was refused even though payment was offered (see Garb et al., Citation2020). Other researchers have criticised the methodology of several key studies and pointed out that half of the studies with positive findings were unpublished and, therefore, had not been peer reviewed (e.g. Lilienfeld et al., Citation2000). Lilienfeld et al. (Citation2000) concluded that despite the volume of research there is a lack of peer-reviewed research that examines the empirical foundation of the Rorschach in general and the Comprehensive System in particular.

Validity of test must be appropriate for the legal question

The organic development of the Rorschach means that it does not have a specific standardisation population, and even supporters of the Rorschach such as Meyer et al. (Citation2015) acknowledge that developers of norms have not given enough attention to the possible effects of the age, gender, culture, education, intelligence and linguistic and socioeconomic background of examinees, and that some of the findings conflict. For example, Giromini et al. (Citation2017) found that demographic variables do not influence Rorschach scores, whereas others found significant effects (Delavari et al., Citation2013; Meyer et al., Citation2007).

Various features of the Rorschach could influence its validity in cultural and language groups. Emotion–colour associations, for instance, vary across cultures, with envy associated with black, purple and yellow in Russia and with black, red and green in the US (see Hupka et al., Citation1997). French examinees often see a chameleon in card VIII, Scandinavian examinees often see Christmas elves in card II and Japanese examinees often provide a musical instrument-related answer to card VI. All of these answers are unusual responses according to the Comprehensive System (Weiner, Citation2014). The attempt by Meyer et al. (Citation2007) to overcome these problems is welcome, but the norms have been criticised on several grounds (see Gurley, Citation2017; Meyer et al., Citation2017).

Several meta-analyses have compared the Rorschach’s construct validity with that of the Minnesota Multiphasic Personality Inventory (MMPI; e.g. Garb et al., Citation1998; Hiller et al., Citation1999). These studies have typically shown that the Rorschach’s variables or indexes have a low average validity of about .30, and further that some variables do not correlate with corresponding MMPI variables (see Krishnamurthy et al., Citation1996; Lilienfeld et al., Citation2000). However, due to the publication bias and methodological artefacts observed in these meta-analyses, the .30 value may be overestimated (see Lilienfeld et al., Citation2000). Researchers have failed to find evidence that the Rorschach can identify a specific psychological problem (e.g. Bartell & Solanto, Citation1995; Kaplan & Saccuzzo, Citation2017).

When Comprehensive System variables were correlated with externally (e.g. psychiatric diagnosis) and introspectively (e.g. self-report questionnaires) assessed criteria, the mean validities were .27 and .08, respectively (Mihura et al., Citation2013). Mihura et al. (Citation2013) found minor support for the validity of Comprehensive System indices such as Severe Perceptual-Thinking, Suicide Constellation and Cognitive Mediation (at least .33), but little or no support for indices like Aggressive Movement, Egocentricity and Coping Style. Wood et al. (Citation2015) replicated the Mihura et al. (Citation2013) study but included the data from unpublished studies to counter the impact of publication bias. They found that several Comprehensive System scores based on complexity/synthesis and productivity differentiate between normal and affected populations. They, however, found no evidence supporting a relationship between Comprehensive System indexes and non-cognitive characteristics such as negative affect and emotionality. Lilienfeld et al. (Citation2000) have further reported that the correlations between the Comprehensive System scores and most psychiatric diagnoses were not replicated in later studies.

The validity of the Rorschach overall is not good (Kaplan & Saccuzzo, Citation2017). It appears to be useful in diagnosing bipolar disorder, schizophrenia and schizotypal personality disorder (Wood et al., Citation2000). It is, however, less useful in diagnosing post-traumatic stress disorder (PTSD) and other anxiety disorders, major depressive disorder, suicide attempts, dissociative identity disorder, psychopathy, antisocial personality disorder and dependent, narcissistic (see Exner, Citation1995) or conduct disorders (see Carlson et al., Citation1997; Hunsley et al., Citation2015; Wood et al., Citation2015). Even examiners who use the Comprehensive System’s norms can over-pathologise people (Cooke & Norris, Citation2011; Costello, Citation1998), especially those from lower socio-economic groups (Frank, Citation1994) and children (Hamel et al., Citation2000). Research also suggests that the Rorschach is not a valid instrument for assessing impulsiveness, criminal behaviour or tendency toward violence, or for detecting child sexual abuse (Lilienfeld et al., Citation2001). Suggestions that references by examinees to buttocks, anuses, feminine clothing and sex organs in their responses predict homosexuality (e.g. Seitz et al., Citation1974) have never been supported by any scientific evidence (see APA, Citation2009).

The incremental validity of projective techniques is generally poor (Garb et al., Citation2003; Miller & Nickerson, Citation2006; Ruscio, Citation2000). This is also true for the Rorschach, with little evidence that most Comprehensive System scores provide information beyond the data gathered by other psychological instruments (Wood et al., Citation2015).

The Rorschach’s lack of a method for checking response style is not a concern for most of its supporters because they believe it is immune to attempts to fake good or bad responses (for references, see Acklin, Citation2007). For others, the Rorschach is at best relatively resistant (Grossman et al., Citation2002) or at worst vulnerable to faking attempts that might not be detected through the Rorschach indices (e.g. Elhai et al., Citation2004; Hartmann & Hartmann, Citation2014). The Rorschach performances of malingerers have, for example, been compared to those of patients with schizophrenia (see Albert et al., Citation1980), psychosis (Albert et al., Citation1980), depression (Meisner, Citation1984) and PTSD (Frueh & Kinder, Citation1994). Overall, the results show that only a few Rorschach variables differ significantly between participants who malingered and those in the control groups (see Frueh & Kinder, Citation1994). In addition, the differences found between role-informed malingerers and the control and patient groups in some studies (e.g. Frueh & Kinder, Citation1994; Overton, Citation1984) could be explained by prior training, knowledge about the pathology and information about how to behave during administration of the test (Albert et al., Citation1980; Overton, Citation1984). Albert et al. (Citation1980) furthermore found that fellows of the Society for Personality Assessment, who could be considered Rorschach experts, could accurately classify only 9% of the informed fakers in the malingerer group (see also Sewell & Helle, Citation2018).

A major problem with the Rorschach is that interpreters put weight on details such as the area or location of the inkblot that examinees focus on when they respond. There are, however, many unknown internal and contextual factors that could influence examinees’ perceptions and therefore their responses. For instance, mood (see Matlin, Citation2012) and affect (e.g. Anderson et al., Citation2012; Kleckner et al., Citation2017) influence people’s perception of stimuli and their judgements of the actual content of perceptions. This phenomenon of selective and context-dependent focus is well known because it is also evident in psychotherapy (Matt et al., Citation1992), eyewitness memory recall (Loftus, Citation2004) and everyday situations (see Loeffler et al., Citation2013). More specifically, Kingery (Citation2003) found a positive correlation between dispositional negative affect and the selection of specific parts of the inkblots. Any deduction based on the locations that examinees focus on therefore carries little weight without research that excludes the possibility that factors such as situationally induced mood, rather than some underlying personality factor, influenced the relevant response.

Reliability must be appropriate for the purpose for which test will be used

A problem with the Rorschach as alluded to earlier is that there are several methods of administering it and of coding and interpreting the results (see Bornstein & Masling, Citation2005). The administration is in many approaches deliberately vague (e.g. Siipola & Taylor, Citation1952) in order to allow examinees to respond freely to the ambiguous stimuli. This is problematic because researchers have found that differences in administration can significantly influence examinees’ responses (e.g. Lis et al., Citation2007) and that it can involve something simple, such as whether the cards are presented vertically or horizontally (e.g. D. M. Beck et al., Citation2005). The outcomes of the different scoring systems can also lead to significantly different interpretations (Kaplan & Saccuzzo, Citation2017).

Rorschach supporters have traditionally relied on experience and/or training, but neither can substitute for clear, structured and empirically supported instructions (e.g. President’s Council of Advisors on Science and Technology, Citation2016). The best method for minimising error is the scientific method and not clinical experience (Garb et al., Citation2016), as it is examiners’ subjectivity that influences the interpretation of data sets (e.g. Dimsdale, Citation2015). Harrower (Citation1976) claimed to have shown this by asking 10 examiners to analyse and identify 17 anonymised Rorschach records of Nazi war criminals that were mixed with non-Nazi ones. The examiners performed no better than chance and were thus unable to discriminate the Nazi war criminals’ Rorschach protocols from the non-Nazi ones. A limitation of this study is that the examiners did not have the background information that they would generally have in a clinical or psycholegal context (Groth-Marnat & Wright, Citation2016). While trying to reduce the impact of examiners’ subjectivity in the coding process by providing working tables, Exner (Citation2008) conceded that it would be naïve to expect objective coding (see also Acklin et al., Citation2000).

Research conducted in the 1980s and 1990s therefore predictably yielded contradictory and confusing results, but Parker’s (Citation1983) meta-analysis of 39 research papers nevertheless found an overall internal reliability coefficient of .83. Garb et al. (Citation1998) criticised Parker’s method, but methodologically sound research has produced compatible findings, with median test-retest coefficients of .80 reported (e.g. Acklin et al., Citation2000). The interrater reliability of the various Comprehensive System variables is even slightly better, with Exner (Citation2003) reporting that the interrater scoring reliability of all variables was above .85 and Meyer’s (Citation1997) meta-analysis of 16 studies finding interrater reliability coefficients ranging from .72 to .96. Other researchers, however, have found lower coefficients (e.g. W. Perry et al., Citation1995; Wood et al., Citation1996) and criticised Meyer’s study as flawed. Acklin et al. (Citation2000) found that half of the Comprehensive System variables attained reliability coefficients of .85 or above, whilst some of the others, such as the Schizophrenia Index (SCZI), were as low as .45, with a median retest coefficient of .80. A later study by Sultan et al. (Citation2006), who retested 75 French non-patient adults after 3 months on the 47 Comprehensive System variables, found that 9 coefficients were above .70 and 21 yielded moderate reliability of above .50, while the median correlation coefficient was .53.

Availability of test and supporting documentation

The Rorschach is commercially available to examiners, and each of the systems gives administration, scoring and coding instructions (see Bornstein & Masling, Citation2005). It serves little purpose to examine the supporting documentation of each system, but Exner’s (Citation2008) workbook for the extensively used Comprehensive System is arguably the most informative. It provides a standardised method of administration, scoring and interpretation, as well as information about the test’s psychometric properties and population norms. Several researchers have nevertheless criticised the composition of the normative sample (Hibbard, Citation2003; Viglione & Giromini, Citation2016).

Ethical requirements

The first ethical question regarding the Rorschach is whether or not it is generally accepted by psychologists as representing best practice and as being appropriate for psycholegal practice. It is difficult to give a definite answer because it is one of the most divisive tests used in psychology, with many supporters and detractors (Groth-Marnat & Wright, Citation2016). In making an award for distinguished professional contributions to John E. Exner, the APA (Citation1998) described the Rorschach as probably the most powerful psychometric tool ever invented, and many others share the belief that it is the equivalent of a psychological X-ray test (e.g. Brown, Citation1992; Piotrowski, Citation1980). Others dismiss it as a pseudoscience (e.g. Fontan & Andronikof, Citation2017; Wood et al., Citation2008) and criticise its use in clinical (e.g. Huberty, Citation2019; Hunsley & Bailey, Citation2001; Lilienfeld, Citation2015) and forensic (e.g. Garb, Citation1999; Wood et al., Citation2001) settings because of its poor psychometric characteristics. Surveys of forensic practitioners in North America have nevertheless revealed that the Rorschach was the most frequently used unstructured projective test, with up to 36% of research participants reporting that they had used it (see Archer et al., Citation2006). More recent North American research, however, shows that the frequency of Rorschach use has dropped to 20% (Viljoen et al., Citation2010) or even to 3% (Neal & Grisso, Citation2014). The acceptance of the Rorschach amongst forensic psychologists in other countries might never have been good. Martin et al. (Citation2001), for instance, found in their survey that Australian psychologists doing court work used the Rorschach less frequently than other tests; it had a weighted use score of only 23 compared to the 141 score of the Wechsler Intelligence Scale and the 98 of the Minnesota Multiphasic Personality Inventory-2 (MMPI-2).

The limitations of the Rorschach discussed above further raise questions about how confident psychologists who use it can be that the data they provide to courts are objective, relevant and reliable (see Erickson et al., Citation2007; Iudici et al., Citation2015; Neal et al., Citation2019). They might therefore find it difficult to demonstrate that they were taking reasonable steps to prevent reasonably foreseeable and preventable harm such as was present in F v. Hungary (Citation2018), where the applicant ran the risk of being returned to a country where he could have been persecuted.

The governance of the Rorschach is also in question because there is so much information available about it in publications and the media. There are over 20 million references associated with the term ‘Rorschach’ on the Internet, and it has its own English Wikipedia page (Wikipedia, Citation2021), with descriptions of the most common responses extracted from Weiner (Citation2003) and Weiner and Greene (Citation2007). Schultz and Loving (Citation2012) concluded that 19% of the Rorschach-related information they found on the Internet posed a direct threat to the security of the test. Schultz and Brabender (Citation2013) assessed the influence of the Rorschach Wikipedia page’s content on the protocols of non-patients who were instructed to act as parents who want to keep their children and thus to fake good. They showed that, relative to an uninformed (control) group, well-educated non-patients benefited from exposure to the Wikipedia content. Researchers were, however, not able to replicate these findings with psychiatric outpatients (Hartmann & Hartmann, Citation2014) or incarcerated violent offenders (Nørbech et al., Citation2016). Patients in the informed group failed to present as mentally well adjusted on the Rorschach, but they were able to inhibit provocative, aggressive and dramatic responses relative to other patient groups (Hartmann & Hartmann, Citation2014).

The use of the Rorschach in court further weakens the credibility of the profession and undermines courts’, lawyers’ and the public’s perception of the profession as being trustworthy. This is demonstrated by remarks made by the judges in F v. Hungary (Citation2018), lawyers who commented on the case (e.g. Ferreira & Venturi, Citation2018) and journalists from reputable outlets such as the New York Times (Goode, Citation2001), New Scientist (Wilson, Citation2020) and BBC News (‘What’s behind the Rorschach inkblot test?’, Citation2012).

A feature of the administration of the Rorschach which requires consideration is that examiners often repeat requests for extensive answers, and may make negative inferences if examinees fail to provide enough information (Exner, Citation2008). There is a risk that examinees in forensic contexts will therefore feel coerced by examiners, whom they perceive as authority figures. It is notable that the judges in F v. Hungary (Citation2018) found that consent for psycholegal assessment in that specific type of case can never be voluntary.

Discussion

F v. Hungary (Citation2018) is a reminder to psychologists who do psycholegal work of the impact their reports can have on the rights and interests of people and that lawyers and courts will therefore approach their reports and testimony critically. Psychologists should consequently ensure that they use tests which can withstand the scrutiny of both their peers and lawyers. Some might therefore regret that the judges in F v. Hungary (Citation2018) did not set out clear standards that can be used to consider the credibility of a test – but it does nevertheless provide the profession with an opportunity to formulate and publish a set of generally accepted standards that they and lawyers can refer to when assessing the acceptability of a test. We believe that most psychologists who undertake forensic work might consciously or non-consciously apply the standards that we have identified when deciding whether or not to use a specific test to collect data. We deliberately did not add details such as what would be an acceptable test-retest correlation coefficient for two reasons. First, it would be more appropriate to do so after a debate by psycholegal professionals. Second, it is always at a court’s discretion whether or not it allows testimony, and courts will therefore not be bound by our standards or by any details that the profession adds. Courts and lawyers will nevertheless most likely consider standards and guidelines that represent psychologists’ general views.

The proposed standards are primarily aimed at assessing the appropriateness of emerging theories, techniques and tests in court, which makes their application to a well-established and well-researched test such as the Rorschach more difficult but nevertheless not impossible. A major concern regarding the Rorschach is that it lacks a theoretical basis supported by empirical data (e.g. Krishnamurthy et al., Citation2011) and that it is based on subjective common-sense thinking rather than on scientific research. The strategies of Exner (Citation2008) and others to develop norms and train examiners are no substitute for an empirically tested theory (President’s Council of Advisors on Science and Technology, Citation2016) and are contrary to the contemporary emphasis on evidence-based methods in assessments (see Stewart & Chambless, Citation2007).

The Comprehensive System’s construct validity is another major problem because a test does not have a place in psycholegal work if it does not measure what it is supposed to measure. Groth-Marnat and Wright (Citation2016) concluded that the overall validity of the Rorschach is moderate (between .30 and .50). It is difficult to draw a conclusive finding about the validity of many Comprehensive System indices because some have not yet been studied well, and the accuracy of the studies that have examined the validity of other indices is contested (Garb et al., Citation2005). An expert witness will therefore find it difficult to justify using even the Comprehensive System, especially given that it lacks an acceptable method of measuring response style and that research data show that the Rorschach is susceptible to faking (e.g. Perry & Kinder, Citation1990).

The test-retest reliability coefficients of the Rorschach indices, which range from .27 to .94 (see Meyer, Citation1997), limit its application in forensic work where the stakes for people are high, and this could explain why Rorschach experts disagree when they give forensic opinions (see Guarnera et al., Citation2017). The Rorschach meets the availability standard, and Exner’s (Citation2008) book provides most of the information required regarding the Comprehensive System.

A major hurdle for anybody who plans to use the Rorschach in court is, however, whether or not it would be ethical to do so. Our assessment is that there is no general acceptance amongst forensic psychologists internationally that it is a test that should be used in court (e.g. Archer et al., Citation2016). Psychologists who want to use the Rorschach test should tell courts that it is highly controversial and not widely accepted (Lilienfeld et al., Citation2000), but this is unlikely to happen.

The evidence we have reviewed further suggests that the Rorschach does not provide objective, relevant and reliable data, and therefore that psychologists who use it cannot exclude the possibility that their reports and testimony will cause reasonably foreseeable and avoidable harm. The governance of the test is also poor, which is particularly concerning given that it lacks a method of measuring response style. Psychologists’ use of the Rorschach in court further appears to damage the credibility of the whole profession and might constitute a disproportionate violation of examinees’ autonomy. Our overall conclusion is therefore that – despite its popularity in some European countries (see Areh, Citation2020) – the Rorschach does not meet the standards which we believe the international scientific community requires of tests used in psycholegal work, and consequently that psychologists should not use it in court (but for a contrary view, see Weiner et al., Citation1996).

In conclusion, psychologists should only use tests whose reliability and validity have been demonstrated rigorously. We expect that some other commonly used techniques and tests might also be found wanting if the standards set out above are applied to them. We do not think that these standards are unnecessarily arduous because if psychologists expect the public and more specifically legal professionals to trust them and their testimony, they must ensure that they use tests and techniques that are trustworthy and that do not violate the human rights of the persons assessed. In the case of F v. Hungary (Citation2018), the Court of Justice of the European Union ruled that an expert’s report is only acceptable if it is based on sufficiently reliable methods which meet standards recognised by the international scientific community. The Court did not convey what these standards are or decide whether or not projective tests are acceptable. This might be due to the Court’s trust in psychology as a profession, and other sciences, to fulfil its expectations. Should the profession fail to do so, a court could set standards for psychologists or it might decide that psychological expert opinions are not reliable enough to be received as scientific evidence.

Ethical standards

Declaration of conflicts of interest

Igor Areh has declared no conflict of interests.

Fanny Verkampt has declared no conflict of interests.

Alfred Allan has declared no conflict of interests.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Acknowledgements

The authors thank Dr Maria Allan for her comments on drafts of the paper.

References