Abstract
I studied rater effects in the writing and speaking sections of the Test of German as a Foreign Language (TestDaF). Using many-facet Rasch measurement, I examined rater main effects as well as two- and three-way interactions between raters and the other facets involved, that is, examinees, rating criteria (in the writing section), and tasks (in the speaking section). A further goal was to investigate differential rater functioning related to examinee gender. Results showed that raters (a) differed strongly in the severity with which they rated examinees; (b) were fairly consistent in their overall ratings; (c) were substantially less consistent in relation to rating criteria (or speaking tasks, respectively) than in relation to examinees; and (d) as a group, were not subject to gender bias. These findings have implications for quality control and assurance in the TestDaF rater-mediated assessment system.
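The many-facet Rasch model underlying analyses of this kind can be sketched as follows. This is the standard rating-scale formulation (Linacre's many-facet Rasch model), not an equation quoted from the study itself; the symbols are illustrative:

```latex
% Log-odds of examinee n receiving category k rather than k-1
% from rater j on criterion (or task) i:
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \beta_i - \alpha_j - \tau_k
% \theta_n : proficiency of examinee n
% \beta_i  : difficulty of rating criterion or task i
% \alpha_j : severity of rater j
% \tau_k  : threshold for rating category k
```

Rater main effects correspond to the severity parameters \(\alpha_j\); interaction (bias) analyses test whether a rater's severity shifts systematically for particular examinees, criteria, or tasks beyond what this additive model predicts.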