Abstract
This article describes two separate but related studies that provide insight into the effectiveness of e-rater score calibration methods based on different distributional targets. In the first study, we developed and evaluated a new type of e-rater scoring model that is cost-effective and applicable when human ratings are unavailable and candidate volumes are small. This new model type, called the Scale Midpoint Model, outperformed an existing e-rater scoring model that certain e-rater system users often adopt without modification. In the second study, we examined the impact of three distributional score calibration approaches on the performance of existing models. These approaches were percentile calibrations of e-rater scores against a human rating distribution, a normal distribution, and a uniform distribution. Results indicated that these score calibration approaches did not have an overall positive effect on the performance of existing e-rater scoring models.
Keywords:
Acknowledgments
Any opinions expressed in the article are those of the authors and not necessarily those of Educational Testing Service.
The authors would like to thank Brent Bridgeman and Alina von Davier for providing their expertise in automated scoring, educational measurement, and statistics.