Abstract
A shortened version of the German adaptation of the morningness–eveningness scale of Horne and Östberg is analysed in a large sample of 994 physicians with respect to dimensionality, reliability, gender differences and validity. The psychometric analysis – which incorporates a highly robust method for checking unidimensionality – reveals deviations from unidimensionality and identifies three misfitting items. In addition, hypothesis testing indicates the presence of differential item functioning (DIF) with respect to gender, which may be caused by differences in response formats. Although reliability estimates are satisfactory, an overall lack of adequate psychometric properties of the scale within the population of physicians has to be reported. We derive suggestions for improving the original morningness–eveningness questionnaire (MEQ) scale and provide general comments on how to check for unidimensionality without imposing a restrictive response model.
ACKNOWLEDGEMENTS
We thank Dr. Ralf Wegner, ZfAM, for providing the data set of the study for this analysis. He, Astrid Richter and Peya Kostova recruited the participating physicians and assessed the data.
DECLARATION OF INTEREST
The authors report no conflicts of interest and are alone responsible for the content of the paper.
Notes
1 In studies which employ an ad hoc recruitment of subjects, the validity of the statistical results is jeopardized because the probabilities of selecting units into the sample are unknown.
2 As the study collected a rich variety of data on, e.g. family background, work ability (WAI), burnout symptoms and working conditions, it was necessary to use a shortened version in order to ensure that participants could complete the survey within a reasonable time frame.
3 For the sake of completeness we note that α = 0.76 (Cronbach's alpha) if the original scoring proposed by Horne & Östberg (1976) is used.
4 For a precise description of how results of item response theory are linked to models of classical test theory, we refer the reader to Holland & Hoskens (2003).
5 This is why we did not report violations that were based on a sample size of less than 30.
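The coefficient reported in footnote 3 follows the standard formula α = k/(k−1) · (1 − Σ σ²ᵢ / σ²ₜ). A minimal sketch of this computation (the study data are not available here, so any input matrix is purely illustrative):

```python
import numpy as np

def cronbach_alpha(items) -> float:
    """Cronbach's alpha for an (n_subjects, n_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                              # number of items
    item_vars = items.var(axis=0, ddof=1).sum()     # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)       # variance of total score
    return k / (k - 1) * (1 - item_vars / total_var)
```

For perfectly parallel items (e.g. two identical columns) the function returns 1.0; values near 0.76, as in the footnote, indicate satisfactory but not high internal consistency.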
6 We did not collapse the score of the three subscales into a single index – as the scores refer to different latent factors. Moreover, testing each of the three subscales separately would call for proper control of multiple testing – with potentially severe drawbacks for the power to detect any interaction.
7 In a subsequent exploratory analysis we also checked for nonlinearity by including quadratic terms in the regression model. We noticed, however, no substantial gain in explanatory power.
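The check described in footnote 7 amounts to comparing R² between a linear model and one augmented with a quadratic term. A minimal sketch on synthetic data (the variable names and the simulated relationship are illustrative, not the study's actual regression):

```python
import numpy as np

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=200)  # truly linear relation

r2_lin = r_squared(x[:, None], y)                     # linear term only
r2_quad = r_squared(np.column_stack([x, x ** 2]), y)  # plus quadratic term
gain = r2_quad - r2_lin
```

When the underlying relation is linear, as here, the gain from the quadratic term is negligible, mirroring the footnote's conclusion.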
8 This follows from the fact that the basic building block of the method rests on three items.
9 A drastic example of this phenomenon is cited on page 20 of Agresti (2009).