ORIGINAL ARTICLE

Data quality and confirmatory factor analysis of the Danish EUROPEP questionnaire on patient evaluation of general practice

Pages 174-180 | Received 17 Apr 2006, Published online: 12 Jul 2009

Abstract

Objective. The Danish version of the 23-item EUROPEP questionnaire measuring patient evaluation of general practice has not been evaluated with regard to psychometric properties. This study aimed to assess data quality and internal consistency and to validate the proposed factorial structure. Setting. General practice in Denmark. Subjects. A total of 703 general practitioners (GPs). Some 83 480 questionnaires were distributed to consecutive patients aged 18 or more attending practice during the daytime. A total of 56 594 eligible patients responded (67.8%). Main outcome measures. Data quality (mean, median, item response, missing, floor and ceiling effects), internal consistency (Cronbach's alpha and average inter-item correlation), item-rest correlations. Model fit from confirmatory factor analysis (CFA). Results. The distribution was skewed to the left for almost all items with a small floor effect (0.1–9.3%) and a ceiling effect larger than 15% (18.6–56.3%). Item response was high. For seven items "not applicable/relevant" represented more than 10% of the answers. Internal consistency was good. Item-rest correlations were below 0.60 for three items, and four items had lower correlations with their own domain than with other domains. CFA showed that four domains were highly correlated and that model fit was good for two indices (TLI and SRMR), acceptable for one index (CFI), and poor for three indices (chi-squared, RMSEA and WRMR). Conclusions. This study revealed high ceiling effects, a few items with low item-rest correlation and low item discriminant validity, and an uncertain model fit. There seems to be a need to revise the response categories to reduce the ceiling effect, and it remains unclear how the proposed domains should be used. Future research should focus on evaluating the factorial structure when the ceiling effect has been lowered, on whether items should be deleted, and on assessing the unidimensionality of each domain.

Patient evaluation of healthcare services contributes to the basis of quality development in the healthcare sector Citation[1], Citation[2], which, in turn, presupposes the availability of scientifically sound instruments facilitating valid and relevant data collection Citation[3], Citation[4].

In 1993, the EU BIOMED study EUROPEP was launched to develop an international patient evaluation questionnaire for general practice Citation[5], Citation[6]. As in other European countries, a national project based on this questionnaire, entitled "Danish Patients Evaluate General Practice" (DanPEP), was launched in 2002. This Danish version of the questionnaire was introduced as a tool for assessing patients' evaluation of general practice, and consequently psychometric issues arose. So far, development had focused especially on aspects of content and construct validity, and to some extent on criterion validity Citation[5–10]. The questionnaire was developed so that each item provided information on its own, not as part of sum-scales. However, the items were categorized into five qualitatively developed domains (doctor–patient relationship, medical care, information and support, organization of services, and accessibility).

Thus, the question remains whether these domains can be used both to categorize the items and as sum-scales. Further, data quality and internal consistency must also be assessed Citation[11–13].

The aim of this study was to assess data quality and internal consistency and to validate the factorial structure with five domains of the EUROPEP questionnaire.

The EUROPEP evaluation questionnaire was developed using a comprehensive approach comprising literature studies, patient interviews, and priority and validation studies. However, we lack knowledge about its psychometric performance.

  • Data from more than 50 000 patient evaluations indicated good quality data, good internal consistency, and low floor effect.

  • The ceiling effect was consistently very high.

  • Three of six goodness-of-fit indices showed poor, two showed good, and one index showed acceptable fit of the questionnaire domains.

  • There is a need to develop the response categories and the factorial structure of the questionnaire.

Material and methods

The questionnaire

The EUROPEP questionnaire was developed on the basis of a systematic literature review, patient interviews, and an empirical study designed to determine patient priorities Citation[6–8]. A cross-national validation study of the 23-item questionnaire was performed before its introduction Citation[9], Citation[10].

The patients were asked to evaluate their GP on a five-point scale (1–5) ranging from “poor” (1), through “acceptable” (3) to “excellent” (5) (2 and 4 had no text). Alternatively, patients could use the response option “Not applicable/relevant” Citation[5]. All items were scored in the same direction. The EUROPEP questionnaire was not developed as a rating scale, but as a collection of several equal items (“indicator” variables Citation[11]) measuring a construct.

GP and patient populations

From 2002 to 2005, 703 GPs participated in the DanPEP survey. Each GP personally distributed questionnaires to a number of consecutive patients aged 18 or more attending the surgery and able to read and understand Danish. The GPs were divided into three groups because other aims of the project were to study the effect of reminders and of using postal questionnaires to patients. Hence, 121 GPs each distributed 130 questionnaires to consecutive patients Citation[14], 391 GPs each distributed 100 questionnaires to consecutive patients, whereas for 191 GPs the questionnaires were sent by the secretary directly to 150 patients each. A total of 83 480 questionnaires were distributed. Non-responders from the two latter groups received a reminder from the DanPEP secretariat 3–5 weeks after the consultation Citation[9]. Patients completed the questionnaire at home and returned it to the DanPEP secretariat. A total of 56 594 eligible patients responded (67.8%). Questionnaires were produced and optically scanned using Cardiff TELEform.

Analyses

The analyses consisted of two parts: First, we assessed the data quality, internal consistency and correlations between items and domains and between domains. Second, we explored the five-domain structure using confirmatory factor analysis.

Analyses were performed with Stata 9 and factor analysis with Mplus 4.1 Citation[15]. The analyses included questionnaires in which at least 50% (12 or more) of the items had been answered.

Data quality was assessed in terms of mean with standard deviation, median, percentage of missing data, number of “not applicable/relevant” answers and extent of ceiling and floor effects. Floor and ceiling effects between 1% and 15% were defined as optimal Citation[16]. Internal consistency was assessed using Cronbach's alpha and average inter-item correlation. Only answers on the 1–5-point scale were included. We defined an alpha of 0.80 as the lowest acceptable value Citation[3], Citation[4]. In contrast to alpha, the average inter-item correlation is independent of the number of items and sample size when measuring internal consistency. We regarded an average inter-item correlation of at least 0.50 as good Citation[12].
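The data-quality and internal-consistency measures described above can be sketched as follows. This is a minimal illustration (not the authors' Stata code): it assumes responses are coded 1–5 with NaN for missing or "not applicable/relevant" answers, and the function names are our own.

```python
import numpy as np

def data_quality(items):
    """Per-item data quality for an (n_respondents, n_items) array of
    responses coded 1-5; NaN marks missing/'not applicable' answers."""
    stats = []
    for col in items.T:
        valid = col[~np.isnan(col)]
        stats.append({
            "mean": valid.mean(),
            "sd": valid.std(ddof=1),
            "median": np.median(valid),
            "floor_pct": 100 * np.mean(valid == 1),    # lowest category
            "ceiling_pct": 100 * np.mean(valid == 5),  # highest category
        })
    return stats

def cronbach_alpha(items):
    """Cronbach's alpha for one domain, complete cases only."""
    x = items[~np.isnan(items).any(axis=1)]
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def average_interitem_corr(items):
    """Mean of the off-diagonal entries of the item correlation matrix;
    unlike alpha, this does not grow with the number of items."""
    x = items[~np.isnan(items).any(axis=1)]
    r = np.corrcoef(x, rowvar=False)
    k = r.shape[0]
    return (r.sum() - k) / (k * (k - 1))
```

A ceiling percentage above 15% for an item would be flagged as suboptimal under the criterion used in the article.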

The correlation analyses assessed whether each item had a high correlation with the sum score of the rest of the scale (internal item convergence) and a higher correlation with the items in its own scale than with those of other scales (item discriminant validity) Citation[3], Citation[17], Citation[18]. We required a correlation of at least 0.60 to reflect a high level of internal convergence Citation[11]. We defined sufficient item discriminant validity as a correlation with the items in its own scale two standard deviations above that obtained with other scales (calculated as the 95% confidence intervals for the coefficients based on Fisher's transformation [Stata, ci2-option]) Citation[12].
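The item-rest correlation and the Fisher-transformed confidence interval can be sketched as below. This is an illustrative reimplementation, not the authors' Stata `ci2` analysis; the hard-coded 1.96 is the standard normal quantile for a 95% interval.

```python
import numpy as np

def item_rest_correlation(items, i):
    """Correlation of item i with the sum of the remaining items in its
    scale (internal item convergence); complete cases only."""
    x = items[~np.isnan(items).any(axis=1)]
    rest = np.delete(x, i, axis=1).sum(axis=1)
    return np.corrcoef(x[:, i], rest)[0, 1]

def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for a correlation via Fisher's z-transformation:
    z = arctanh(r) is approximately normal with SE 1/sqrt(n - 3)."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)
```

With samples of this study's size, the intervals become very narrow, which is why even small differences between own-domain and other-domain correlations can be statistically distinguishable.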

The factorial structure was evaluated by confirmatory factor analysis (CFA) where items were analysed as categorical measures with a mean- and variance-adjusted weighted least squares (WLSMV) estimator. The objective of the CFA was to explore to what degree the correlations between the variables could be explained by the five domains (factors). Thus, we defined a basic model where an item was linked to its own domain (see ), with unspecified correlation between domains. A number of indices are available to assess the fit of a model based on categorical data and we present the six indices which have been shown to be useful in assessing model fit Citation[19–22].

  1. The chi-squared goodness-of-fit statistic assesses the discrepancy between the sample and fitted covariance matrix (the null hypothesis is that the model fits the data; a non-significant test, p-value above 0.1, indicates good fit). The chi-squared statistic is extremely sensitive to sample size (from about 200 cases) and in large samples it tends to reject the model. For this reason, additional, less sensitive fit indices are recommended (based on the non-centrality parameter and taking sample size and degrees of freedom into account).

  2. Comparative fit index (CFI) assesses fit relative to a null model and ranges from 0 to 1 with values of 0.90–0.95 indicating acceptable and over 0.95 good fit.

  3. Tucker Lewis index (TLI) adjusts for the number of parameters of the model and is interpreted as CFI.

  4. Root mean square error of approximation (RMSEA) expresses the lack of fit per degree of freedom of the model. Values are interpreted as follows: ≤0.05 indicates very good, >0.05–0.08 good, and ≥0.10 poor fit.

  5. Standardized root mean square residual (SRMR) is the average of the differences between the observed and predicted correlations and has a range from 0 to 1. Values of less than 0.08 indicate good fit.

  6. Weighted root mean square residual (WRMR) is also a residual-based measure where values above 0.90 indicate poor fit.
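For orientation, several of these indices can be computed directly from the model and null-model chi-squared values. The sketch below uses the standard maximum-likelihood-based formulas; note that Mplus applies WLSMV-specific adjustments internally, so this is illustrative only.

```python
import math

def cfi(chi2, df, chi2_null, df_null):
    """Comparative fit index: improvement of the model over the null
    (independence) model, based on non-centrality (chi2 - df)."""
    d_m = max(chi2 - df, 0.0)
    d_0 = max(chi2_null - df_null, d_m)
    return 1.0 - d_m / d_0 if d_0 > 0 else 1.0

def tli(chi2, df, chi2_null, df_null):
    """Tucker-Lewis index: like CFI but penalizes model complexity
    via the chi-squared/df ratios."""
    r0 = chi2_null / df_null
    return (r0 - chi2 / df) / (r0 - 1.0)

def rmsea(chi2, df, n):
    """Root mean square error of approximation: lack of fit per
    degree of freedom, adjusted for sample size n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))
```

A model whose chi-squared equals its degrees of freedom yields CFI = TLI = 1 and RMSEA = 0, which is why these indices are far less affected by the large sample than the raw chi-squared test.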

Table I.  Data quality (mean value (mean), standard deviation (SD), median, item response, answers in the "Not applicable/relevant" category, missing answers, and answers in the lowest (floor) and highest (ceiling) categories) and internal consistency (Cronbach's alpha and average inter-item correlation for the domain).

Results

Data quality analysis

For all items except 20, 21, and 22 the distribution was skewed to the left (). The item response was high with a small number of missing answers (0.3–1.7%). The response category “not applicable/relevant” was used by 0.3–30.7% of responders. For 7 items (6, 7, 11, 14, 17, 21, and 23) this category represented more than 10% of the answers. Floor effect was small (range 0.1–9.3%) and all items had a ceiling effect larger than 15% (range 18.6–56.3%).

Cronbach's alpha ranged from 0.80 to 0.92, and average inter-item correlation ranged from 0.64 to 0.72.

Item-rest correlations () (internal item convergence) ranged from 0.55 to 0.84 and were below 0.60 for three items (6, 18, and 22). For four items (6, 11, 16, and 17) the correlation with their own domain (item discriminant validity) was lower than with other domains. Four domains (Dim 1 to Dim 4) were highly correlated (range 0.86–0.92). These domains had lower correlations with the "Accessibility" domain.

Table II.  Correlations between domains and correlations between items and (1) the rest of the items in its own scale (item-rest correlation) and (2) the other domains (Dim 1 to Dim 5).

Confirmatory factor analysis

The CFA showed high factor loadings for all items (). The indices for model fit () show that the model fitted the data well for two indices (TLI and SRMR), acceptably for one index (CFI), and poorly for three indices (chi-squared, RMSEA, and WRMR).

Table III.  Results of the confirmatory factor analysis showing the standardized factor loadings and standardized residuals for each item when modelled with its own domain.

Table IV.  Model statistics of the CFA of the Danish version of the EUROPEP questionnaire (n = 20 072).

Discussion

In agreement with other authors we found a skewed distribution for almost all items Citation[23]. Consequently, the ceiling effect was high, indicating that the full evaluation range was not captured Citation[16], which may lower the responsiveness of the questionnaire Citation[24]. As the ceiling effect was seen for all items, this clearly calls for a change in the response categories. We saw a small number of missing responses, but for seven items more than 10% of respondents found the questions irrelevant or not applicable. These items represented situations where respondents were supposed to have experienced health problems, and they were thus not relevant to all respondents in general practice. Responses to these questions may consequently differ depending on which patients answer them.

We found a few items with low item-rest correlation and higher correlations to other domains than their own. Thus, it is possible that some items should be assigned to another domain. We also saw high correlations between four of the five domains, which may be a result of cross-correlation. Still, we found a high internal consistency of the domains which indicates that the items behave reasonably similarly within the domains. However, this would also be seen if there was a high cross-correlation between items from different domains.

The high correlations between domains seem to be confirmed by the CFA. The proposed model of the EUROPEP questionnaire did not fit the data according to all goodness-of-fit measures. The chi-squared test was expected to reject the model because of the large sample size. Even if we ignored this test, two indices showed good, two poor, and one acceptable fit. Thus, we were not able to determine the fit of the proposed factorial structure. We made additional analyses (not shown) with different values for the correlations between the domains, but these models did not improve the results compared with the basic model. We also considered removing one or more items, but as seen in , no item had a particularly low loading. The goodness-of-fit indices may, however, have been strongly affected by the highly skewed response distributions.

Strengths and weaknesses

We obtained a high response rate, minimizing the risk of selection bias due to dropout. The questionnaires were handed out personally by the GPs, which may have given the GPs the possibility of excluding some patients (e.g. those with the most negative attitudes) from the study. The patients were included when attending the surgery, and frequent attenders were thus more likely to be included. However, this selection method ensured that the patients were able to evaluate the GP, which in turn enabled them to better answer the questions. These issues of selection bias may have affected the actual scores, but are unlikely to have changed the factorial structure of the questionnaire. We have previously shown that although non-responders may evaluate their GPs more negatively, this would only result in a small change in the complete evaluation Citation[14].

However, the large sample size, while providing very high statistical precision, also led to very narrow confidence intervals for the correlation coefficients and affected Cronbach's alpha, which is sample-size sensitive (there is no simple rule here, but the results should be treated with caution if alpha is below 0.85 in a large sample Citation[11]).

Implications and future research

This study revealed problems with the Danish version of the EUROPEP questionnaire: high ceiling effects, some items with low item-rest correlation and low item discriminant validity, and an uncertain model fit of the proposed factorial structure. Consequently, there seems to be a need to improve the Danish version of the questionnaire, and the study especially emphasizes the need to develop new response categories to lower the ceiling effect. Subsequently, the factorial structure should be reassessed, since new response categories will change the item variances, and possible composite subscale scores and the behaviour of the categorical response categories should be explored using item response theory. Further, future research should focus on providing an external anchor, which would make it possible to analyse whether the indices (multi-item model) can be used without loss of information compared with a single-item model, on evaluating whether items should be deleted, and finally on assessing the unidimensionality of each domain.

Acknowledgements

The authors would like to thank Ms Gitte Hove, MLISc, for data management. This part of the DanPEP project was funded by the Central Committee on Quality and Informatics in General Practice, the Ministry of the Interior and Health, and the regional committees on quality development in the counties of Aarhus, Frederiksborg, Funen, Southern Jutland, and Vejle and the municipalities of Copenhagen and Frederiksberg. They are grateful to the many patients and their general practitioners for their contributions to this study.

References

  • Wensing M, Baker R. Patient involvement in general practice care: a pragmatic framework. Eur J Gen Pract 2003; 9: 62–5
  • Anden A, Andersson SO, Rudebeck CE. Concepts underlying outcome measures in studies of consultations in general practice. Scand J Prim Health Care 2006; 24: 218–23
  • Sitzia J. How valid and reliable are patient satisfaction data? An analysis of 195 studies. Int J Qual Health Care 1999; 11: 319–28
  • McDowell I, Newell C. Measuring health: A guide to rating scales and questionnaires. Oxford University Press, Oxford 1996
  • Wensing M, Mainz J, Grol R. A standardised instrument for patient evaluations of general practice care in Europe. Eur J Gen Pract 2000; 6: 82–7
  • Grol R, Wensing M. Patients evaluate general/family practice: The EUROPEP instrument. EQuiP, WONCA Region Europe; 2000
  • Wensing M, Mainz J, Ferreira PL, Hearnshaw H, Hjortdahl P, Olesen F, et al. General practice care and patients’ priorities in Europe: An international comparison. Health Policy 1998; 45: 175–86
  • Grol R, Wensing M, Mainz J, Ferreira P, Hearnshaw H, Hjortdahl P, et al. Patients’ priorities with respect to general practice care: An international comparison. European Task Force on Patient Evaluations of General Practice (EUROPEP). Fam Pract 1999; 16: 4–11
  • Grol R, Wensing M, Mainz J, Jung HP, Ferreira P, Hearnshaw H, et al. Patients in Europe evaluate general practice care: An international comparison. Br J Gen Pract 2000; 50: 882–7
  • Mainz J, Vedsted P, Olesen F. How do the patients evaluate general practitioners? (in Danish). Ugeskr Laeger 2000; 162: 654–8
  • Fayers PM, Machin D. Quality of life: Assessment, analysis and interpretation. Wiley, Chichester 2000
  • Streiner DL, Norman GR. Health measurement scales. A practical guide to their development and use. Oxford University Press, New York 2003
  • Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: Theory and application. Am J Med 2006; 119: 166
  • Heje HN, Vedsted P, Olesen F. A cluster-randomized trial of the significance of a reminder procedure in a patient evaluation survey in general practice. Int J Qual Health Care 2006; 18: 232–7
  • Muthén LK, Muthén BO. Mplus user's guide. Muthén & Muthén, Los Angeles 2007
  • McHorney CA, Tarlov AR. Individual-patient monitoring in clinical practice: Are available health status surveys adequate? Qual Life Res 1995; 4: 293–307
  • Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess 1995; 7: 248–60
  • Ware JE Jr, Gandek B. Methods for testing data quality, scaling assumptions, and reliability: The IQOLA Project approach. International Quality of Life Assessment. J Clin Epidemiol 1998; 51: 945–52
  • McDonald RP, Ho MHR. Principles and practice in reporting structural equation analyses. Psychological Methods 2002; 7: 64–82
  • Muthen B, Kaplan D. A comparison of some methodologies for the factor-analysis of non-normal Likert variables. Br J Math Stat Psychol 1985; 38: 171–89
  • Bentler PM. Comparative fit indexes in structural models. Psychol Bull 1990; 107: 238–46
  • Hu L-T, Bentler PM. Cutoff criteria for fit indices in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal 1999; 6: 1–55
  • Milano M, Mola E, Collecchia G, Carlo AD, Giancane R, Visentin G, et al. Validation of the Italian version of the EUROPEP instrument for patient evaluation of general practice care. Eur J Gen Pract 2007; 13: 92–4
  • De Vet HC, Terwee CB, Bouter LM. Current challenges in clinimetrics. J Clin Epidemiol 2003; 56: 1137–41
