
Medical education research study quality instrument: an objective instrument susceptible to subjectivity

Article: 2308359 | Received 28 Aug 2023, Accepted 17 Jan 2024, Published online: 24 Jan 2024

ABSTRACT

Background: The medical education research study quality instrument (MERSQI) was designed to appraise medical education research quality based on study design criteria. As with many such tools, application of the results may have unintended consequences. This study applied the MERSQI to published medical education research identified in a bibliometric analysis. Methods: A bibliometric analysis identified highly cited articles in medical education, which two authors independently evaluated using the MERSQI. After screening out duplicate and non-research articles, the authors reviewed 21 articles with the quality instrument. Initially, five articles were reviewed independently and the results were compared to ensure a shared understanding of the instrument items. The remaining articles were then reviewed independently. Overall scores for the articles were analyzed with a paired samples t-test, and individual item ratings were analyzed for inter-rater reliability. Results: There was a significant difference in mean MERSQI score between reviewers. Inter-rater reliability for the MERSQI items labeled response rate, validity, and outcomes was unacceptable. Conclusions: These results provide evidence that MERSQI items can be significantly influenced by interpretation, which leads to differences in scoring. The MERSQI is a useful guide for identifying research methodologies. However, in its current format it should not be used to make judgments on the overall quality of medical education research methodology. The authors make specific recommendations for how the instrument could be revised for greater clarity and accuracy.

Background

Medical education research and scholarship submissions continue to proliferate. Many editors of medical education journals have noted the increased number of submissions they receive, which surged during the recent pandemic. Over the years, there have been attempts to address the quality of educational research, particularly as it relates to published work, peer-review processes, and study design [Citation1]. These attempts did not specifically address educational research methodology, which critics have noted as the primary quality concern [Citation1,Citation2].

The MERSQI was designed to assess the quality of medical education research methodology and to offer guidance on medical education research design [Citation2]. The MERSQI specifically focuses on the quality of experimental, quasi-experimental, and observational studies. Upon closer inspection, the MERSQI scores research studies on multiple categories, such as study design, sampling (institutions and response rate), type of data, data analysis, validity, and outcomes [Citation1]. Researchers in internal medicine and in obstetrics and gynecology have tested its validity and usefulness in scoring literature [Citation3,Citation4]. Studies indicate that manuscripts with certain MERSQI scores are more likely to be published than others; specifically, the average score for published studies was 10.7 (SD 2.5) compared with 9.0 (SD 2.4) for rejected studies [Citation5].

Although helpful for researchers as a guide for study design, a score should not dictate whether a study should be conducted. Strict adherence to standardized instruments such as the MERSQI may result in innovative research being overlooked. The senior author (GLBD) participated in a research group that was using the MERSQI for exactly that purpose.

Another example of how recommended standardization may be used inappropriately is reflected in the recommendations introduced by Artino et al. [Citation6] for reporting survey studies. The senior author (GLBD) has received comments on journal submissions from reviewers who treat these recommendations as absolutes, insisting that every item on the checklist be met before a manuscript can be considered for revision. Other submissions that focused on program evaluation were mistaken for surveys simply because a popular web-based survey platform was used. When the reviewer saw the software, they indicated that our work should be rejected because we did not follow the published guidelines [Citation6].

Survey research reporting offers one example of how a reporting checklist can produce unintended consequences when it strays from its intended use. Although considering quality in medical education research is important, placing too much weight on a score may result in similar unintended consequences. The goal of this study was to apply the MERSQI to highly cited medical education articles to determine whether there are significant discrepancies in how articles are rated, and whether the MERSQI would rate these highly cited articles below the average score of published studies.

Materials and methods

The MERSQI was applied to articles identified in a bibliometric analysis [Citation7]. The MERSQI was constructed using the literature to guide the item language. The final MERSQI includes 10 items, which can be combined into an overall score. However, the interpretation of scores should focus on item-specific codes rather than an overall assessment of quality [Citation1,Citation5].

Bibliometric analyses are used to identify the most frequently cited articles in a particular field of study [Citation8]. Azer’s bibliometric analysis was selected because it identified medical education articles based on the number of citations and keywords, making for a comprehensive review [Citation7]. The bibliometric analysis included non-research articles, which were removed from the sample. Azer’s results included 112 medical education articles that were screened for evaluation using the MERSQI. The article included two lists, and duplicates were screened out. The final sample included 11 articles from List A and 10 from List B, for a total of 21 articles.

The authors have experience as researchers in various settings. Although one author is a medical student, his experiences in undergraduate studies and in medical school have provided ample opportunities to understand research methodologies. The senior author has more than 25 years of experience conducting educational research. The research team therefore has the appropriate experience to analyze educational research methods by applying the MERSQI in a systematic way.

The authors independently evaluated articles identified in Azer’s study [Citation7] using the MERSQI. After completing five reviews, the authors met to ensure there was agreement about how to rate the MERSQI items. For items that were unclear, the research team discussed the article and the relevant MERSQI item extensively to ensure mutual understanding. After this meeting, the researchers independently reviewed the remaining articles. The MERSQI form was set up in Qualtrics (Provo, UT) to collect the data.

Inter-rater reliability was calculated for each MERSQI item; this requires a minimum of two independently conducted reviews. We interpreted Cronbach’s alpha results as excellent (≥ .9), good (.8–.9), acceptable (.7–.8), fair (.6–.7), poor (.5–.6), and unacceptable (< .5). Overall scores were analyzed to determine whether they met the requirements for a normal distribution. The data were normally distributed, so the overall MERSQI scores of the two reviewers were compared using a paired samples t-test. Effect size was calculated using Cohen’s d (.2 = small effect, .5 = medium effect, .8 = large effect) [Citation9].
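The following is a minimal sketch, in Python, of the analyses described above. It is not the authors’ analysis code; the rating values, simulated score distributions, and variable names are hypothetical placeholders for illustration only.

```python
# Minimal sketch of the analyses described above (not the authors' code).
# All rating values below are hypothetical placeholders for illustration.
import numpy as np
from scipy import stats


def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha with rows = articles and columns = raters (k = 2 here)."""
    k = ratings.shape[1]
    rater_vars = ratings.var(axis=0, ddof=1).sum()   # sum of per-rater variances
    total_var = ratings.sum(axis=1).var(ddof=1)      # variance of article sums
    return k / (k - 1) * (1 - rater_vars / total_var)


rng = np.random.default_rng(0)

# Hypothetical ratings for one MERSQI item: 21 articles rated by 2 reviewers.
item_ratings = rng.integers(1, 4, size=(21, 2)).astype(float)
alpha = cronbach_alpha(item_ratings)

# Hypothetical total MERSQI scores per reviewer for the same 21 articles.
reviewer1 = rng.normal(9.8, 2.3, size=21)
reviewer2 = rng.normal(11.1, 2.3, size=21)

# Check normality of the paired differences before running the t-test.
normality = stats.shapiro(reviewer1 - reviewer2)

# Paired samples t-test comparing total scores between the two reviewers.
ttest = stats.ttest_rel(reviewer1, reviewer2)

# Cohen's d for paired data: mean difference over the SD of the differences.
diff = reviewer1 - reviewer2
d = diff.mean() / diff.std(ddof=1)

print(f"alpha = {alpha:.2f}")
print(f"Shapiro-Wilk p = {normality.pvalue:.3f}")
print(f"t = {ttest.statistic:.2f}, p = {ttest.pvalue:.3f}, d = {d:.2f}")
```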

Results

Each MERSQI item was analyzed for inter-rater reliability. Four of the items had good or excellent reliability (α ≥ .8). The MERSQI items classified as Response Rate, Validity, and Outcomes were all considered unacceptable (α < .5) (Table 1).

Table 1. Inter-rater reliability of MERSQI items.

A total MERSQI score was calculated by each reviewer for each article (Table 2). A paired samples t-test identified a significant difference in mean MERSQI score between reviewer 1 (M = 9.83, SD = 2.25) and reviewer 2 (M = 11.14, SD = 2.34); t(21) = 3.71, p < .001, d = .81.

Table 2. Total MERSQI score by reviewer.
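For context, under the common paired-samples formulation (an assumption about how the effect size was computed here), Cohen’s d is the mean of the paired differences divided by the standard deviation of those differences, which equals the t statistic divided by the square root of the number of pairs:

$$ d = \frac{\bar{D}}{s_D} = \frac{t}{\sqrt{n}} \approx \frac{3.71}{\sqrt{21}} \approx 0.81, $$

which is consistent with the reported effect size for the 21 articles.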

Discussion

Based on our findings, scoring highly cited medical education research using the MERSQI failed to yield consistent results. Response Rate, Validity, and Outcomes fell well below an acceptable Cronbach's alpha for inter-rater reliability. Additionally, overall MERSQI scores were significantly different between reviewers. This calls into question whether this particular scale is useful for determining the quality of educational research methodology.

Response Rate was challenging to determine when reviewing the articles, especially for manuscripts that were meta-analyses. A meta-analysis does not specify a response rate, so it falls to the rater to decide whether to count the collection of data as a response rate or to record that no response rate was reported. Although meta-analyses and other literature reviews are considered medical education research, using the MERSQI to rate them was challenging, particularly for Response Rate. Clarifying what constitutes a response in such studies may improve the reproducibility of the response rate item of the MERSQI.

Understanding of validity evidence has evolved over time [Citation31]. For many reviewed studies, multiple sources of validity evidence were identified. However, we used a single-best-option response, which may run counter to what the MERSQI creators envisioned [Citation5]. Our poor inter-rater reliability might have been mitigated had we checked all items that applied and calculated a grand sum. Based on our work, we recommend the MERSQI be revised to include more options for validity evidence and to allow multiple items to be selected in scoring.

The lack of inter-rater reliability for Outcomes fell into the same trap as the Validity item. Multiple outcome types were reported, making it difficult to determine which single option was best. Although not shown in our data, we discussed the outcome ratings and adjusted scores after reaching agreement, which slightly improved inter-rater reliability. This item should also be revised to indicate that multiple options may be selected.

The statistically significant difference in overall MERSQI scores, with a large effect size, was most likely the result of inaccurate item scores. For four articles, the reviewers' scores ranged from below 9 to above 10. As stated previously, research has shown that the average score of published studies was greater than 10, while the average score of rejected studies was 9 or below [Citation5]. This highlights how differences in the interpretation of the MERSQI can lead to markedly different publication outcomes if too much weight is placed on a particular checklist or scale.

Lastly, the data show that, using the MERSQI, both raters scored 7 of the same articles from the bibliometric analysis below 9. If MERSQI scores alone drove the decision to accept, these articles might not have been published, and yet they are among the most highly cited articles identified in the bibliometric analysis [Citation7].

There have been other published recommendations seeking to standardize the reporting of survey research and qualitative research. Artino et al. [Citation6] introduced guidelines for reporting survey studies, which were mentioned previously. Additionally, O'Brien et al. [Citation32] produced a similar article for qualitative research. Both articles attempt to offer a form of standardized reporting; neither suggests that its criteria be used to judge the quality of the study being reported. Similarly, although there are examples of meetings using the MERSQI to screen the merits of potential projects [Citation3], we believe the MERSQI should be used to inform researchers of key components to include as they develop and report their studies.

This study is limited in that we focused solely on one scale for analyzing already-published articles. However, given that the MERSQI has been used as a means of determining the quality of medical education research [Citation3,Citation4], using it to rate highly cited medical education research has uncovered flaws in the instrument with regard to specific items. If an overall score continues to be used for screening purposes, revisions should be made to address the limitations posed by Response Rate, Validity, and Outcomes. It would also be beneficial to conduct additional MERSQI scoring with multiple reviewers in order to carry out a generalizability study, which is more robust than inter-rater reliability analysis [Citation33].

Conclusion

Our study has shown discrepancies between users' ratings with the MERSQI checklist. Therefore, the MERSQI should be used as a guideline for appraising and designing research, not as a screening tool. Decisions about undertaking educational research projects should employ a holistic approach, to which the MERSQI could contribute.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The author(s) reported that there is no funding associated with the work featured in this article.

References

  • Cook DA, Reed DA. Appraising the quality of medical education research methods: the medical education research study quality instrument and the Newcastle-Ottawa Scale-education. Acad Med. 2015;90(8):1067–1076. doi: 10.1097/ACM.0000000000000786
  • Reed DA, Cook DA, Beckman TJ, et al. Association between funding and quality of published medical education research. JAMA. 2007;298(9):1002–1009. doi: 10.1001/jama.298.9.1002
  • Smith RP, Learman LA. A plea for MERSQI: the medical education research study quality instrument. Obstet Gynecol. 2017;130(4):686–690. doi: 10.1097/AOG.0000000000002091
  • Sawatsky AP, Beckman TJ, Edakkanambeth Varayil J, et al. Association between study quality and publication rates of medical education abstracts presented at the society of general internal medicine annual meeting. J Gen Intern Med. 2015;30(8):1172–1177. doi: 10.1007/s11606-015-3269-7
  • Reed DA, Beckman TJ, Wright SM, et al. Predictive validity evidence for medical education research study quality instrument scores: quality of submissions to JGIM’s medical education special issue. J Gen Intern Med. 2008;23(7):903–907. doi: 10.1007/s11606-008-0664-3
  • Artino AR Jr, Durning SJ, Sklar DP. Guidelines for reporting survey-based research submitted to Academic Medicine. Acad Med. 2018;93(3):337–340. doi: 10.1097/ACM.0000000000002094
  • Azer SA. The top-cited articles in medical education: a bibliometric analysis. Acad Med. 2015;90(8):1147–1161. doi: 10.1097/ACM.0000000000000780
  • Moed HF. New developments in the use of citation analysis in research evaluation. Arch Immunol Ther Exp (Warsz). 2009;57:13–18. doi: 10.1007/s00005-009-0001-5
  • Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. New York, NY: Routledge; 1988.
  • Albanese MA, Mitchell S. Problem-based learning: a review of literature on its outcomes and implementation issues. Acad Med. 1993 Jan;68(1):52–81. doi: 10.1097/00001888-199301000-00012
  • Regehr G, MacRae H, Reznick RK, et al. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med. 1998 Sep;73(9):993–7.
  • Sutcliffe KM, Lewton E, Rosenthal MM. Communication failures: an insidious contributor to medical mishaps. Acad Med. 2004 Feb;79(2):186–94. doi: 10.1097/00001888-200402000-00019
  • Newble DI, Jaeger K. The effect of assessments and examinations on the learning of medical students. Med Educ. 1983 May;17(3):165–71. doi: 10.1111/j.1365-2923.1983.tb00657.x
  • Massaro TA. Introducing physician order entry at a major academic medical center: II. Impact on medical education. Acad Med. 1993 Jan;68(1):25–30. doi: 10.1097/00001888-199301000-00004
  • Faulkner H, Regehr G, Martin J, et al. Validation of an objective structured assessment of technical skill for surgical residents. Acad Med. 1996 Dec;71(12):1363–5.
  • Papadakis MA, Hodgson CS, Teherani A, et al. Unprofessional behavior in medical school is associated with subsequent disciplinary action by a state medical board. Acad Med. 2004 Mar;79(3):244–9.
  • Hojat M, Mangione S, Nasca TJ, et al. An empirical study of decline in empathy in medical school. Med Educ. 2004 Sep;38(9):934–41. doi: 10.1111/j.1365-2929.2004.01911.x
  • Palepu A, Friedman RH, Barnett RC, et al. Junior faculty members’ mentoring relationships and their professional development in U.S. medical schools. Acad Med. 1998 Mar;73(3):318–323. doi: 10.1097/00001888-199803000-00021
  • Vernon DT, Blake RL. Does problem-based learning work? A meta-analysis of evaluative research. Acad Med. 1993 Jul;68(7):550–63. doi: 10.1097/00001888-199307000-00015
  • Kaufman A, Mennin S, Waterman R, et al. The New Mexico experiment: educational innovation and institutional change. Acad Med. 1989 Jun;64(6):285–94.
  • Baer RA, Smith GT, Hopkins J, et al. Using self-report assessment methods to explore facets of mindfulness. Assessment. 2006, Mar;13(1):27–45. doi: 10.1177/1073191105283504
  • Peabody JW, Luck J, Glassman P, et al. Comparison of vignettes, standardized patients, and chart abstraction: a prospective validation study of 3 methods for measuring quality. JAMA. 2000 Apr 5;283(13):1715–22. doi: 10.1001/jama.283.13.1715
  • Roter DL, Hall JA, Kern DE, et al. Improving physicians’ interviewing skills and reducing patients’ emotional distress. A randomized clinical trial. Arch Intern Med. 1995 Sep 25;155(17):1877–1884.
  • Scott DJ, Bergen PC, Rege RV, et al. Laparoscopic training on bench models: better and more cost effective than operating room experience? J Am Coll Surg. 2000 Sep;191(3):272–283. doi: 10.1016/s1072-7515(00)00339-2
  • Fletcher G, Flin R, McGeorge P, et al. Anaesthetists’ non-technical skills (ANTS): evaluation of a behavioural marker system. Br J Anaesth. 2003 May;90(5):580–8.
  • Dochy F, Segers M, Van den Bossche P, et al. Effects of problem-based learning: a meta-analysis. Learn Instruct. 2003;13(5):533–568. doi: 10.1016/S0959-4752(02)00025-7
  • Wetzel MS, Eisenberg DM, Kaptchuk TJ. Courses involving complementary and alternative medicine at US medical schools. JAMA. 1998 Sep 2;280(9):784–787. doi: 10.1001/jama.280.9.784
  • Papadakis MA, Teherani A, Banach MA, et al. Disciplinary action by medical boards and prior behavior in medical school. N Engl J Med. 2005 Dec 22;353(25):2673–82. doi: 10.1056/NEJMsa052596
  • Feudtner C, Christakis DA, Christakis NA. Do clinical clerks suffer ethical erosion? students’ perceptions of their ethical environment and personal development. Acad Med. 1994, Aug;69(8):670–679. doi: 10.1097/00001888-199408000-00017
  • Norcini JJ, Blank LL, Duffy FD, et al. The mini-CEX: a method for assessing clinical skills. Ann Intern Med. 2003 Mar 18;138(6):476–81. doi: 10.7326/0003-4819-138-6-200303180-00012
  • AERA, APA & NCME. Standards for educational and psychological testing. Washington, DC: American Educational Research Association; 2014.
  • O’Brien BC, Harris IB, Beckman TJ, et al. Standards for reporting qualitative research: a synthesis of recommendations. Acad Med. 2014;89(9):1245–1251. doi: 10.1097/ACM.0000000000000388
  • Dunn G. Review papers: design and analysis of reliability studies. Stat Methods Med Res. 1992;1(2):123–57. doi: 10.1177/096228029200100202