References
- Adams, R. J., Wu, M. L., & Wilson, M. R. (2012). ACER ConQuest 3.0.1 [Computer software]. Australian Council for Educational Research.
- Alonzo, A. C., & Steedle, J. T. (2009). Developing and assessing a forces and motion learning progression. Science Education, 93(3), 389–421. https://doi.org/10.1002/sce.20303
- Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69–81. https://doi.org/10.1007/BF02293746
- Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814
- Andrich, D. (2005). The Rasch model explained. In R. Maclean et al. (Eds.), Applied Rasch measurement: A book of exemplars (Vol. 4, pp. 27–59). Springer.
- Andrich, D. (2010). Sufficiency and conditional estimation of person parameters in the polytomous Rasch model. Psychometrika, 75(2), 292–308. https://doi.org/10.1007/s11336-010-9154-8
- Arslan, H. O., Cigdemoglu, C., & Moseley, C. (2012). A three-tier diagnostic test to assess pre-service teachers’ misconceptions about global warming, greenhouse effect, ozone layer depletion, and acid rain. International Journal of Science Education, 34(11), 1667–1686. https://doi.org/10.1080/09500693.2012.680618
- Bejar, I. I., & Weiss, D. J. (1977). A comparison of empirical differential option weighting scoring procedures as a function of inter-item correlation. Educational and Psychological Measurement, 37(2), 335–340.
- Ben-Simon, A., Budescu, D. V., & Nevo, B. (1997). A comparative study of measures of partial knowledge in multiple-choice tests. Applied Psychological Measurement, 21(1), 65–88.
- Bennett, R. E. (2010). Cognitively based assessment of, for, and as learning (CBAL): A preliminary theory of action for summative and formative assessment. Measurement, 8(2-3), 70–91. https://doi.org/10.1080/15366367.2010.508686
- Bo, Y. E., Lewis, C., & Budescu, D. V. (2015). An option-based partial credit item response model. In R. Millsap, D. Bolt, L. van der Ark, & W. C. Wang (Eds.), Quantitative psychology research (pp. 45–72). Springer.
- Bond, T. G., Fox, C. M., & Lacey, H. (2007). Applying the Rasch model: Fundamental measurement in the social sciences (2nd ed.). Routledge Taylor and Francis Group.
- Briggs, D. C., Alonzo, A. C., Schwab, C., & Wilson, M. (2006). Diagnostic assessment with ordered multiple-choice items. Educational Assessment, 11(1), 33–63. https://doi.org/10.1207/s15326977ea1101_2
- Brown, J. (1965). Multiple response evaluation of discrimination. British Journal of Mathematical and Statistical Psychology, 18(1), 125–137. https://doi.org/10.1111/j.2044-8317.1965.tb00696.x
- Bush, M. (2001). A multiple choice test that rewards partial knowledge. Journal of Further and Higher Education, 25(2), 157–163. https://doi.org/10.1080/03098770120050828
- Bush, M. (2015). Reducing the need for guesswork in multiple-choice tests. Assessment & Evaluation in Higher Education, 40(2), 218–231. https://doi.org/10.1080/02602938.2014.902192
- Cavers, M., & Ling, J. (2016). Confidence weighting procedures for multiple-choice tests. In D. G. Chen, J. Chen, X. Lu, G. Yi, & H. Yu (Eds.), Advanced statistical methods in data science (pp. 171–181). Springer.
- Champagne, A. B., Klopfer, L. E., & Anderson, J. H. (1980). Factors influencing the learning of classical mechanics. American Journal of Physics, 48(12), 1074–1079. https://doi.org/10.1119/1.12290
- Chi, S., Wang, Z., & Liu, X. (2019). Investigating disciplinary context effect on student scientific inquiry competence. International Journal of Science Education, 41(18), 2736–2764.
- Clement, J. (1982). Students’ preconceptions in introductory mechanics. American Journal of Physics, 50(1), 66–71. https://doi.org/10.1119/1.12989
- Coombs, C. H. (1953). On the use of objective examinations. Educational and Psychological Measurement, 13(2), 308–310.
- Coombs, C. H., Milholland, J. E., & Womer, F. B. (1956). The assessment of partial knowledge. Educational and Psychological Measurement, 16(1), 13–37.
- Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 1–28.
- Cross, L. H., Ross, F. K., & Geller, E. S. (1980). Using choice-weighted scoring of multiple-choice tests for determination of grades in college courses. The Journal of Experimental Education, 48(4), 296–301.
- Davis, F. B., & Fifer, G. (1959). The effect on test reliability and validity of scoring aptitude and achievement tests with weights for every choice. Educational and Psychological Measurement, 19(2), 159–170. https://doi.org/10.1177/001316445901900202
- DeMars, C. E. (2008, March). Scoring multiple choice items: A comparison of IRT and classical polytomous and dichotomous methods. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY, USA.
- Diedenhofen, B., & Musch, J. (2015). Empirical option weights improve the validity of a multiple-choice knowledge test. European Journal of Psychological Assessment, 33(5), 336–344. https://doi.org/10.1027/1015-5759/a000295
- diSessa, A. A. (2014). A history of conceptual change research: Threads and fault lines. In R. K. Sawyer (Ed.), The Cambridge handbook of the learning sciences (2nd ed.). Cambridge University Press.
- diSessa, A. (1983). Phenomenology and the evolution of intuition. In D. Gentner, & A. L. Stevens (Eds.), Mental models (pp. 15–33). Erlbaum.
- diSessa, A. (2007). Changing conceptual change. Human Development, 50(1), 39–46. https://doi.org/10.1159/000097683
- diSessa, A. (2013). A bird’s-eye view of the “pieces” vs. “coherence” controversy (from the “pieces” side of the fence). In S. Vosniadou (Ed.), International handbook of research on conceptual change (2nd ed., pp. 31–48). Routledge.
- diSessa, A. A., & Sherin, B. L. (1998). What changes in conceptual change? International Journal of Science Education, 20(10), 1155–1191.
- Dressel, P., & Schmid, J. (1953). Some modifications of the multiple-choice item. Educational and Psychological Measurement, 13(4), 574–595. https://doi.org/10.1177/001316445301300404
- Echternacht, G. (1976). Reliability and validity of item option weighting schemes. Educational and Psychological Measurement, 36(2), 301–309.
- Frary, R. B. (1989). Partial-credit scoring methods for multiple-choice tests. Applied Measurement in Education, 2(1), 79–96. https://doi.org/10.1207/s15324818ame0201_5
- Fulmer, G. W., Liang, L. L., & Liu, X. (2014). Applying a forces and motion learning progression over an extended time span using the force concept inventory. International Journal of Science Education, 36(17), 2918–2936. https://doi.org/10.1080/09500693.2014.939120
- Gao, Y., Zhai, X., Andersson, B., Zeng, P., & Xin, T. (2020). Developing a learning progression of buoyancy to model conceptual change: A latent class and rule space model analysis. Research in Science Education, 50(4), 1369–1388.
- Gilman, D. A., & Ferry, P. (1972). Increasing test reliability through self-scoring procedures. Journal of Educational Measurement, 9(3), 205–207.
- Haladyna, T. M. (2004). Developing and validating multiple-choice test items (3rd ed.). Lawrence Erlbaum.
- Halloun, I. A., & Hestenes, D. (1985). Common sense concepts about motion. American Journal of Physics, 53(11), 1056–1065. https://doi.org/10.1119/1.14031
- Hardy, J., Bates, S. P., Casey, M. M., Galloway, K. W., Galloway, R. K., Kay, A. E., … McQueen, H. A. (2014). Student-generated content: Enhancing learning through sharing multiple-choice questions. International Journal of Science Education, 36(13), 2180–2194. https://doi.org/10.1080/09500693.2014.916831
- Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & Debarger, A. H. (2019). Designing knowledge-in-use assessments to promote deeper learning. Educational Measurement: Issues and Practice, 38(2), 53–67.
- Härtig, H., Nordine, J. C., & Neumann, K. (2020). Contextualisation in the assessment of students’ learning about science. In T. I. Sánchez (Ed.), International perspectives on the contextualisation of science education (pp. 113–144). Springer.
- Higham, P. A. (2013). Regulating accuracy on university tests with the plurality option. Learning and Instruction, 24, 26–36. https://doi.org/10.1016/j.learninstruc.2012.08.001
- Kansup, W., & Hakstian, A. R. (1975). A comparison of several methods of assessing partial knowledge in multiple-choice tests: I. Scoring procedures. Journal of Educational Measurement, 12, 219–230.
- Koretsky, M. D., Brooks, B. J., & Higgins, A. Z. (2016). Written justifications to multiple-choice concept questions during active learning in class. International Journal of Science Education, 38(11), 1747–1765. https://doi.org/10.1080/09500693.2016.1214303
- Lesage, E., Valcke, M., & Sabbe, E. (2013). Scoring methods for multiple choice assessment in higher education–Is it still a matter of number right scoring or negative marking? Studies in Educational Evaluation, 39(3), 188–193. https://doi.org/10.1016/j.stueduc.2013.07.001
- Linacre, J. (2002). Judging debacle in pairs figure skating. Rasch Measurement Transactions, 15(4), 839–840. https://www.rasch.org/rmt/rmt154a.htm
- Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272
- Minstrell, J. (1992). Facets of students’ knowledge and relevant instruction. Research in Physics Learning: Theoretical Issues and Empirical Studies, 110–128.
- Mortaz Hejri, S., Khabaz Mafinezhad, M., & Jalili, M. (2014). Guessing in multiple choice questions: Challenges and strategies. Iranian Journal of Medical Education, 14(7), 594–604. https://www.sid.ir/en/journal/ViewPaper.aspx?id=428372
- National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press.
- Nedelsky, L. (1954). Ability to avoid gross error as a measure of achievement. Educational and Psychological Measurement, 14(3), 459–472. https://doi.org/10.1177/001316445401400303
- Neumann, I., Neumann, K., & Nehm, R. (2011). Evaluating instrument quality in science education: Rasch-based analyses of a nature of science test. International Journal of Science Education, 33(10), 1373–1405.
- Neumann, K., Viering, T., Boone, W. J., & Fischer, H. E. (2013). Towards a learning progression of energy. Journal of Research in Science Teaching, 50(2), 162–188. https://doi.org/10.1002/tea.21061
- NGSS Lead States. (2013). Next generation science standards: For states, by states. National Academies Press.
- Oon, P. T., & Fan, X. (2017). Rasch analysis for psychometric improvement of science attitude rating scales. International Journal of Science Education, 39(6), 683–700. https://doi.org/10.1080/09500693.2017.1299951
- Opitz, S. T., Harms, U., Neumann, K., Kowalzik, K., & Frank, A. (2015). Students’ energy concepts at the transition between primary and secondary school. Research in Science Education, 45(5), 691–715. https://doi.org/10.1007/s11165-014-9444-8
- Opitz, S. T., Neumann, K., Bernholt, S., & Harms, U. (2017). How do students understand energy in biology, chemistry, and physics? Development and validation of an assessment instrument. Eurasia Journal of Mathematics, Science and Technology Education, 13(7), 3019–3042. https://doi.org/10.12973/eurasia.2017.00703a
- Patnaik, D., & Traub, R. E. (1973). Differential weighting by judged degree of correctness. Journal of Educational Measurement, 10(4), 281–286.
- Piaget, J. (1966). The child’s conception of physical causality (M. Gabain, Trans.). Littlefield, Adams.
- Rasch, G. (1961, June). On general laws and the meaning of measurement in psychology. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 4, 321–333.
- Romine, W. L., Barrow, L. H., & Folk, W. R. (2013). Exploring secondary students’ knowledge and misconceptions about influenza: Development, validation, and implementation of a multiple-choice influenza knowledge scale. International Journal of Science Education, 35(11), 1874–1901. https://doi.org/10.1080/09500693.2013.778439
- Romine, W. L., Schaffer, D. L., & Barrow, L. (2015). Development and application of a novel Rasch-based methodology for evaluating multi-tiered assessment instruments: Validation and utilization of an undergraduate diagnostic test of the water cycle. International Journal of Science Education, 37(16), 2740–2768.
- Ruiz-Primo, M. A., Zhai, X., Li, M., Hernandez, D., Kanopka, K., Dong, D., & Minstrell, J. (2019, April). Contextualised science assessments: Addressing the use of information and generalisation of inferences of students’ performance. Paper presented at the annual meeting of the American Educational Research Association, Toronto, Canada.
- Sadler, P. M. (1998). Psychometric models of student conceptions in science: Reconciling qualitative studies and distractor-driven assessment instruments. Journal of Research in Science Teaching, 35(3), 265–296.
- Shuford, E. H., Albert, A., & Massengill, H. E. (1966). Admissible probability measurement procedures. Psychometrika, 31(2), 125–145.
- Slepkov, A. D., & Godfrey, A. T. K. (2019). Partial credit in answer-until-correct multiple-choice tests deployed in a classroom setting. Applied Measurement in Education, 32(2), 138–150. https://doi.org/10.1080/08957347.2019.1577249
- Slepkov, A. D., Vreugdenhil, A. J., & Shiell, R. C. (2016). Score increase and partial-credit validity when administering multiple-choice tests using an answer-until-correct format. Journal of Chemical Education, 93(11), 1839–1846. https://doi.org/10.1021/acs.jchemed.6b00028
- Socan, G. (2015). Empirical option weights for multiple-choice items: Interactions with item properties and testing design. Advances in Methodology & Statistics / Metodoloski Zvezki, 12(1), 25–43.
- Toffoli, S. F. L., de Andrade, D. F., & Bornia, A. C. (2016). Evaluation of open items using the many-facet Rasch model. Journal of Applied Statistics, 43(2), 299–316. https://doi.org/10.1080/02664763.2015.1049938
- Woitkowski, D. (2020). Tracing physics content knowledge gains using content complexity levels. International Journal of Science Education, 42(10), 1585–1608. https://doi.org/10.1080/09500693.2020.1772520
- Wright, B. D., & Stone, M. H. (1979). Best test design. MESA Press.
- Wu, M. L., Adams, R., Wilson, M., & Haldane, S. (2007). ACER ConQuest version 2.0 [Computer software]. Australian Council for Educational Research.
- Zhai, X., Haudek, K. C., Stuhlsatz, M. A., & Wilson, C. (2020). Evaluation of construct-irrelevant variance yielded by machine and human scoring of a science teacher PCK constructed response assessment. Studies in Educational Evaluation, 67, 100916. https://doi.org/10.1016/j.stueduc.2020.100916
- Zhai, X., Li, M., & Guo, Y. (2018). Teachers' use of learning progression-based formative assessment to inform teachers' instructional adjustment: A case study of two physics teachers' instruction. International Journal of Science Education, 40(15), 1832–1856. https://doi.org/10.1080/09500693.2018.1512772