Make the grading of university courses more reliable
If you want to grade exams for a large testing company, it takes a lot of work. Prospective raters for an educational testing service, for example, undergo system and content training and must pass at least one content certification test. Next, prospective graders score several practice tests that have already been graded by established raters.
If, on the other hand, you would like to grade exams for an undergraduate class, the standards are much lower. Often, grading is left to graduate students or high-performing undergraduates with little more than the most basic grading guidelines. In fact, apart from a subset of educational researchers and psychologists, most faculty lack expertise in psychometrics, effective rubric design, or best assessment practices. The reality is that there is very little formal training for instructors, let alone for teaching assistants, on how to create effective assessment instruments. Therefore, while we would like university grades to be reliable and valid indicators of student achievement, in practice grades often contain a considerable amount of noise, especially in more subjective areas.
Wisdom of crowds versus experts
While the grade noise problem can be addressed through rigorous training and calibration methods, a more resource-efficient approach has been identified by massive open online course (MOOC) providers looking for a scalable way to assess thousands (or tens of thousands) of students. Some MOOC providers ask students to rate each other's work, then rely on the wisdom of crowds to assign a grade. According to wisdom-of-crowds research, the collective judgments of several uninformed individuals can be as accurate as, if not more accurate than, those of a single expert. Some people guess too high, others too low, and the noise cancels out, leaving only the signal.
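The noise-cancelling idea can be illustrated with a small simulation. This is only a sketch under strong assumptions that are not from the research described here: each rater's score is modeled as a hypothetical "true" score plus independent, unbiased Gaussian noise, and the function names and parameter values are invented for illustration. Real raters may share biases that averaging cannot remove.

```python
import random
import statistics


def crowd_grade(true_score, n_raters, noise_sd, rng):
    """Average the scores of n_raters simulated raters.

    Assumption: each rater reports the true score plus independent,
    zero-mean Gaussian noise with standard deviation noise_sd.
    """
    ratings = [true_score + rng.gauss(0, noise_sd) for _ in range(n_raters)]
    return statistics.mean(ratings)


def mean_abs_error(n_raters, trials=2000, true_score=6.0, noise_sd=1.5, seed=0):
    """Average distance between the crowd grade and the true score."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    errors = [
        abs(crowd_grade(true_score, n_raters, noise_sd, rng) - true_score)
        for _ in range(trials)
    ]
    return statistics.mean(errors)


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"{n:>2} rater(s): mean absolute error = {mean_abs_error(n):.2f}")
```

Under these assumptions the error shrinks as more raters are averaged, which is the effect the crowd-based approach relies on. If raters tend to err in the same direction, the shared error does not cancel.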
But how well does this work for grading?
To investigate this question, we asked graduate students to score essays written for a MOOC in their field of study and compared their scores to a crowd wisdom-based grading strategy (averaging the scores given by at least four of the student's peers). The results were both promising and disturbing: we were encouraged to find that the grades awarded by the crowds were not significantly worse than those assigned by the experts. However, digging into the data, we found that the reason the scores were so similar was that the expert scores were, in many cases, as inconsistent as the crowd scores. In fact, pairs of experts agreed on an essay's score only 20 percent of the time. For almost 30 percent of the essays, the scores differed by three points or more on a nine-point scale: the difference between receiving a B+ on an exam and failing! In addition, in several cases the same expert read an essay two or three times and gave it different scores. Perhaps most disturbing was that the factors we thought should most strongly predict grades (such as the accuracy of the essay's content) had very little influence on final grades, leaving us unsure of what the experts based their scores on.
If the grades awarded to students in college classes differ considerably depending on the grader, are they valid indicators of student success? As a feedback tool, grades are only useful to the extent that they are accurate. More concerning, however, is how a bad grade could cause a student to drop out of a course and/or harm their chances of getting a job or being accepted into graduate school.
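The two agreement figures quoted above (how often two experts give the same score, and how often they differ by three or more points) are simple to compute for any set of double-graded essays. The sketch below shows one way to do it; the function name and the score pairs are invented for illustration and are not the study's data or analysis code.

```python
def agreement_summary(score_pairs, large_gap=3):
    """Summarize how often two graders agree on the same essays.

    score_pairs: list of (score_a, score_b) tuples, one tuple per essay.
    large_gap:   difference (in scale points) treated as a large disagreement.
    """
    if not score_pairs:
        raise ValueError("need at least one pair of scores")
    n = len(score_pairs)
    exact = sum(1 for a, b in score_pairs if a == b)
    large = sum(1 for a, b in score_pairs if abs(a - b) >= large_gap)
    return {
        "exact_agreement": exact / n,
        "large_disagreement": large / n,
    }


if __name__ == "__main__":
    # Made-up scores on a nine-point scale, for illustration only.
    pairs = [(5, 5), (7, 4), (6, 5), (3, 8), (9, 8)]
    print(agreement_summary(pairs))
```

Exact agreement is a blunt measure, since it treats a one-point gap the same as a five-point gap, which is why it helps to report the share of large disagreements alongside it.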
Resources to improve assessment practices
So how do you fix the problem? First, although there are best practices in assessment and psychometrics, few instructors are aware of them or are knowledgeable enough to implement them. We should publicize resources that help faculty adopt better rubrics and assessment practices. Indeed, there are dozens of websites dedicated to improved assessment, and most university teaching centers have specialists available to consult with professors. Often the most immediate problem is making instructors realize that they need these resources in the first place.
Second, it's important to recognize that our judgment can be affected by seemingly irrelevant factors such as the weather, our hunger level, or even the time of day. We can improve consistency and accuracy by engaging in benchmarking exercises: having multiple graders read and grade the same small sample of exams each day before grading begins. This can ensure that the graders use the same standards and are aligned with each other. It can also help identify individual graders who deviate significantly from their peers and/or from basic standards (and who may therefore need additional training).
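One simple way to spot a grader who is out of line in such an exercise is to compare each grader's scores on the shared sample with the average of everyone else's. The sketch below is an illustration only: the function name, the 1.5-point threshold, and the scores are assumptions chosen for the example, not an established standard.

```python
from statistics import mean


def flag_outlier_graders(scores, threshold=1.5):
    """Flag graders whose scores run well above or below their peers.

    scores:    dict mapping grader name -> list of scores, where every
               grader scored the same exams in the same order.
    threshold: average gap (in scale points) from the other graders'
               mean that triggers a flag (an arbitrary illustrative value).
    Returns a dict mapping each flagged grader to their average gap.
    """
    graders = list(scores)
    if len(graders) < 2:
        raise ValueError("need at least two graders to compare")
    n_exams = len(scores[graders[0]])
    if any(len(s) != n_exams for s in scores.values()):
        raise ValueError("every grader must score the same exams")

    flagged = {}
    for grader in graders:
        gaps = []
        for i in range(n_exams):
            # Compare against the other graders only, so that a grader's
            # own score does not pull the reference point toward them.
            peers = [scores[g][i] for g in graders if g != grader]
            gaps.append(scores[grader][i] - mean(peers))
        average_gap = mean(gaps)
        if abs(average_gap) > threshold:
            flagged[grader] = average_gap
    return flagged


if __name__ == "__main__":
    # Made-up calibration scores on a nine-point scale.
    sample = {
        "A": [6, 7, 5, 8],
        "B": [6, 6, 5, 7],
        "C": [5, 7, 6, 8],
        "D": [8, 9, 8, 9],
    }
    print(flag_outlier_graders(sample))
```

This check only catches graders who are consistently harsher or more lenient than their peers. A grader who is erratic in both directions would need a different measure, such as the spread of their gaps.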
Finally, while we found that the wisdom of crowds among novices does not completely eliminate the problem of noise in grading, a large body of literature has shown that averaging the scores of two (or more) independent assessments is better than doing nothing. Indeed, even if there is only one rater available, asking that rater to give several estimates and averaging them (a process called dialectical bootstrapping) can lead to improvements.
However it is done, universities need to pay more attention to the problem of noise in grading. This will help us ensure that students receive the specific feedback they need to learn and grow in the classroom. Additionally, since grades are such important determinants of socio-economic outcomes, reducing grade noise can help us reduce the likelihood that we will further contribute to injustice in society.
Paige Tsai is a PhD candidate in Technology and Operations Management at Harvard Business School. She is interested in the judgments and decisions of people in organizations. Danny Oppenheimer is a jointly appointed professor of psychology and decision science at Carnegie Mellon University. He studies judgment, decision making, metacognition, learning, and causal reasoning, and applies his findings to areas such as charitable giving, consumer behavior, and how to get students to buy him ice cream.