Rater characteristics, response content, and scoring contexts: Decomposing the determinants of scoring accuracy

Frontiers in Psychology 13 (2022)

Abstract

Raters may introduce construct-irrelevant variance when evaluating written responses to performance assessments, threatening the validity of students’ scores. Numerous factors in the rating process, including the content of students’ responses, the characteristics of raters, and the context in which the scoring occurs, are thought to influence the quality of raters’ scores. Despite considerable study of rater effects, little research has examined the relative impacts of the factors that influence rater accuracy. In practice, such integrated examinations are needed to afford evidence-based decisions about rater selection, training, and feedback. This study provides the first naturalistic, integrated examination of rater accuracy in a large-scale assessment program. Leveraging rater monitoring data from an English language arts summative assessment program, I specified cross-classified, multilevel models via Bayesian estimation to decompose the impact of response content, rater characteristics, and scoring contexts on rater accuracy. Results showed relatively little variation in accuracy attributable to teams, items, and raters. Raters did not collectively exhibit differential accuracy over time, though there was significant variation in individual raters’ scoring accuracy from response to response and day to day. I found considerable variation in accuracy across responses, which was in part explained by text features and other measures of response content that influenced scoring difficulty. Some text features differentially influenced the difficulty of scoring research and writing content. Multiple measures of raters’ qualification performance predicted their scoring accuracy, but general rater background characteristics including experience and education did not. Site-based and remote raters demonstrated comparable accuracy, while evening-shift raters were slightly less accurate, on average, than day-shift raters. This naturalistic, integrated examination of rater accuracy extends previous research and provides implications for rater recruitment, training, monitoring, and feedback to improve human evaluation of written responses.
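The abstract describes cross-classified multilevel models estimated with Bayesian methods to partition scoring-accuracy variance across responses, raters, items, and teams. The sketch below is a minimal illustration of that general modeling approach, not the author's code: it assumes a binary accuracy outcome (agreement with a validity score), uses simulated, hypothetical grouping indices, and estimates crossed random intercepts with PyMC.

```python
# Minimal sketch of a cross-classified (crossed random effects) logistic model
# for rater accuracy. All data below are simulated placeholders; variable names
# and structure are illustrative assumptions, not the study's actual dataset.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n_obs, n_raters, n_items, n_teams = 500, 40, 10, 5
rater_idx = rng.integers(n_raters, size=n_obs)   # which rater scored the response
item_idx = rng.integers(n_items, size=n_obs)     # which item the response answers
team_idx = rng.integers(n_teams, size=n_obs)     # which scoring team the rater belongs to
accuracy = rng.integers(0, 2, size=n_obs)        # 1 = score matches the validity score

with pm.Model() as model:
    # Crossed (non-nested) random intercepts for raters, items, and teams,
    # each with its own variance component.
    sd_rater = pm.HalfNormal("sd_rater", 1.0)
    sd_item = pm.HalfNormal("sd_item", 1.0)
    sd_team = pm.HalfNormal("sd_team", 1.0)
    u_rater = pm.Normal("u_rater", 0.0, sd_rater, shape=n_raters)
    u_item = pm.Normal("u_item", 0.0, sd_item, shape=n_items)
    u_team = pm.Normal("u_team", 0.0, sd_team, shape=n_teams)

    intercept = pm.Normal("intercept", 0.0, 1.5)
    logit_p = intercept + u_rater[rater_idx] + u_item[item_idx] + u_team[team_idx]

    pm.Bernoulli("y", logit_p=logit_p, observed=accuracy)
    idata = pm.sample(1000, tune=1000, target_accept=0.9)
```

Predictors such as text features of the response, rater qualification scores, or scoring shift would enter this kind of model as fixed effects added to the linear predictor; the relative sizes of the estimated variance components indicate how much accuracy variation is attributable to raters, items, and teams.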

Similar books and articles

Accuracy, Verisimilitude, and Scoring Rules. Jeffrey Dunn - 2019 - Australasian Journal of Philosophy 97 (1):151-166.
Refinement: Measuring informativeness of ratings in the absence of a gold standard. Sheridan Grant, Marina Meilă, Elena Erosheva & Carole Lee - 2022 - British Journal of Mathematical and Statistical Psychology 75 (3):593-615.
