Abstract
Evidence from numerous studies using the visual world paradigm has revealed both that spoken language can rapidly guide attention in a related visual scene and that scene information can immediately influence comprehension processes. These findings motivated the coordinated interplay account (CIA; Knoeferle & Crocker, 2006) of situated comprehension, which claims that utterance‐mediated attention crucially underlies this closely coordinated interaction of language and scene processing. We present a recurrent sigma‐pi neural network that models the rapid use of scene information, exploiting an utterance‐mediated attentional mechanism that directly instantiates the CIA. The model is shown to achieve high levels of performance (both with and without scene contexts) while also exhibiting hallmark behaviors of situated comprehension, such as incremental processing, anticipation of appropriate role fillers, and the immediate use, and priority, of depicted event information through the coordinated use of utterance‐mediated attention to the scene.