The Challenges of Large‐Scale, Web‐Based Language Datasets: Word Length and Predictability Revisited

Cognitive Science 45 (6):e12983 (2021)

Abstract

Language research has come to rely heavily on large‐scale, web‐based datasets. These datasets can present significant methodological challenges, requiring researchers to make a number of decisions about how they are collected, represented, and analyzed. These decisions often concern long‐standing challenges in corpus‐based language research, including determining what counts as a word, deciding which words should be analyzed, and matching sets of words across languages. We illustrate these challenges by revisiting “Word lengths are optimized for efficient communication” (Piantadosi, Tily, & Gibson, 2011), which found that word lengths in 11 languages are more strongly correlated with their average predictability (or average information content) than their frequency. Using what we argue to be best practices for large‐scale corpus analyses, we find significantly attenuated support for this result and demonstrate that a stronger relationship obtains between word frequency and length for a majority of the languages in the sample. We consider the implications of the results for language research more broadly and provide several recommendations to researchers regarding best practices.
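
The two predictors compared in the abstract can be illustrated with a minimal sketch. It assumes a pre-tokenized corpus, an unsmoothed bigram model as the context for "average information content", and Spearman rank correlation as the measure of association. The toy sentence, the bigram context, the choice of Spearman correlation, and all function names are illustrative assumptions, not the pipeline used in either study.

```python
import math
from collections import Counter, defaultdict


def unigram_surprisal(tokens):
    """Negative log2 relative frequency of each word type (frequency predictor)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: -math.log2(c / total) for w, c in counts.items()}


def mean_bigram_surprisal(tokens):
    """Average of -log2 P(w | previous word) over every occurrence of w.

    Assumption: an unsmoothed bigram model stands in for the context model.
    """
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    sums = defaultdict(float)
    occurrences = Counter()
    for (prev, w), n in bigram_counts.items():
        p = n / context_counts[prev]
        sums[w] += n * -math.log2(p)
        occurrences[w] += n
    return {w: sums[w] / occurrences[w] for w in sums}


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy corpus (hypothetical); real analyses use web-scale n-gram counts.
tokens = "the cat sat on the mat and the dog sat on the sofa".split()
freq = unigram_surprisal(tokens)
info = mean_bigram_surprisal(tokens)
words = sorted(info)  # word types that occur with a preceding context
lengths = [len(w) for w in words]
rho_frequency = spearman(lengths, [freq[w] for w in words])
rho_information = spearman(lengths, [info[w] for w in words])
print(f"length vs. frequency-based surprisal: {rho_frequency:.3f}")
print(f"length vs. average information content: {rho_information:.3f}")
```

The choices the abstract highlights (what counts as a word, which words are analyzed, and how word sets are matched across languages) all enter this sketch through `tokens` and `words`, so changing the tokenization or the vocabulary filter changes both correlations.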

Author's Profile

Tom Griffiths
Aarhus University