Core values underlying my research are interdisciplinarity, grounding the study of meaning outside of language, situatedness, and synthesis of experimental and observational evidence.

  • Interdisciplinarity: I am committed to engaging with humanities and social science disciplines outside of linguistics. Modern computation, natural language processing, and corpora offer unique opportunities for mass-scale observational research into phenomena that previously would have been studied only experimentally, via surveys, or with qualitative methods. Because human verbal behaviour plays some role within virtually all humanistic disciplines, observational study of verbal behaviour at mass scale with corpora creates opportunities to build connections across research interests. My success at building such bridges is evidenced by my interactions with the discipline of social psychology (Snefjella & Kuperman, 2015; Snefjella, Schmidtke, & Kuperman, 2018).
  • Grounding meaning outside of semantics: I believe that a scientific explanation of semantics must be grounded in entities from outside of semantics (Westbury, 2016). The boundaries between cognition and other mental systems (e.g. perception, the motor system, affect) are more porous than previously believed. A cognitively informed corpus linguistics should reflect the role of these non-semantic entities in the data-generating process of a corpus, particularly mental simulation, affect, and the motor system. Current barriers to grounding corpus-based research relate more to the availability of data than to the availability of putative entities for grounding semantics.
  • Situatedness: Natural human verbal behaviour always has a context. Large corpora provide rich measurements of linguistic context. Contemporary social media corpora offer additional information about context, such as the location where an utterance was made, the individual who made the utterance and their previous language use, genre or domain, position within a connected discourse (e.g. location within a subreddit), indications of popularity or engagement, time of utterance, and more. These facts about context can be leveraged to test specific hypotheses, for example about the relation between distance and abstractness of language use (Snefjella & Kuperman, 2015).
  • Synthesis of experiments and observation: Quantities can be easily derived from corpora, but they are not necessarily meaningful. Experiments using corpus-derived quantities can show their relevance for understanding brain and behaviour. Experiments, however, are expensive to undertake, typically test an unrepresentative population (undergraduate students), have limited or no historical scope, and elicit behavioural or brain responses in an unnatural setting. Corpora are a by-product of natural communication and are not, typically, elicited by experimenters. As with any observational approach, sound inference with corpora is challenging because of the possibility of unmeasured confounders. Experiments can establish causation, but establishing causation with observational evidence is challenging. The two methods are therefore complementary, and both are necessary. Snefjella and Kuperman (2016) provide a demonstration of this principle: the importance of the corpus-linguistic notion of semantic prosody is demonstrated both by measuring it in a corpus and by showing numerous effects of these measurements in single-word processing.

Moving forward, my research agenda will be to diversify the non-semantic entities used to ground meaning, to explore the distributional properties of words related to these entities within corpora, and to understand the implications of these distributional properties for brain and behaviour.



Westbury, C. (2016). Pay no attention to that man behind the curtain. The Mental Lexicon, 11(3), 350–374.