by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

Mode of communication

Corpora may encode language produced in any mode of communication: for example, there are corpora of spoken language and corpora of written language. Many corpora, such as the British National Corpus (BNC), contain data from more than one mode.

Written corpora

Corpora representing written language usually present the smallest technical challenge to build, since much data already exists in electronic format (e.g. on the web). Until recently, encoding writing systems other than the Roman alphabet was prone to error (Baker et al. 2000). However, with the advent of Unicode, this problem is being consigned to history. Written corpora can still be time-consuming to produce when the materials have to be scanned or typed from printed or handwritten original documents. But in general, the construction of written corpora has never been easier.
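To illustrate why Unicode helps here, the following Python sketch checks that text in several writing systems survives a round trip through UTF-8, and shows the kind of garbling that results when the same bytes are read with a legacy single-byte encoding. The sample words and the helper name round_trips are illustrative assumptions, not taken from any particular corpus or tool.

```python
import unicodedata

# Illustrative sample words (assumed for this sketch, not corpus data).
samples = {"Hindi": "भाषा", "Arabic": "لغة", "Chinese": "语言"}


def round_trips(text, encoding="utf-8"):
    """Return True if text is unchanged after encoding and decoding."""
    return text.encode(encoding).decode(encoding) == text


for language, word in samples.items():
    # Every character has its own Unicode code point and official name.
    first_char_name = unicodedata.name(word[0])
    print(language, word, round_trips(word), first_char_name)

# Reading UTF-8 bytes as if they were Latin-1 silently produces garbage,
# which is the sort of error that mixed legacy encodings made common.
garbled = samples["Hindi"].encode("utf-8").decode("latin-1")
print(garbled == samples["Hindi"])
```

In practice, this means a corpus builder who stores every file in a single Unicode encoding, and declares that encoding when reading the files, avoids the ambiguity that arises when each script has its own incompatible encoding.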

Spoken corpora

Spoken corpus data is typically produced by recording interactions and then transcribing them. These transcriptions may be linked back systematically to the original recording through a process called time-alignment so that concordance results can be connected to the correct location in the sound file. This is possible, for example, with the COLT corpus of London teenage speech (Stenström et al. 2002) and the International Corpus of English British component (ICE-GB). Orthographically transcribed material is rarely a reliable source of evidence for research into variation in pronunciation; phonemically transcribed material is of much more use in this respect.
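The idea of time-alignment can be sketched in a few lines of Python. In this minimal example, each transcribed utterance carries the start and end time of the stretch of recording it corresponds to, so every concordance hit can be reported together with its position in the sound file. The Utterance structure, the concordance function and the sample transcript are all hypothetical; real corpora such as COLT or ICE-GB use their own formats and software.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str     # orthographic transcription


def concordance(utterances, keyword, width=3):
    """Return keyword-in-context hits, each with its utterance's time span."""
    hits = []
    for utt in utterances:
        tokens = utt.text.split()
        for i, token in enumerate(tokens):
            # Case-insensitive match, ignoring trailing punctuation.
            if token.lower().strip(".,?!") == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((utt.start, utt.end, utt.speaker, left, token, right))
    return hits


# Invented sample transcript, for illustration only.
transcript = [
    Utterance("A", 0.0, 2.4, "well I went to the shop"),
    Utterance("B", 2.4, 4.1, "which shop did you go to"),
    Utterance("A", 4.1, 6.0, "the one on the corner"),
]

hits = concordance(transcript, "shop")
for start, end, speaker, left, keyword, right in hits:
    print(f"[{start:.1f}-{end:.1f}s] {speaker}: {left} [{keyword}] {right}")
```

Here the alignment is at the level of the utterance; finer-grained alignment (for instance, per word) would follow the same principle, with a time span stored for each smaller unit.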

Other modes of communication

Corpora which include gesture, either as the primary channel for language (as in sign language corpora) or as a means of communication parallel to speech, are relatively new. Corpus linguistic studies focusing on the visual medium are only just beginning to be undertaken on a truly large scale, for example investigating the relationship between gesture and speech (Carter and Adolphs 2008), or constructing large corpora of sign language material (Johnston and Schembri 2006).


This page was last modified on Thursday 26 May 2011 at 4:45 am.

Tony McEnery and Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom