by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

Data collection regimes

Two broad approaches to the issue of choosing what data to collect have emerged: the monitor corpus approach (see Sinclair 1991: 24-6), where the corpus continually expands to include more and more texts over time; and the balanced corpus or sample corpus approach (see Biber 1993 and Leech 2007).

Monitor corpora

A monitor corpus is a dataset which grows in size over time and contains a variety of materials. The relative proportions of different types of materials may vary over time. The Bank of English (BoE), developed at the University of Birmingham, is the best-known example of a monitor corpus. The BoE was started in the 1980s (Hunston 2002: 15) and has expanded since then to well over half a billion words. The BoE represents one approach to the monitor corpus; the Corpus of Contemporary American English (COCA; Davies 2009b) represents another. COCA expands over time like a monitor corpus, yet it does so according to a much more explicit design than the BoE. Each extra section added to COCA conforms to the same fixed breakdown of text varieties. This corpus represents something of a halfway house: a monitor corpus that proceeds according to a sampling frame and regular sampling regime.
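
The idea of a monitor corpus that grows section by section while keeping a fixed breakdown of text varieties can be sketched in code. The following Python fragment is a minimal illustration only: the class name, the variety labels, the word targets and the 5% tolerance are all hypothetical and are not taken from the actual design of COCA or the BoE.

```python
from collections import Counter


class MonitorCorpus:
    """Toy monitor corpus: it grows over time, one section at a time.

    Every new section must match the same per-variety word targets,
    so the overall proportions stay constant as the corpus expands.
    All names and numbers here are hypothetical.
    """

    def __init__(self, section_targets, tolerance=0.05):
        self.section_targets = dict(section_targets)  # variety -> words per section
        self.tolerance = tolerance                    # allowed relative deviation
        self.sections = {}                            # label -> {variety: words}

    def add_section(self, label, word_counts):
        # Reject a section that uses different text varieties.
        if set(word_counts) != set(self.section_targets):
            raise ValueError("section does not use the corpus's text varieties")
        # Reject a section whose sizes stray too far from the targets.
        for variety, target in self.section_targets.items():
            if abs(word_counts[variety] - target) > self.tolerance * target:
                raise ValueError(f"{variety!r} is off target in section {label!r}")
        self.sections[label] = dict(word_counts)

    def totals(self):
        totals = Counter()
        for counts in self.sections.values():
            totals.update(counts)  # adds the word counts variety by variety
        return dict(totals)

    def proportions(self):
        totals = self.totals()
        size = sum(totals.values())
        if size == 0:
            return {}
        return {variety: n / size for variety, n in totals.items()}


# Hypothetical targets: words to collect per variety in each new section.
corpus = MonitorCorpus({"spoken": 1000, "fiction": 1000, "news": 2000})
corpus.add_section("year 1", {"spoken": 1000, "fiction": 1000, "news": 2000})
corpus.add_section("year 2", {"spoken": 1000, "fiction": 1000, "news": 2000})
print(corpus.totals())
print(corpus.proportions())
```

In this sketch the corpus doubles in size between the two sections, but the share of each variety is unchanged. A monitor corpus without such a check would simply accept whatever material became available, and its proportions could drift over time.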

Balanced corpora

In contrast to monitor corpora, balanced corpora, also known as sample corpora, try to represent a particular type of language over a specific span of time. In doing so they seek to be balanced and representative within a particular sampling frame. So, for example, if we want to look at the language of service interactions in shops in the UK in the late 1990s, the sampling frame is clear: we would only accept data into our corpus which represents interactions of this sort. Following the principle of balance, we would try to characterise the range of shops whose language we wanted to sample, and collect data evenly from across that range. We would also have to choose the locations to sample from, with the aim of achieving representativeness for the data in a corpus (see Leech 2007 for a critical exploration of this concept).

A good example of a corpus that seeks balance and representativeness within a given sampling frame is the Lancaster-Oslo/Bergen (LOB) corpus. This represents a ‘snapshot’ of the standard written form of modern British English in the early 1960s across a range of 2,000-word samples. This table shows its sampling frame:

The LOB Corpus Sampling Frame (after Hofland and Johansson 1982: 2)
Category Description Number of text samples in this category
A Press: reportage 44
B Press: editorial 27
C Press: reviews 17
D Religion 17
E Skills, trades and hobbies 38
F Popular lore 44
G Belles lettres, biography, essays 77
H Miscellaneous (government documents, foundation reports, industry reports, college catalogues, industry house organ) 30
J Learned and scientific writings 80
K General fiction 29
L Mystery and detective fiction 24
M Science fiction 6
N Adventure and western fiction 29
P Romance and love story 29
R Humour 9
Total   500
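
The arithmetic behind this sampling frame can be worked through in a few lines of Python. This is a sketch under stated assumptions: the variable and function names are our own, every sample is treated as exactly 2,000 words (the nominal sample length given above), and the count for category H is inferred as the stated total of 500 minus the counts of the other categories.

```python
TOTAL_SAMPLES = 500   # total given in the last row of the table
SAMPLE_WORDS = 2000   # nominal length of each text sample

# Sample counts copied from the table, leaving out category H.
known_counts = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38, "F": 44, "G": 77,
    "J": 80, "K": 29, "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}


def complete_frame(known, total):
    """Add category H, inferred as whatever is left over from the total."""
    frame = dict(known)
    frame["H"] = total - sum(known.values())
    return dict(sorted(frame.items()))


def summarise(frame, sample_words):
    """Share of the corpus and approximate word count for each category."""
    total = sum(frame.values())
    return {
        code: {
            "samples": n,
            "share": n / total,
            "approx_words": n * sample_words,
        }
        for code, n in frame.items()
    }


frame = complete_frame(known_counts, TOTAL_SAMPLES)
summary = summarise(frame, SAMPLE_WORDS)
for code, row in summary.items():
    print(f"{code}: {row['samples']:3d} samples, "
          f"{row['share']:.1%} of samples, ~{row['approx_words']:,} words")
```

On these assumptions the frame as a whole amounts to 500 × 2,000 = 1,000,000 words, and the categories differ widely in weight: learned and scientific writing (J) accounts for 80 of the 500 samples, while science fiction (M) accounts for only 6.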

Opportunistic corpora

There are many corpora that do not necessarily match the description of either a monitor or a sample corpus comfortably. Such corpora are best described as opportunistic corpora. These corpora do not adhere to a rigorous sampling frame. Rather, they represent nothing more nor less than the data that it was possible to gather for a specific task. Sometimes technical restrictions prevent the collection of data to populate an idealised sampling frame. This was particularly common prior to widespread electronic publishing and the web. Today, an opportunistic approach is often needed with spoken data in particular: converting spoken recordings into machine-readable transcriptions is a very time-consuming task.


This page was last modified on Thursday 30 October 2014 at 2:35 pm.

Tony McEnery Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom