by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

English Corpus Linguistics at Lancaster University

It is unlikely that corpus linguistics would have developed at Lancaster University if Geoffrey Leech had not moved there from University College London. At Lancaster, Leech extended the early work of the UCL team in three important ways. Firstly, he began to work with computer scientists to see how computer applications might help with corpus building. Secondly, he began to extend the range of corpus annotation. Finally, thanks largely to his collaboration with computer scientists, he found that he was able to build much larger corpora than had been possible at the Survey. The continuation of Leech's work on corpus annotation became perhaps Lancaster's most distinctive contribution to corpus linguistics.

Prominent among the annotation systems developed at Lancaster was the first viable part-of-speech tagging program: CLAWS (the Constituent Likelihood Automatic Word-tagging System; Garside et al. 1987). This was the first fully automated part-of-speech annotation system that worked well across a range of genres (with accuracy of 95% or better in most cases). It led to a fundamental change in corpus building. Rather than taking time to collect a corpus and then even more time to annotate it with basic word-class information, it was now possible to collect a corpus, annotate it while you had your lunch, and then work on the annotated corpus in the afternoon, so to speak. With the advent of automated annotation, the process of corpus building was accelerated immeasurably.

By the early 1990s the Lancaster team had developed and applied many different types of annotation (Garside et al. 1997), covering not only written corpora but also spoken corpora (Knowles 1993). The table below summarises the range of annotations developed at Lancaster.

Type of annotation  | Reference for annotation process             | Example of corpus annotated
Part-of-speech      | Garside et al. (1987)                        | Numerous, including LOB and the BNC
Prosodic            | Knowles (1993)                               | Spoken English Corpus
Parsing             | Sampson (1987)                               | Lancaster-Leeds Treebank
Semantic            | Wilson and Rayson (1993)                     | The Market Research Corpus
Anaphoric reference | Fligelstone (1992)                           | AP Newswire corpus
Literary stylistic  | Short et al. (1996), Semino and Short (2004) | Speech, Writing and Thought Presentation Corpus
Pragmatic           | Archer and Culpeper (2003)                   | A sub-part of the Corpus of English Dialogues 1560-1760


This page was last modified on Monday 16 April 2012 at 3:33 pm.

Tony McEnery Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom