by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

Concordancing tools

Introducing concordances

What tools for corpus analysis have been developed, and what kinds of analyses do they enable?

The single most important tool available to the corpus linguist is the concordancer. A concordancer allows us to search a corpus and retrieve from it a specific sequence of characters of any length — perhaps a word, part of a word, or a phrase. The results are then displayed, typically in one-example-per-line format, as an output in which the context before and after each example can be clearly seen.

The appearance of concordances varies somewhat between different tools: four popular concordancers will display the same search results in four visibly different layouts.

As well as concordances, three other functions are available in most modern corpus search tools.

These four basic tools for exploring a corpus are powerful aids to the linguist. But crucially, they also limit and define what we can do with a corpus — we cannot easily answer research questions which our analysis software is ill-suited for.

On the other hand, it is for practical reasons impossible to avoid using these tools. Analysing very large corpora without computers is a pseudo-procedure (Abercrombie 1965) — a method that is useful in principle but which is, for all practical purposes, impossible. It is simply too time-consuming. Fundamentally, the corpus-based approach to language cannot do without powerful searching software.

The history of corpus analysis tools

The very earliest tools for corpus analysis were created by Roberto Busa, who built the first automatic concordances in 1951. Busa's work would lead to what we will term first-generation concordancers.

First-generation concordancers

First-generation concordancers were typically held on a mainframe computer and used at a single site, such as the CLOC (Reed 1978) concordancer used at the University of Birmingham. Individual research teams would build their own concordancer and use it on the data they had access to locally. These tools typically did no more than provide a straightforward concordance. Any further analysis was done by separate programs.

Second-generation concordancers

Second-generation concordancers were a result of the rise of the personal computer in the 1980s, which saw machines of one type in particular spread across the planet. Unlike the first generation, they were designed to be installed and used on the analyst's own machine. This was important in two ways. Firstly, it meant that much of the effort that had previously gone into reinventing the wheel could now be directed towards producing better tools. Secondly, it had a democratising effect. Up to this point, corpus linguists typically needed to work in a team which included a computer scientist who was prepared to do whatever programming was needed on the local mainframe. With PC-based concordancing, any linguist who was able to switch on and use a PC could use corpora and apply corpus techniques to their own data.

Third-generation concordancers

The third generation of concordance software also runs mostly on PCs; it includes such well-known systems as WordSmith (Scott 1996), MonoConc (Barlow 2000), AntConc (Anthony 2005), and Xaira. Compared to the second generation, these concordancers are able to deal with large data sets on the PC (the hundred-million-word BNC is packaged with Xaira). Moreover, they include a wider range of tools than were previously available. Finally, they effectively support a range of writing systems.

Fourth-generation concordancers

The defining feature of fourth-generation concordancers is that they do not run on the user's own PC — instead, they are accessed via a web browser and actually run on a web server. These concordancers were created to address three issues.

Fourth-generation concordancers also allow corpus builders to make their work available immediately, and via a piece of software (the web browser) that all computer users are already familiar with. This avoids investing a lot of effort in the distribution of a corpus on disks or via download.

Most fourth-generation corpus analysis tools began as websites allowing users to search one specific corpus, but many of them have grown into generalisable systems. The most widely used include the system of Davies (2005), Wmatrix (Rayson 2008), the Sketch Engine (Kilgarriff et al. 2004), and BNCweb (Hoffmann et al. 2008) and its clone CQPweb (Hardie forthcoming).


This page was last modified on Monday 31 October 2011 at 12:45 am.

Tony McEnery Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom