What tools for corpus analysis have been developed, and what kinds of analyses do they enable?
The single most important tool available to the corpus linguist is the concordancer. A concordancer allows us to search a corpus and retrieve from it a specific sequence of characters of any length: a word, part of a word, or a phrase. The results are then displayed, typically one example per line, with the context before and after each example clearly visible.
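As a minimal illustration of what a concordancer does, the sketch below produces a key-word-in-context (KWIC) display in Python. The function name, context width, and fixed-width formatting are illustrative choices, not those of any particular tool.

```python
import re

def concordance(text, query, width=30):
    """Return KWIC lines: each match of `query`, with `width`
    characters of context on either side, aligned on the match."""
    lines = []
    for m in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines

sample = ("The corpus linguist searches the corpus with a concordancer; "
          "each corpus hit is shown in context.")
for line in concordance(sample, "corpus"):
    print(line)
```

Real concordancers work over indexed corpora of millions of words rather than a string in memory, but the output format is essentially this one.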
The appearance of concordances does vary between different tools.
As well as concordances, three other functions are available in most modern corpus search tools:
- Frequency lists — the ability to generate comprehensive lists of words or annotations (tags) in a corpus, ordered either by frequency or alphabetically
- Collocations — statistical calculation of the words or tags that most typically co-occur with the node word you have searched for
- Keywords (or key tags) — lists of items which are unusually frequent in the corpus or text you are investigating, in comparison to a reference corpus; like collocation, calculated with statistical tests
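To make the keyword function concrete, here is a sketch of keyword extraction using the log-likelihood statistic (Dunning's G2), one of the statistical tests commonly used for this purpose; actual tools may use other measures, and the function name and cut-off are my own.

```python
import math
from collections import Counter

def keywords(study_tokens, ref_tokens, top=5):
    """Rank words by log-likelihood (G2): words unusually frequent
    in the study corpus relative to the reference corpus."""
    sf, rf = Counter(study_tokens), Counter(ref_tokens)
    sn, rn = len(study_tokens), len(ref_tokens)
    scores = {}
    for w, a in sf.items():
        b = rf.get(w, 0)
        e1 = sn * (a + b) / (sn + rn)   # expected frequency in study corpus
        e2 = rn * (a + b) / (sn + rn)   # expected frequency in reference corpus
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        if a / sn > b / rn:             # keep only positively key words
            scores[w] = g2
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

study = "the cat sat on the mat the cat purred".split()
ref = "the dog ran in the park the dog barked".split()
print(keywords(study, ref))
```

Collocation measures such as mutual information work on the same principle, except that the comparison is between a word's frequency near the node word and its frequency in the corpus as a whole.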
These four basic tools for exploring a corpus are powerful aids to the linguist. But crucially, they also limit and define what we can do with a corpus — we cannot easily answer research questions which our analysis software is ill-suited for.
On the other hand, it is for practical reasons impossible to avoid using these tools. Analysing very large corpora without computers is a pseudo-procedure (Abercrombie 1965) — a method that is useful in principle but which is, for all practical purposes, impossible. It is simply too time-consuming. Fundamentally, the corpus-based approach to language cannot do without powerful searching software.
The history of corpus analysis tools
The very earliest tools for corpus analysis were created by Roberto Busa, who produced the first automatically generated concordances in 1951. Busa's work would lead to what we will term first-generation concordancers.
First-generation concordancers were typically held on a mainframe computer and used at a single site, such as the CLOC (Reed 1978) concordancer used at the University of Birmingham. Individual research teams would build their own concordancer and use it on the data they had access to locally. These tools typically did no more than provide a straightforward concordance. Any further analysis was done by separate programs.
Second-generation concordancers were a product of the rise of the personal computer in the 1980s, enabled by the spread of machines of one type in particular across the planet. Unlike the first generation, they were designed to be installed and used on the analyst's own machine. This was important in two ways. Firstly, it meant that a lot of effort that had gone into reinventing the wheel could now be directed towards producing better tools. Secondly, it had a democratising effect. Up to this point corpus linguists typically needed to work in a team which included a computer scientist who was prepared to do whatever programming was needed on the local mainframe. With PC-based concordancing, any linguist who was able to switch on and use a PC could use corpora and apply corpus techniques to their own data.
The third generation of concordance software also runs mostly on PCs; it includes such well-known systems as WordSmith (Scott 1996), MonoConc (Barlow 2000), AntConc (Anthony 2005), and Xaira. Compared to the second generation, these concordancers are able to deal with large data sets on the PC (the hundred-million-word BNC is packaged with Xaira). Moreover, they include a wider range of tools than were previously available. Finally, they effectively support a range of writing systems.
The defining feature of fourth-generation concordancers is that they do not run on the user's own PC — instead, they are accessed via a web browser and actually run on a web server. These concordancers were created to address three issues:
- the limited power of desktop PCs — desktop PCs can take a very long time to search the largest corpora (hundreds of millions of words), but by allowing access across the web to a powerful server, the fourth-generation tools have effectively decoupled local processing power from corpus searching
- problems arising from differing PC operating systems — second- and third-generation concordancers are often limited to either Windows or Unix, but because they are available across the web, fourth-generation systems are instantly available to users on any operating system
- legal restrictions on the distribution of corpora — we'll return to this issue in part 3.
Fourth-generation concordancers also allow corpus builders to make their work available immediately, and via a piece of software (the web browser) that all computer users are already familiar with. This avoids investing a lot of effort in the distribution of a corpus on disks or via download.
Most fourth-generation corpus analysis tools began as websites allowing users to search one specific corpus. But many of them have grown into generalisable systems. The most widely used are corpus.byu.edu (Davies 2005), Wmatrix (Rayson 2008), SketchEngine (Kilgarriff et al. 2004), and BNCweb (Hoffmann et al. 2008) and its clone CQPweb (Hardie forthcoming).
This page was last modified on Monday 31 October 2011 at 12:45 am.