by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

Statistics in corpus linguistics

Corpora are an unparalleled source of quantitative data for linguists. So corpus linguists often test or summarise their quantitative findings through statistics. Some other areas of linguistics also frequently appeal to statistical notions and tests. Psycholinguistic experiments, grammatical elicitation tests and survey-based investigations, for example, all commonly involve statistical tests of some sort. However, frequency data are so regularly produced in corpus analysis that most corpus-based studies undertake some form of statistical analysis, even if it is relatively basic and descriptive, e.g. using percentages to describe the data in some way.

Descriptive statistics

Most studies in corpus linguistics use basic descriptive statistics if nothing else. Descriptive statistics are statistics which do not seek to test for significance. Rather, they simply describe the data in some way. The most basic statistical measure is a frequency count, a simple tally of the number of instances of something that occurs in a corpus. For example, there are 1,103 examples of the word Lancaster in the written section of the BNC. We may express this as a percentage of the whole corpus: the BNC's written section contains 87,903,571 words of running text, meaning that the word Lancaster represents about 0.0013% of the total data in the written section of the corpus. The percentage is just another way of looking at the count 1,103 in context, to try to make sense of it relative to the totality of the written corpus. Sometimes, as is the case here, the percentage is too small a number to convey the frequency of the word meaningfully, so we might instead produce a normalised frequency (or relative frequency), which answers the question ‘how often might we assume we will see the word per x words of running text?’ Normalised frequencies are usually given per thousand words or per million words; for Lancaster, the figure is roughly 12.5 instances per million words.
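The arithmetic behind a normalised frequency can be sketched in a few lines of Python. This is only an illustration: the function name normalised_frequency and its base parameter are assumptions made for this example, not part of any corpus tool. The two input figures are the ones quoted above for Lancaster in the written BNC.

```python
def normalised_frequency(count: int, corpus_size: int, base: int = 1_000_000) -> float:
    """Return how often an item occurs per `base` words of running text."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count / corpus_size * base


# Figures quoted in the text for "Lancaster" in the written section of the BNC.
lancaster_count = 1_103
bnc_written_size = 87_903_571

per_thousand = normalised_frequency(lancaster_count, bnc_written_size, base=1_000)
per_million = normalised_frequency(lancaster_count, bnc_written_size)

print(f"{per_thousand:.4f} per thousand words")  # 0.0125 per thousand words
print(f"{per_million:.2f} per million words")    # 12.55 per million words
```

For a word this rare, the per-million figure is the easier one to read, which is why the choice of base usually depends on how frequent the item is.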

A special type of ratio called the type-token ratio is another basic corpus statistic. A token is any instance of a particular wordform in a text. Comparing the number of tokens in the text to the number of types of tokens (where each type is a particular, unique wordform) can tell us how large a range of vocabulary is used in the text. We determine the type-token ratio by dividing the number of types in a corpus by the number of tokens. The result is sometimes multiplied by 100 to express the type-token ratio as a percentage. This allows us to measure vocabulary variation between corpora: the closer the result is to 1 (or 100 if it's a percentage), the greater the vocabulary variation; the further the result is from 1, the less vocabulary variation there is. Since the size of the corpus affects its type-token ratio, only similar-sized corpora can be compared in this way. For corpora that differ in size, a normalising version of the procedure (standardised type-token ratio or STTR) is used instead.
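As a rough sketch, the two measures might be computed as follows. Several things here are assumptions made for illustration: the function names, the naive tokenisation (lower-casing and splitting on whitespace), and the way STTR is standardised, namely by averaging the type-token ratio over consecutive equal-sized chunks and discarding any incomplete final chunk. Real tools may tokenise and standardise differently.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct wordforms (types) divided by number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a type-token ratio for an empty text")
    return len(set(tokens)) / len(tokens)


def standardised_ttr(tokens: list[str], chunk_size: int = 1000) -> float:
    """Mean type-token ratio over consecutive full chunks of `chunk_size` tokens.

    Assumption: an incomplete final chunk is ignored.
    """
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    ratios = [type_token_ratio(tokens[i:i + chunk_size]) for i in starts]
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)


# Toy text; tokenisation is deliberately simplistic.
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.lower().split()

ttr = type_token_ratio(tokens)                # 8 types / 13 tokens
sttr = standardised_ttr(tokens, chunk_size=5)  # mean of 4/5 and 5/5

print(f"TTR  = {ttr:.3f}")   # TTR  = 0.615
print(f"STTR = {sttr:.3f}")  # STTR = 0.900
```

The chunk size of 5 is used only because the toy text is so short; with real corpora a much larger chunk, such as the default of 1,000 tokens assumed here, would be more sensible.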

Beyond descriptive statistics

To better understand the frequency data arising from a corpus, corpus linguists appeal to statistical measures which allow them to test the significance of any differences observed. Most things that we want to measure are subject to a certain amount of “random” fluctuation. We can use significance tests to assess how likely it is that a result like ours could arise as a coincidence, due simply to chance. Typically, if a result as extreme as ours would be expected to arise by chance less than 5% of the time, then we say that the result is significant (this is known as the 95% level). A result which is not significant cannot be relied on, although it may be useful as an indication of where to start doing further research (maybe with a bigger sample of data).

The two most common uses of significance tests in corpus linguistics are calculating keywords (or key tags) and calculating collocations. To extract keywords, we need to test for significance every word that occurs in a corpus, comparing its frequency with that of the same word in a reference corpus. When looking for a word's collocations, we test the significance of the co-occurrence frequency of that word and everything that appears near it once or more in the corpus. Both procedures typically involve, then, many thousands of significance tests being carried out. This is all done behind the scenes in those tools that support keyword and collocation extraction. When we wish to apply significance tests to other quantitative data extracted from a corpus, however, we cannot normally count on the analysis software to handle the details for us; we must carry out the procedure ourselves.

Doing a significance test

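The UCREL log-likelihood wizard, created by Paul Rayson, allows you to perform tests for a significant difference in frequency between two corpora. It is based on four simple figures. Let's assume we are testing a difference between Corpus 1 and Corpus 2 in the frequency of some linguistic phenomenon X. In this case, the figures you need are: the frequency of X in Corpus 1; the frequency of X in Corpus 2; the number of opportunities for X to occur in Corpus 1; and the number of opportunities for X to occur in Corpus 2.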

Very often, we are testing for a difference in the frequency of a word. In this case, the “number of opportunities” is simply the total number of words in the corpus. On the other hand, if we were looking at the frequency of a particular type of sentence (e.g. declarative as opposed to interrogative) then the “number of opportunities” would be the total number of sentences in the corpus, since even if declaratives appear everywhere, there cannot in principle be more declaratives than there are sentences!

Whatever we are testing, however, all figures must be absolute, not normalised, frequencies. The significance test itself takes account of the size of the corpus, so you should never use normalised frequencies as the input data.

When we have our four figures, we can insert them into the following form:

                                                Corpus 1    Corpus 2
Frequency of X (e.g. freq of word)
Total opportunities for X (e.g. corpus size)

Imagine, for example, that you are investigating a word that occurs 52 times in Corpus 1, which has 50,000 tokens in total, but occurs 57 times in Corpus 2, which is 75,000 tokens in size. Obviously, this word is noticeably rarer, in relative terms, in Corpus 2; but is the difference significant?

Enter the figures into the web-form above to conduct the log-likelihood test of significance! Don't include any commas in the numbers you type in.

You should get results that look like this:

Item           O1       %1     O2       %2        LL
Word           52     0.10     57     0.08 +    2.65     

Here's how to interpret this result:

The higher the LL is, the less likely it is that the result is a random fluke. The LL must be above 3.84 for the difference to be significant at the p < 0.05 level (also called the 95% level). So this difference is not statistically significant.
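To see where a figure like 2.65 comes from, the calculation can be sketched in Python. This is a minimal illustration, not the wizard's actual code: it assumes the usual two-term log-likelihood formula, in which each corpus's expected frequency is its share of the combined frequency in proportion to its size, and this formula does reproduce the value shown above. The function names and the constant CRITICAL_VALUE_P05 are made up for the example. The p-value helper relies on the fact that, for a chi-squared distribution with one degree of freedom, the upper-tail probability equals erfc(sqrt(x / 2)).

```python
import math


def log_likelihood(freq1: int, size1: int, freq2: int, size2: int) -> float:
    """Log-likelihood for a frequency difference between two corpora.

    All inputs must be absolute (raw) figures, not normalised frequencies.
    """
    total_freq = freq1 + freq2
    total_size = size1 + size2
    # Expected frequencies if X were equally common in both corpora.
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def p_value_1df(ll: float) -> float:
    """Upper-tail probability of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(ll, 0.0) / 2))


CRITICAL_VALUE_P05 = 3.84  # threshold for p < 0.05, as given in the text

# The worked example: 52 in 50,000 tokens versus 57 in 75,000 tokens.
ll = log_likelihood(52, 50_000, 57, 75_000)
p = p_value_1df(ll)

print(f"LL = {ll:.2f}")  # LL = 2.65
print(f"p  = {p:.3f}")
print("significant" if ll > CRITICAL_VALUE_P05 else "not significant")  # not significant
```

Comparing LL against 3.84 and comparing the p-value against 0.05 are two ways of asking the same question, so either check on its own is enough.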

A keyword analysis basically consists of doing this analysis for every word-type in the corpus! The keywords list is sorted by the significance score, with the most significant items at the top.
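A toy version of this procedure might look like the following sketch. It is illustrative only: the function names, the tiny made-up token lists and the decision to keep only items above the 3.84 threshold are all assumptions for the example, and it uses the same two-term log-likelihood formula assumed for the single-word test. Real keyword tools typically offer other statistics and additional filters.

```python
import math
from collections import Counter


def log_likelihood(freq1: int, size1: int, freq2: int, size2: int) -> float:
    """Two-term log-likelihood for a frequency difference between two corpora."""
    total_freq = freq1 + freq2
    total_size = size1 + size2
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def keywords(study_tokens: list[str], reference_tokens: list[str],
             critical_value: float = 3.84) -> list[tuple[str, float]]:
    """Test every word-type in the study corpus against the reference corpus.

    Returns (word, LL) pairs above the critical value, most significant first.
    """
    study_counts = Counter(study_tokens)
    reference_counts = Counter(reference_tokens)
    scored = []
    for word, freq in study_counts.items():
        # Counter returns 0 for words absent from the reference corpus.
        ll = log_likelihood(freq, len(study_tokens),
                            reference_counts[word], len(reference_tokens))
        if ll > critical_value:
            scored.append((word, ll))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy corpora: 100 tokens of study data, 1,000 tokens of reference data.
study = ["corpus"] * 30 + ["keyword"] * 10 + ["the"] * 60
reference = ["the"] * 600 + ["corpus"] * 10 + ["keyword"] * 20 + ["cat"] * 370

ranked = keywords(study, reference)
for word, score in ranked:
    print(f"{word:<10} {score:8.2f}")
```

In this toy data "the" has the same relative frequency in both corpora, so it scores zero and drops out, while "corpus" and "keyword" are both far more frequent in the study corpus and appear in the list in that order.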

Finding out more

If you want to find out more about statistics in corpus linguistics, three of the best readings are Oakes (1998), Baayen (2008) and Gries (2009). Warning! All these books are comprehensive, but involve a very steep learning curve, especially for readers without much background in statistics.


This page was last modified on Monday 8 November 2021 at 9:10 pm.

Tony McEnery and Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom