by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012
 

Answers to exercises: Chapter One discussion questions

Q1-1) Balance and representativeness in corpus design

We asked you to consider the design of a hypothetical corpus, thinking particularly about whether it could be described as balanced and representative.

Our answer: Claims about balance and representativeness must always be qualified, certainly for corpora of this (relatively small) scale. Considering this corpus design in particular, although it covers a fair range of material, there are biases towards certain types of material.

For example, all the spoken data is broadcast speech. Is this representative? Well, if broadcast speech is what the corpus-builders wanted to represent, then in that limited sense we could say that the corpus is representative, perhaps even balanced. But it is not generally representative of speech, as important types of speech (such as informal, colloquial dialogue) are completely missing. So one important qualification about representativeness is that it is always relative to what you are trying to accomplish in the first place, and the scope of what you are investigating.

We could, perhaps, be slightly more confident in calling this corpus representative of written data – since it includes many different types of writing, both published and unpublished, though again not every type of writing is represented. What would be harder would be to defend the written part of this corpus as balanced. Look at the sizes of each section of the corpus and consider them proportionally. There are very great differences in how big each part of the corpus is. The question is, where did these numbers come from? What motivated them? Why, for instance, is there five times as much fiction as there is academic writing? At some point in the design of this corpus, someone made a decision that the corpus would better represent the language as a whole if it had five times the amount of fiction as academic writing. But exactly how could you justify such a decision? You might say that a lot more people read fiction than read academic writing. But is readership the important factor for a balanced corpus – might there not be other factors? Inevitably the choice will have been to at least some degree subjective, goal-oriented, and thus subject to disagreement.
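
One way to consider the section sizes proportionally is to compute each section's share of the whole. The short Python sketch below does this under clearly stated assumptions: the section names and word counts are hypothetical placeholders, not the figures from the corpus design discussed here, and only the five-to-one ratio of fiction to academic writing is taken from the text above.

```python
# Hypothetical section sizes, in words. These are illustrative placeholders,
# NOT the real design figures; only the 5:1 fiction-to-academic ratio
# reflects the discussion in the text.
section_sizes = {
    "fiction": 500_000,
    "academic writing": 100_000,
    "newspapers": 250_000,          # hypothetical
    "unpublished letters": 50_000,  # hypothetical
}


def proportions(sizes):
    """Return each section's share of the total, as a fraction of 1."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def ratio(sizes, a, b):
    """Return how many times larger section a is than section b."""
    return sizes[a] / sizes[b]


if __name__ == "__main__":
    shares = proportions(section_sizes)
    # Print the sections from largest to smallest share.
    for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<20} {share:6.1%}")
    print("fiction : academic =", ratio(section_sizes, "fiction", "academic writing"))
```

Seeing the shares side by side makes the design question concrete: changing any one figure shifts every other section's proportion, so each number needs its own justification.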

And in fact, it is always possible to criticise decisions about balance in corpus design – even if the balance is as straightforward as possible. In this corpus's spoken data, for instance, there is a 50-50 split between news broadcasts and talk shows. We could criticise this, however, on any number of grounds. We might say that many more people watch the news than watch talk shows, so the news broadcast language should have a higher level of representation than the talk shows. Or, alternatively, we might say that talk shows are more often unscripted and unplanned than news broadcasts, so the corpus really ought to contain more talk shows, since they are closer to unplanned, spontaneous speech. Again, which of these arguments you wish to make depends very much on what you are trying to accomplish with the corpus, i.e. what you want the corpus for.

So while this would undoubtedly be a very useful corpus to have, and could in a qualified sense be called balanced and representative, it is not in any unqualified sense balanced and representative. Indeed it is difficult to think of a corpus in existence for which we could make such an unqualified claim.

Q1-2) Have a look at three or four research papers from the recent primary literature on corpus linguistics, and consider where each study stands, relative to the different criteria introduced in Chapter One.

Here are the links to the papers we suggested as reading for this discussion question. Note that some, but not all, are open-access – but hopefully given the length of the list any reader will be able to get hold of at least a subset of these materials.

These papers show the breadth of corpus linguistics with regard to the criteria we have asked you to look at. Reading them should allow you to better understand the oppositions that we have used in this chapter to give the subject of corpus linguistics some shape.

The table below gives a very quick overview of what we think the “answers” are in each case, but note that on many of these points, there is much more to be said than simply pigeonholing each study into one category or the other!

Study                     | Mode of language     | Main approach | Corpus type      | Annotation? | Total accountability? | Mono / multilingual?
Culpeper (2009)           | Written-to-be-spoken | Corpus-based  | Opportunistic[1] | Yes         | Yes                   | Monolingual
Calude (2008)             | Spoken               | Corpus-based  | Sample           | No          | No                    | Monolingual
Chung (2008)              | Written              | Corpus-based  | Opportunistic    | No          | Debatable             | Multilingual
Diani (2008)              | Written AND spoken   | Uncommitted   | Sample           | No          | Yes                   | Monolingual
Hunston (2007)            | Mostly written       | Corpus-driven | Mostly monitor   | No          | Yes                   | Monolingual
Oakes and Farrow (2007)   | Written              | Uncommitted   | Sample           | No          | Yes                   | Monolingual
Inaki and Okita (2006)    | Written              | Corpus-based  | Opportunistic    | No          | Partially             | Monolingual
Biber and Jones (2005)    | Written              | Corpus-based  | Sample           | Yes         | Yes                   | Monolingual
McIntyre et al. (2004)    | Spoken               | Corpus-based  | Sample           | Yes         | Mostly[2]             | Monolingual
Hardie and McEnery (2003) | Spoken               | Corpus-based  | Opportunistic    | No          | Yes                   | Monolingual
Berglund (2000)           | Written AND spoken   | Corpus-based  | Sample           | No          | Yes                   | Monolingual

Notes:

Q1-3) How serious a problem is a failure to replicate a study?

Ideally, replication means just that – an attempt to replicate the original study as closely as possible. The data and tools to be used should be the same. If they are not, a failure to achieve replication does not invalidate the original result.

In reality, however, we would be rightly suspicious of a purported linguistic “fact” that changes radically depending on the methods used to examine it. For example, given that at the most fundamental level, all concordancers work in basically the same way, we would not ignore a major discrepancy in results simply because the researcher used a different search tool! We would want to reconcile the original finding and the replication attempt by trying to find an explanation of the difference between the two studies – that is, a clear understanding of how the differences in the nature of the corpus and/or the programs lead to the differences in results.

If we cannot find such an explanation, then we might very well be justified in withholding judgement until further work is done on the question. When an issue is contested, we would ideally continue to reserve judgement until enough work has been done that the preponderance of the evidence clearly leans one way or the other.

The last question we asked you to consider – “How should we decide to apportion our efforts between replicating existing results versus establishing new results?” – is a very difficult issue. In principle, replication is very important, and we would ideally attempt to replicate every finding before accepting it.

However, in practice, researchers are unlikely to want to undertake a research project whose sole purpose is to apply precisely the same methods as an earlier project just to make sure that the results come out the same. Such a study would probably only be publishable if it were already strongly suspected that the original study was flawed, or if the issue were one where there was already a degree of debate.

For this reason, most studies which you will see in the literature that replicate earlier work also go beyond a straightforward replication in some way – by varying the methods or data in an attempt to show that the original findings are robustly generalisable, for instance.

 
Tony McEnery Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom