Answers to exercises: Chapter One discussion questions

Q1-1) Balance and representativeness in corpus design

We asked you to consider the design of a hypothetical corpus, thinking particularly about these questions:

Is it balanced?
Is it representative?
Can these claims be made for any corpus sampling frame in an absolute sense, or must they always be qualified?

Our answer: Claims about balance and representativeness must always be qualified, certainly for corpora of this (relatively small) scale. Considering this corpus design in particular, although it covers a fair range of material, there are biases towards certain types of material.

For example, all the spoken data is broadcast speech. Is this representative? Well, if broadcast speech is what the corpus-builders wanted to represent, then in that limited sense the corpus-builders could say that the corpus is representative, perhaps even balanced. But it is not generally representative of speech, as important types of speech (such as informal, colloquial dialogue) are completely missing. So one important qualification about representativeness is that it is always relative to what you are trying to accomplish in the first place, and the scope of what you are investigating.

We could, perhaps, be slightly more confident in calling this corpus representative of written data – since it includes many different types of writing, both published and unpublished, though again not every type of writing is represented. It would be harder to defend the written part of this corpus as balanced. Look at the sizes of each section of the corpus and consider them proportionally. There are very great differences in how big each part of the corpus is. The question is, where did these numbers come from? What motivated them? Why, for instance, is there five times as much fiction as there is academic writing? At some point in the design of this corpus, someone made a decision that the corpus would better represent the language as a whole if it had five times the amount of fiction as academic writing. But exactly how could you justify such a decision? You might say that a lot more people read fiction than read academic writing. But is readership the important factor for a balanced corpus – might there not be other factors? Inevitably the choice will have been to at least some degree subjective, goal-oriented, and thus subject to disagreement.
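The proportional comparison suggested above can be sketched in a few lines of Python. This is only an illustration: the section names and word counts below are hypothetical placeholders, not the figures of the corpus design under discussion. The only relationship taken from the text is that fiction is five times the size of academic writing.

```python
# Hypothetical section sizes in words. These are placeholders chosen for
# illustration only; they are NOT the real figures of the corpus design.
section_sizes = {
    "fiction": 500_000,
    "academic writing": 100_000,
    "newspapers": 250_000,
    "unpublished letters": 150_000,
}


def proportions(sizes):
    """Return each section's share of the total as a fraction between 0 and 1."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def ratio(sizes, a, b):
    """Return how many times larger section a is than section b."""
    return sizes[a] / sizes[b]


shares = proportions(section_sizes)

# List the sections from largest to smallest share of the written component.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {share:6.1%}")

fiction_vs_academic = ratio(section_sizes, "fiction", "academic writing")
print(f"fiction : academic writing = {fiction_vs_academic:.1f} : 1")
```

Setting the shares out like this makes the design decisions visible, but it does not justify them: whether a given share is "right" still depends on what the corpus is meant to represent.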

And in fact, it is always possible to criticise decisions about balance in corpus design – even if the balance is as straightforward as possible. In this corpus's spoken data, for instance, there is a 50-50 split between news broadcasts and talk shows. We could criticise this, however, on any number of grounds. We might say that lots more people watch the news than watch talk shows, so the news broadcast language should have a higher level of representation than the talk shows. Or, alternatively, we might say that talk shows are more often unscripted and unplanned than news broadcasts, so the corpus really ought to contain more talk shows, since they are closer to unplanned, spontaneous speech. Again, which of these arguments you wish to make depends very much on what you are trying to accomplish with the corpus, i.e. what you want the corpus for.

So while this would undoubtedly be a very useful corpus to have, and could in a qualified sense be called balanced and representative, it is not in any unqualified sense balanced and representative. Indeed it is difficult to think of a corpus in existence for which we could make such an unqualified claim.

Q1-2) Have a look at three or four research papers from the recent primary literature on corpus linguistics, and consider where each study stands, relative to the different criteria introduced in Chapter One.

Here are the links to the papers we suggested as reading for this discussion question. Note that some, but not all, are open-access – but hopefully given the length of the list any reader will be able to get hold of at least a subset of these materials.

These papers show the breadth of corpus linguistics with regard to the criteria we have asked you to look at. Reading them should allow you to better understand the oppositions that we have used in this chapter to give the subject of corpus linguistics some shape.

The table below gives a very quick overview of what we think the “answers” are in each case, but note that on many of these points, there is much more to be said than simply pigeonholing each study into one category or the other!

Study	Mode of language	Main approach	Corpus type	Annotation?	Total accountability?	Mono / multilingual?
Culpeper (2009)	Written-to-be-spoken	Corpus-based	Opportunistic[1]	Yes	Yes	Monolingual
Calude (2008)	Spoken	Corpus-based	Sample	No	No	Monolingual
Chung (2008)	Written	Corpus-based	Opportunistic	No	Debatable	Multilingual
Diani (2008)	Written AND spoken	Uncommitted	Sample	No	Yes	Monolingual
Hunston (2007)	Mostly written	Corpus-driven	Mostly monitor	No	Yes	Monolingual
Oakes and Farrow (2007)	Written	Uncommitted	Sample	No	Yes	Monolingual
Inaki and Okita (2006)	Written	Corpus-based	Opportunistic	No	Partially	Monolingual
Biber and Jones (2005)	Written	Corpus-based	Sample	Yes	Yes	Monolingual
McIntyre et al. (2004)	Spoken	Corpus-based	Sample	Yes	Mostly[2]	Monolingual
Hardie and McEnery (2003)	Spoken	Corpus-based	Opportunistic	No	Yes	Monolingual
Berglund (2000)	Written AND spoken	Corpus-based	Sample	No	Yes	Monolingual

Notes:

[1] It may seem odd to describe the text of a play as an “opportunistic” corpus for the study of that play. However, using all the available words spoken by particular characters necessarily counts as opportunistic; the corpus has not been designed to representatively sample each character's speech, so the selection is of all that was available – the definition of an opportunistic corpus.
[2] This study is totally accountable to the corpus it creates. However, this corpus is a targeted subset of two larger corpora, and since it is targeted, that means the results of the study are not totally accountable to the larger corpora.

Q1-3) How serious a problem is a failure to replicate a study?

Ideally, replication means just that – an attempt to replicate the original study as closely as possible. The data and tools to be used should be the same. If they are not, a failure to achieve replication does not invalidate the original result.

In reality, however, we would be rightly suspicious of a purported linguistic “fact” that changes radically depending on the methods used to examine it. For example, given that at the most fundamental level, all concordancers work in basically the same way, we would not ignore a major discrepancy in results simply because the researcher used a different search tool! We would want to reconcile the original finding and the replication attempt by trying to find an explanation of the difference between the two studies – that is, a clear understanding of how the differences in the nature of the corpus and/or the programs lead to the differences in results.

If we cannot find such an explanation, then we might very well be justified in withholding judgement until further work is done on the question. When an issue is contested, we would ideally continue to reserve judgement until enough work has been done that the preponderance of the evidence clearly leans one way or the other.

The last question we asked you to consider – “How should we decide to apportion our efforts between replicating existing results versus establishing new results?” – is a very difficult issue. In principle replication is very important, and we would ideally attempt to replicate every finding before accepting it.

However, in practice, researchers are unlikely to want to undertake a research project whose sole purpose is to apply precisely the same methods as an earlier project just to make sure that the results come out the same. Such a study would probably only be publishable if it were already strongly suspected that the original study was flawed, or if the issue were one where there is already a degree of debate.

For this reason, most studies which you will see in the literature that replicate earlier work also go beyond a straightforward replication in some way – by varying the methods or data in an attempt to show that the original findings are robustly generalisable, for instance.
