by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

Answers to exercises: Chapter Eight discussion questions

Q8-1) What implications does the established cognitive relevance of the mutual information statistic for collocations have for how corpus linguists should study collocation?

It would probably be going too far to abandon all other collocation statistics. Mutual information is a measure of collocation strength, so it is not a surprise that it should correspond to intuitive ideas of what does or does not “count” as a single unit; other statistics, such as log likelihood, are instead measurements of how sure we can be that a collocation is not spurious, and they therefore have a high degree of value for practical purposes even if they do not match up to our intuitive sense of what is or is not a strong link. Moreover, we must beware of jumping to conclusions about the fundamental nature of how collocation works in the brain or mind on the basis of intuition about what is or is not a valid collocation. As is well established, speakers’ intuitions about how they use language do not necessarily match up with the reality.
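To make the contrast between the two kinds of statistic concrete, here is a minimal sketch. It assumes the common corpus-linguistic definitions (mutual information as the base-2 log of observed over expected co-occurrence frequency, and log likelihood as the G2 statistic over a 2x2 contingency table). The function names and all the counts are invented for illustration and are not drawn from any real corpus.

```python
import math

def mutual_information(o11, f_x, f_y, n):
    """MI as log2(observed / expected) co-occurrence frequency."""
    expected = f_x * f_y / n
    return math.log2(o11 / expected)

def log_likelihood(o11, f_x, f_y, n):
    """G2 over the 2x2 table built from the pair and word frequencies."""
    observed = [
        o11,                  # X together with Y
        f_x - o11,            # X without Y
        f_y - o11,            # Y without X
        n - f_x - f_y + o11,  # neither X nor Y
    ]
    expected = [
        f_x * f_y / n,
        f_x * (n - f_y) / n,
        (n - f_x) * f_y / n,
        (n - f_x) * (n - f_y) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

N = 1_000_000  # hypothetical corpus size in tokens

# Hypothetical rare pair: two words seen 5 times each, 4 of them together.
mi_rare = mutual_information(4, 5, 5, N)
ll_rare = log_likelihood(4, 5, 5, N)

# Hypothetical frequent pair: two words seen 10,000 times each, 1,000 together.
mi_frequent = mutual_information(1000, 10_000, 10_000, N)
ll_frequent = log_likelihood(1000, 10_000, 10_000, N)

print(f"rare pair:     MI={mi_rare:.2f}  LL={ll_rare:.1f}")
print(f"frequent pair: MI={mi_frequent:.2f}  LL={ll_frequent:.1f}")
```

With these made-up counts the rare pair scores higher on MI, while the frequent pair scores much higher on log likelihood, because there is far more evidence behind it. This is the sense in which the two statistics answer different questions.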

That said, the notion that what matters for psycholinguistic purposes is the strength of a collocation – in terms of the likelihood of element Y occurring given that element X has occurred – is an important insight, if it proves to be generally valid. It would fit very well with models of processing, such as connectionism or Lexical Priming, which essentially conceptualise language production and reception as transitions across a network of nodes with probabilistically weighted links between them. In a model based on transitions, the global likelihood of combination X-Y occurring is not important – what matters, given that we are already at node X, is how strongly weighted the path from X to Y is. However, to fully account for collocation in this kind of view of language processing, we cannot think simply in terms of word 1 transitioning to word 2 transitioning to word 3 ... et cetera. We also need to consider how more abstract levels of collocational behaviour can be treated within such a model – such as Hoey’s nesting, or collocational pairs where X and Y need not be directly adjacent and can occur in either order, or the phenomena of colligation, semantic preference, and semantic prosody. Much research remains to be done on this front.
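The difference between the global likelihood of a combination and the weight of the path from X to Y can be sketched with a toy example. It deals only with the simplest case of directly adjacent words, so it does not touch the more abstract phenomena just mentioned. The token list and the function names are invented for illustration.

```python
from collections import Counter

def transition_probability(tokens, x, y):
    """Estimate P(next word is y | current word is x) from adjacent pairs."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count x only in positions where some word follows it.
    x_as_first = sum(c for (first, _), c in bigrams.items() if first == x)
    return bigrams[(x, y)] / x_as_first if x_as_first else 0.0

def joint_probability(tokens, x, y):
    """Estimate the global probability that an adjacent pair is (x, y)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    return bigrams[(x, y)] / total if total else 0.0

# A made-up token sequence, not real corpus data.
tokens = "the naked eye saw the dog and the cat and the naked eye".split()

# Both pairs occur twice, so their global likelihood is identical ...
joint_the_naked = joint_probability(tokens, "the", "naked")
joint_naked_eye = joint_probability(tokens, "naked", "eye")

# ... but the path from "naked" to "eye" is much more strongly weighted
# than the path from "the" to "naked".
trans_the_naked = transition_probability(tokens, "the", "naked")
trans_naked_eye = transition_probability(tokens, "naked", "eye")

print(joint_the_naked, joint_naked_eye)
print(trans_the_naked, trans_naked_eye)
```

In this toy data, "naked" is always followed by "eye", whereas "the" is followed by "naked" only half the time, even though the two pairs are equally frequent overall.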

Q8-2) Is the corpus a good enough representation of an individual speaker’s lifetime of linguistic experience?

... and if not, what are the consequences for psycholinguistic experiments that use corpus-derived frequencies?

As a first approximation, we don’t consider it unreasonable to treat something like the BNC or Brown Corpus as representative of a speaker’s overall lifetime exposure. However, when we get beyond first approximations, there are clear issues, such as the weighting of speech versus writing (with speech at 10% in the BNC and 0% in Brown – surely not a good match for linguistic experience, especially not the most formative experience in the early years of life) and the weighting of different genres (since there is surely massive variation across individuals here). Any corpus is probably better than no corpus as a representation of an individual’s linguistic experience. But whether a corpus is “good enough” is a much more difficult call, and surely depends largely on the task at hand.

If psycholinguistic experiments use corpus frequencies which do not properly reflect the average speaker’s linguistic experience, then there is a risk of real effects being missed. To expand on this: let’s assume that some aspect of linguistic behaviour is being measured by the experiment, and then assessed for how well it correlates to the frequency of some corresponding feature observed in the corpus. Furthermore, let’s assume that there really is a correlation. If the corpus is in fact not a good match for the participants’ linguistic experience, then there will appear to be less similarity between the experimental measurements and the corpus frequencies than there should be. In other words, there will be more statistical “noise” around the “signal” of the actually-existing effect. If the noise is too great, then the signal may be lost – it may fail to be statistically significant, or even to be observable at all. The risk, then, is of failing to find evidence for real frequency effects in experimental settings.

This risk is probably not great. Both corpus analysis and experimental psycholinguistics use statistical methods that expect a certain amount of noise – and the amount of noise required to completely drown out an effect would be very large. For example, if we are testing collocational links, noisiness arising from a corpus that does not match the participants’ linguistic experience might make us miss a cross-paradigm correlation for a low-frequency or low-strength collocation, but it would be quite a stretch to imagine that very strong or very frequent collocations could be missed. To take one of Sinclair’s examples, it would be very odd indeed for any reasonably general English corpus to fail to provide evidence for the collocation naked eye to which experimental measurements on native speaker perception of that collocation could be correlated.
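The signal-and-noise argument can be illustrated with a small simulation. This is only a sketch under strong simplifying assumptions: each item has a "true exposure" value, the experimental measurement is that value plus some measurement noise, and the corpus frequency is that value plus a mismatch error whose size stands in for how poorly the corpus reflects the participants' experience. All names, distributions, and parameter values are invented for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate(corpus_mismatch_sd, n_items=2000, seed=0):
    """Correlate simulated behaviour with simulated corpus frequency."""
    rng = random.Random(seed)
    # Hypothetical "true" exposure of participants to each item.
    true_exposure = [rng.gauss(0, 1) for _ in range(n_items)]
    # Experimental measure: driven by true exposure, plus measurement noise.
    behaviour = [t + rng.gauss(0, 0.5) for t in true_exposure]
    # Corpus frequency: true exposure plus error from corpus mismatch.
    corpus_freq = [t + rng.gauss(0, corpus_mismatch_sd) for t in true_exposure]
    return pearson_r(behaviour, corpus_freq)

r_good_match = simulate(corpus_mismatch_sd=0.2)  # corpus close to experience
r_poor_match = simulate(corpus_mismatch_sd=3.0)  # corpus far from experience

print(f"good match: r = {r_good_match:.2f}")
print(f"poor match: r = {r_poor_match:.2f}")
```

Under these assumptions the underlying effect is the same in both runs, but the observed correlation is weaker when the corpus is a poor match. It is weakened rather than erased, which is consistent with the suggestion that only weaker effects are at real risk of being missed.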

Is the kind of representativeness a corpus needs to be used as a proxy for some speaker’s whole experience the same kind of representativeness that is aimed for in the design of general corpora such as the BNC?

Almost certainly, it is not. Consider some of the comments we made above about individual variation in exposure to different genres, to begin with. Consider furthermore issues such as the balance of ephemeral text types (e.g. news) versus text types expected to be of persistent cultural standing (e.g. fiction) – while there is no single answer to how we ought to go about balancing these, the answer is almost certain to be different from the perspective of an overall sample of the language variety than it is from the perspective of one individual’s typical exposure.

Consider, finally, whether it is even possible in principle for a corpus design to weight different types of language appropriately, relative to an individual’s linguistic exposure. We do not simply pile all the language that flows into our ears onto a single, undifferentiated pile labelled “linguistic experience”: social context, genre and register matter (this point is made effectively, though in different terms, by Hoey 2005 in his explanation of the role of language exposure in Lexical Priming). The experience of reading a newspaper almost certainly does have some incremental effect on our language system, but it is clearly not the same incremental effect as the experience of hearing a similar-sized chunk of language in conversation with a close friend. But in corpus design, although we can record such contextual, genre and register distinctions, the basic decisions we must make are “Is this type of text in or out?” and “How much of each type should go in?” – which, if the corpus is subsequently treated as a unit, leave the contextual distinctions behind. Perhaps, then, future research mapping psycholinguistic experimental measurements to corpus frequency should take more account of genre and context effects in choosing appropriate corpora or subcorpora as points of comparison.

Tony McEnery and Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom