by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

Answers to exercises: Chapter Six discussion questions

Q6-1) What are the differences between how “pre-corpus” and “post-corpus” dictionaries characterise and present information about words?

The corpus-informed dictionary will doubtless contain information that the pre-corpus dictionary could not reasonably have had access to. One important example of this is relative frequency information for the different parts-of-speech and/or senses of a word. This will probably not be present explicitly, but is very likely to inform the order in which different uses of a word are presented.

Depending on the word, it's quite possible that the “post-corpus” dictionary will contain some meanings or uses that were not present in the “pre-corpus” dictionary. Of course, these might be new senses which have only recently emerged. But there's also a chance that they are meanings which pre-corpus lexicography simply overlooked.

Beyond a simple tally of distinct senses or uses, the account of each sense given by a corpus-informed dictionary is likely to show a clearly higher level of descriptive adequacy than its pre-corpus counterpart. Most critically, the accompanying usage examples of the word in (part of) a sentence will tend to be more naturalistic, as they are probably actual examples from the corpus (possibly edited somewhat). With the best will in the world, writing a dictionary entry based on the lexicographer's intuition will tend to generate usage examples that are very artificial, purely because the kind of language we produce when we're thinking explicitly about the meaning and use of a word is very different to the kind of language we produce when we are not engaged in this kind of metalinguistic reflection.

Some corpus-informed dictionaries, especially those whose compilers have been strongly influenced by neo-Firthian ideas, may also place a much stronger emphasis on the phraseology of words than did pre-corpus dictionaries. This is of a piece with the neo-Firthian principle that the senses of a word are inherently interlinked with the different collocational patterns in which it occurs. So, for instance, the complementation or subcategorisation patterns of a noun, verb, or adjective would be presented in terms of the characteristic phraseologies that realise those complementation patterns. This kind of lexicographically-oriented phraseological analysis is what led, ultimately, to the development of Pattern Grammar.

Q6-2) What are the pros and cons of studying collocation on the basis of lemmata rather than wordforms?

The pros of studying lemmata are twofold. There is a principled advantage, and there is a practical advantage.

The principled advantage is based on the idea that we already know that lemmata are, in some way, a real feature of the language system. That is, it is a “fact” that walk and walking are the same thing in a sense that, say, walk and march are not the same thing. When we think about the usage of “words”, we typically are thinking about words in the sense of lemmata. All this being the case, when we look at the associations of words, it makes sense to group those associations in the same way that we group the forms themselves – that is, at the level of the lemma.
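Grouping the associations of wordforms at the level of the lemma can be sketched very simply. The following is a minimal illustration, not real lemmatisation: the lemma table is a tiny hand-made dictionary (a real lemmatiser would need a full morphological lexicon), and the example sentence is invented.

```python
from collections import Counter

# Toy lemma table (hypothetical; real lemmatisation requires a full
# morphological lexicon, and must disambiguate ambiguous forms in context)
LEMMA = {
    "walk": "walk", "walks": "walk", "walking": "walk", "walked": "walk",
    "march": "march", "marches": "march", "marching": "march", "marched": "march",
}

def lemma_frequencies(tokens):
    """Count tokens at the level of the lemma rather than the wordform."""
    return Counter(LEMMA.get(t.lower(), t.lower()) for t in tokens)

tokens = "He walked while they were walking and she walks".split()
counts = lemma_frequencies(tokens)
# walked, walking and walks are all counted under the single lemma "walk"
```

The same move can be applied to collocate counts: instead of tallying the co-occurrences of each inflectional form separately, they are summed under the lemma.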

This argument from principle is precisely the one that neo-Firthian theory rejects; as we note in Chapter Six, neo-Firthian theory is word-centric and committed to collocation methodologies that begin with the wordforms that actually appear in texts, not with abstract groupings such as the lemma. This is part and parcel of the avoidance, evident to different degrees in the work of different neo-Firthian scholars, of non-corpus-derived or pre-corpus theoretical constructs (of which the lemma is one). Likewise, lemmatisation by its very nature is a form of corpus annotation, which neo-Firthians generally avoid. So Sinclair and those who follow him have consistently objected to basing collocational analysis on lemmata.

But avoiding lemma-based collocation is not merely a theoretical issue, since it is easily demonstrable that different inflectional variants of a single lemma do in fact exhibit different collocational behaviour. This point too was frequently cited by Sinclair and others in support of a wordform-based rather than lemma-based approach to collocation. Here's a simple illustration using the BNC. The top ten collocates of different forms of destroy (L4 to R4, ranked by log-likelihood, minimum frequency 5, minimum frequency of co-occurrence 5):
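The log-likelihood statistic commonly used to rank collocates in this way is Dunning's G², computed over a 2×2 contingency table of node and collocate frequencies. Below is a minimal sketch of such a ranking with the two frequency thresholds mentioned above; the counts are invented for illustration (the actual BNC figures behind the destroy lists are not reproduced here).

```python
import math

def log_likelihood(o11, node_freq, coll_freq, corpus_size):
    """Dunning's log-likelihood (G2) for a node/collocate pair.

    o11         -- observed co-occurrences within the collocation window
    node_freq   -- total corpus frequency of the node wordform
    coll_freq   -- total corpus frequency of the candidate collocate
    corpus_size -- total tokens in the corpus
    """
    # Fill out the 2x2 contingency table from the marginals
    o12 = node_freq - o11
    o21 = coll_freq - o11
    o22 = corpus_size - node_freq - coll_freq + o11
    # Expected values under the null hypothesis of independence
    e11 = node_freq * coll_freq / corpus_size
    e12 = node_freq * (corpus_size - coll_freq) / corpus_size
    e21 = (corpus_size - node_freq) * coll_freq / corpus_size
    e22 = (corpus_size - node_freq) * (corpus_size - coll_freq) / corpus_size
    term = lambda o, e: o * math.log(o / e) if o > 0 else 0.0
    return 2 * (term(o11, e11) + term(o12, e12) + term(o21, e21) + term(o22, e22))

def ranked_collocates(pairs, corpus_size, min_freq=5, min_cooc=5):
    """Rank candidate collocates by log-likelihood, applying the minimum
    frequency and minimum co-occurrence thresholds used in the BNC example.

    pairs: iterable of (collocate, o11, node_freq, coll_freq) tuples.
    """
    kept = [(coll, log_likelihood(o11, nf, cf, corpus_size))
            for coll, o11, nf, cf in pairs
            if cf >= min_freq and o11 >= min_cooc]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Note that when observed co-occurrence exactly matches the expectation under independence, the statistic is zero; the stronger the association, the larger the score.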

There are very, very few commonalities across the three lists. Ozone and which both appear twice, but that's about it. Is this, then, the killer-blow argument against lemma-based collocation? Not necessarily. Consider the nature of some of the differences. Destroys collocates with a number of third person pronouns and determiners whereas destroy does not – but this is merely a reflection of what we know already about these inflectional forms, namely that destroys is the third person form. On the other hand, many of the unique collocates of destroy and destroyed are explicable in terms of their specific non-finite functions (which exist in addition to their functions as present and past tense forms). Destroy is also an infinitive, accounting for its co-occurrence with to and a range of modal verbs; destroyed is also a past participle, accounting for its co-occurrence with grammatical words linked to the passive (by, forms of be) and perfect (forms of have).

The grammatical multifunctionality of these forms is important, because it raises the question: if it is wrong to merge together the behaviour of the different forms of the lemma destroy by working on the basis of lemma, why is it not wrong to merge together the behaviour of destroyed as a past participle and destroyed as a past tense by working on the basis of wordform? (Of course, distinguishing past participle from past tense, or infinitive from present tense, would require POS tags, which in the neo-Firthian approach are typically avoided just as are lemmata.)

That issue aside, the question boils down to this: to what degree are these collocational differences facts about these three word-forms in particular, as opposed to facts about how the grammar of verbs works? If they are just side-effects of things we know already about the grammar system, then they are not particularly interesting, and the use of lemmatisation to focus instead on co-occurrence features that are consistent across all the different inflectional categories, sweeping inflection-dependent differences under the carpet, actually becomes rather appealing. To a large degree your reaction to this dilemma will come down to whether you see the grammatical system as prior to the collocational patterns, or an outgrowth of the collocational patterns – the latter being the typical neo-Firthian viewpoint.

Apart from these theoretical points there is one very substantial practical reason to base statistical collocation analysis on lemmata. This is that, if we are dealing with an infrequent word, we may not have enough statistical evidence to establish a link with each individual wordform, but if we combine all the wordforms together then the higher overall frequency will allow the statistics much more power to detect links.

This is rarely a major issue for English, since the frequency of a lemma – even a verb lemma, which has the greatest range of inflectional forms – is unlikely to be more than four or five times greater than the frequency of its commonest wordform. Compare, however, languages like Arabic, where a single lemma can have dozens of different inflectional forms (and this affects both verbal and nominal categories). The wordforms within even a reasonably frequent lemma can thus be individually very rare. The rationale for working with lemmata becomes much greater in this case, and may make the difference between getting some results versus getting no results at all.
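The power gained by pooling can be made concrete with invented counts. In the sketch below (all figures hypothetical), a lemma has ten inflectional forms, each co-occurring with a given collocate only twice; every individual form falls below a minimum co-occurrence threshold of 5, so a wordform-based analysis reports nothing, whereas the pooled lemma counts pass the threshold and yield a clearly significant log-likelihood score.

```python
import math

def log_likelihood(o11, node_freq, coll_freq, corpus_size):
    """Dunning's log-likelihood (G2) for a node/collocate contingency table."""
    o12 = node_freq - o11
    o21 = coll_freq - o11
    o22 = corpus_size - node_freq - coll_freq + o11
    e11 = node_freq * coll_freq / corpus_size
    e12 = node_freq * (corpus_size - coll_freq) / corpus_size
    e21 = (corpus_size - node_freq) * coll_freq / corpus_size
    e22 = (corpus_size - node_freq) * (corpus_size - coll_freq) / corpus_size
    term = lambda o, e: o * math.log(o / e) if o > 0 else 0.0
    return 2 * (term(o11, e11) + term(o12, e12) + term(o21, e21) + term(o22, e22))

CORPUS_SIZE = 1_000_000
COLLOCATE_FREQ = 500
MIN_COOCCURRENCE = 5

# Ten hypothetical inflectional forms of one lemma:
# each tuple is (co-occurrences with the collocate, total frequency of the form)
forms = {f"form_{i}": (2, 30) for i in range(10)}

# Individually, every form falls below the co-occurrence threshold ...
individually_detectable = [
    f for f, (o11, _) in forms.items() if o11 >= MIN_COOCCURRENCE
]

# ... but pooled into the lemma, the combined evidence passes it comfortably
pooled_o11 = sum(o11 for o11, _ in forms.values())      # 20 co-occurrences
pooled_freq = sum(freq for _, freq in forms.values())   # lemma frequency 300
pooled_ll = log_likelihood(pooled_o11, pooled_freq, COLLOCATE_FREQ, CORPUS_SIZE)
```

The design choice here is exactly the trade-off discussed above: pooling buys statistical power at the cost of averaging away any inflection-specific collocational behaviour.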

Q6-3) What are the theoretical commitments inherent in a view of collocation as constrained by syntax?

In section 6.2 we discussed an approach to collocation that only considers two words as collocating with one another if they occur in a specified grammatical relationship to one another: for instance, verb and direct object, or head noun and premodifying adjective. In this view of collocation, simple co-occurrence of two words in proximity is not enough. The theoretical commitment implicit in this is, basically, the primacy of syntax and grammatical structure: namely, the syntactic relationship comes first, and we can then look for patterns or trends in how lemmata are distributed across the slots in these syntactic relationships. But the abstract syntax remains the overall controlling and structuring principle of the language system.

Since this necessarily implies a grammatical system that exists independently of, and prior to, the collocational and colligational patterns of individual words, it is not compatible with neo-Firthian theory. Again, the neo-Firthian view of lexicogrammar is that the co-occurrence patterns of particular words are the primary fact, and whatever abstract grammatical system exists is a result of, or a generalisation across, these patterns. Procedurally, starting a collocation analysis with certain assumed grammatical patterns or relations – rather than deriving them from examination of the co-texts of the node under investigation – is also entirely contrary to the neo-Firthian view.

Tony McEnery Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom