Answers to exercises: Chapter Eight practical activities

A8-1) Do participants given a questionnaire on collocation produce the same phraseologies and idioms that we would find in a corpus-based collocation analysis?

A questionnaire could elicit collocations in several ways. Let’s assume, for the moment, a straightforward open-ended question for each verb, for instance:

Think about the word RUN. When you use the word RUN, what kind of things often come straight after it in a sentence? Please write down as many as you can think of.

Your participants probably will not come up with the same collocations that you found when you used the corpus to identify collocates immediately after each verb. There will be some overlap between the survey data and the corpus-derived collocations, but it will not be complete. The corpus data is very likely to contain many more combinations and phraseologies; moreover, the combinations that are most obvious in the survey data may not be the most frequent or the statistically strongest collocations in the corpus.

For instance, here’s what one single person came up with, given thirty seconds and the question about run as above.

run fast
run away
run quickly
run a race
run a marathon
run off
run for

Conversely, here are the statistically strongest R1 to R3 collocates of the lemma run from the BNC:

using log likelihood: out, away, through, down, into, to, off, along, up;
using mutual information: amok, aground, 8mhz, errands, gauntlet, gamut, concurrently, thither, 20mhz.
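
To put a rough number on the overlap, the lists quoted above can be compared directly. The following is a minimal Python sketch, not part of the original exercise. It assumes that each of the informant's responses is reduced to the word immediately following run (so run a race and run a marathon both contribute a), which is a simplification, since the corpus lists cover the whole R1 to R3 window.

```python
# Words the informant placed straight after "run", from the list above.
# "a" stands for both "run a race" and "run a marathon" (an assumption).
survey = {"fast", "away", "quickly", "a", "off", "for"}

# Top collocates of the lemma "run" quoted above for each statistic.
log_likelihood = {"out", "away", "through", "down", "into",
                  "to", "off", "along", "up"}
mutual_information = {"amok", "aground", "8mhz", "errands", "gauntlet",
                      "gamut", "concurrently", "thither", "20mhz"}


def overlap(a, b):
    """Return the shared items (sorted) and the Jaccard coefficient."""
    shared = a & b
    return sorted(shared), len(shared) / len(a | b)


ll_shared, ll_jaccard = overlap(survey, log_likelihood)
mi_shared, mi_jaccard = overlap(survey, mutual_information)

print("Shared with log likelihood:", ll_shared, round(ll_jaccard, 3))
print("Shared with mutual information:", mi_shared, round(mi_jaccard, 3))
```

On these lists, only away and off are shared with the log-likelihood collocates, and nothing at all is shared with the mutual information collocates.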

What are the reasons for these differences? Answering that question in detail could fill up a PhD thesis or several academic papers! But in brief, we can identify two likely factors.

First, what participants come up with in the survey will probably be the most cognitively salient combinations, as opposed to those that are most frequent or statistically strongest in a corpus. Salience and frequency do not necessarily coincide; indeed, highly frequent things may go unnoticed precisely because they are so common. Our informant came up with run away, where the prepositional adverb is highly salient as a complete unit modifying the meaning of the verb phrase overall, but not with run to, which is also a strong log-likelihood collocation. In run to, the preposition serves to introduce a subsequent element (the place towards which the running is directed); perhaps, as an incomplete unit, this collocate is in itself less salient.
Second, as we’ve noted in the discussion of exercises from earlier chapters, it is typical for a given phraseology to associate with a specific sense of a word, and not with other senses. If the stimulus puts one particular sense of that word into the mind of a participant, then they are quite likely to think of phraseologies associated with that sense, but not phraseologies associated with other senses. We can see this in the results for run above; run is, of course, highly polysemous. All the examples our informant came up with relate to the “physical motion” sense. None relates to the computer-related sense of “execute a program”, to which the mutual information collocates 20mhz and 8mhz belong, or to the “take place over a period of time” sense, with which concurrently co-occurs.
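
The contrast between the two lists of corpus collocates given earlier also reflects how the two statistics are calculated. As a rough illustration, here is a Python sketch of both measures for a single node-collocate pair, based on a 2x2 contingency table. The frequencies are invented for illustration and are not taken from the BNC, the function name is our own, and the sketch ignores the window-size adjustment to expected frequencies that a real R1 to R3 analysis would need.

```python
import math


def association(o11, f_node, f_coll, n):
    """Return (mutual information, log likelihood) for one word pair.

    o11    -- co-occurrence frequency of node and collocate
    f_node -- total frequency of the node
    f_coll -- total frequency of the collocate
    n      -- corpus size in tokens
    """
    # Observed 2x2 table: node present/absent by collocate present/absent.
    observed = [
        o11,                        # node with collocate
        f_node - o11,               # node without collocate
        f_coll - o11,               # collocate without node
        n - f_node - f_coll + o11,  # neither
    ]
    # Expected values under independence: row total * column total / n.
    rows = (f_node, n - f_node)
    cols = (f_coll, n - f_coll)
    expected = [r * c / n for r in rows for c in cols]

    mi = math.log2(o11 / expected[0])
    ll = 2 * sum(o * math.log(o / e)
                 for o, e in zip(observed, expected) if o > 0)
    return mi, ll


# Hypothetical counts: a rare word that nearly always occurs with the node,
# and a very frequent word that often occurs with it.
rare_mi, rare_ll = association(30, 40_000, 35, 100_000_000)
freq_mi, freq_ll = association(4_000, 40_000, 150_000, 100_000_000)

print(f"rare collocate:     MI={rare_mi:.2f}  LL={rare_ll:.1f}")
print(f"frequent collocate: MI={freq_mi:.2f}  LL={freq_ll:.1f}")
```

With these made-up figures, the rare collocate gets the higher mutual information score and the frequent collocate gets the higher log-likelihood score. This is the general tendency visible in the BNC lists: mutual information favours rare, exclusive partners such as amok, while log likelihood favours frequent ones such as out.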

One potential problem with this kind of methodology is making the analysis of the questionnaire data tractable. There are very few limits on what participants can come up with, and it’s possible to get very diverse responses. Obviously, if every response is entirely unique, it’s then very difficult to draw conclusions about trends or patterns. The most obvious way around this would be to increase the number of participants until it is possible to detect trends or patterns in the responses. Another way would be to adopt a less open-ended methodology. For instance, instead of being asked simply to list whatever they can think of, participants could be prompted with patterns extracted from the corpus, and asked to provide an intuitive judgement as to whether each is a usual phraseology or not. This method produces easily quantifiable and analysable data, at the cost of excluding the possibility of answers that come as a surprise. This is the usual trade-off between open-ended and non-open-ended survey methods.
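
To make the tractability problem concrete, here is a small Python sketch of how open-ended responses might be tallied across participants. The responses are invented for illustration, and the threshold of two participants is an arbitrary assumption rather than a recommended cut-off.

```python
from collections import Counter

# Hypothetical data: one list of continuations of "run" per participant.
responses = [
    ["away", "fast", "a race"],
    ["away", "off", "out of time"],
    ["fast", "away", "a business"],
]

# Count participants (not tokens) per response, so that one participant
# repeating an answer does not inflate its score.
counts = Counter(item for participant in responses
                 for item in set(participant))

# Responses given by at least two participants form a detectable trend.
shared = {item: n for item, n in counts.items() if n >= 2}

# Proportion of distinct responses offered by only one participant.
one_off_share = sum(1 for n in counts.values() if n == 1) / len(counts)

print(shared)
print(round(one_off_share, 2))
```

The higher the proportion of one-off responses, the more participants are needed before any pattern emerges.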

A8-2) Investigate the language acquisition data to be found on the CHILDES website.

The point of this exercise is not the particular things that you find out, but rather to give you a sense of the big gap that exists between the typical data formats and analysis software you are by now familiar with, and those that have emerged as central to developmental psycholinguistics, namely the CHAT format and CLAN software used within the CHILDES project.

You will almost certainly have found that most concordancers other than CLAN cannot do anything sensible with the CHAT format, in which annotations are placed on parallel tiers represented by separate lines. General-use concordancers, if they are aware of annotation at all, typically assume that a word and its tags are adjacent in the file format. In CHAT, the tags that belong to a particular word have to be found by counting the right number of tokens on the line or lines below the word. True, it is computationally not terribly difficult to automatically convert CHAT tiers to a token-by-token representation, which could then be processed by annotation-aware software such as Corpus Workbench (if represented as columnar data) or Xaira (if represented as XML). But the original CHAT files themselves are not readily amenable to analysis using annotation-neutral software such as WordSmith and AntConc, which can normally process raw corpus files directly. Our experience of working with CHILDES data is that you have two basic options: either reformat the data completely, or use CLAN!
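
The tier-to-token conversion mentioned above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a full CHAT parser: the sample utterances are invented, the function name is our own, and the code assumes that each main-tier line is followed by a %mor tier with exactly one tag per whitespace-separated token. Real CHAT files contain continuation lines, special codes and clitic markers that break this assumption.

```python
def chat_to_vertical(lines):
    """Pair each main-tier token with its %mor tag, one token per row."""
    rows = []
    pending = None  # (speaker, tokens) awaiting a %mor tier
    for line in lines:
        if line.startswith("*"):
            # Main tier, e.g. "*CHI:<TAB>more cookie ."
            speaker, _, text = line.partition(":")
            pending = (speaker.lstrip("*"), text.split())
        elif line.startswith("%mor:") and pending is not None:
            speaker, tokens = pending
            tags = line.partition(":")[2].split()
            if len(tags) != len(tokens):
                raise ValueError("tiers are not aligned one-to-one")
            rows.extend((speaker, tok, tag)
                        for tok, tag in zip(tokens, tags))
            pending = None
    return rows


# Invented sample in the general shape of a CHAT transcript.
sample = [
    "*CHI:\tmore cookie .",
    "%mor:\tqn|more n|cookie .",
    "*MOT:\tyou want more ?",
    "%mor:\tpro|you v|want qn|more ?",
]

# Print word, tag and speaker as tab-separated columns.
for speaker, token, tag in chat_to_vertical(sample):
    print(f"{token}\t{tag}\t{speaker}")
```

Output of this kind, with one token per line and its annotations in adjacent columns, is the sort of columnar representation that annotation-aware tools expect. Note that the sketch silently skips any utterance that has no %mor tier.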

The specific exercise we suggested, looking at baby-talk words, will have different results depending on the file(s) you’ve chosen and especially the age of the child. Certainly baby-talk is characteristic of adult speech to young children. One interesting thing is that baby-talk can be found in the language of parents addressing pre-linguistic babies – so it is not simply an effect of caregivers imitating linguistic errors or simplifications by their charges!