by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

Answers to exercises: Chapter Eight practical activities

A8-1) Do participants given a questionnaire on collocation produce the same phraseologies and idioms that we would find in a corpus-based collocation analysis?

A questionnaire could elicit collocations in several ways. Let’s assume, for the moment, a straightforward open-ended question for each verb, for instance:

Think about the word RUN. When you use the word RUN, what kind of things often come straight after it in a sentence? Please write down as many as you can think of.

Your participants probably would not come up with the same collocations that you found when you used the corpus to identify collocates immediately after each verb. There will be some overlap between the survey data and the corpus-derived collocations, but it will not be complete. The corpus data is very likely to contain many more combinations and phraseologies; moreover, the combinations that are most obvious in the survey data may not be the most frequent or the strongest collocations according to the corpus statistics.
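To make the comparison concrete, here is a minimal Python sketch of how right-hand collocates might be counted and scored, and how the overlap with survey responses might be measured. It is an illustration under stated assumptions, not the procedure used for this exercise: the token list and the survey responses are invented, the tokens are assumed to be lemmatised already, and mutual information (with the expected frequency adjusted for window size) stands in as one common association measure among several.

```python
import math
from collections import Counter


def right_window_collocates(tokens, node, max_dist=3):
    """Count words found 1 to max_dist positions to the right of each occurrence of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[i + 1 : i + 1 + max_dist])
    return counts


def mutual_information(tokens, node, max_dist=3):
    """Score each right-hand collocate as log2(observed / expected)."""
    freq = Counter(tokens)
    n = len(tokens)
    observed = right_window_collocates(tokens, node, max_dist)
    scores = {}
    for word, obs in observed.items():
        # Expected co-occurrences if node and word were distributed independently,
        # scaled by the number of window positions examined.
        expected = freq[node] * freq[word] * max_dist / n
        scores[word] = math.log2(obs / expected)
    return scores


# Invented toy data, for illustration only (assumed to be lemmatised already).
tokens = "i run a business and you run a mile then we run out of time".split()
survey_responses = {"away", "a", "out", "fast"}  # hypothetical questionnaire answers

corpus_collocates = right_window_collocates(tokens, "run")
overlap = survey_responses & set(corpus_collocates)
scores = mutual_information(tokens, "run")
```

On a real corpus you would normally also apply a minimum frequency threshold, because mutual information tends to favour rare words.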

For instance, here’s what one single person came up with, given thirty seconds and the question about run as above.

Conversely, here are the statistically strongest R1 to R3 collocates of the lemma run from the BNC:

What are the reasons for these differences? Answering that question in detail could fill up a PhD thesis or several academic papers! But in brief, we can identify two likely factors.

One potential problem with this kind of methodology is making the analysis of the questionnaire data tractable. There are very few limits on what participants can come up with, so responses may be very diverse. Obviously, if every response is entirely unique, it is very difficult to draw conclusions about trends or patterns. The most obvious way around this would be to increase the number of participants until trends or patterns in the responses become detectable. Another way would be to adopt a less open-ended methodology. For instance, instead of being asked simply to list whatever they can think of, the participants could be prompted with patterns extracted from the corpus and asked to provide an intuitive judgement as to whether each is a usual phraseology or not. This method produces easily quantifiable and analysable data, at the cost of excluding the possibility of answers that come as a surprise. This is the usual trade-off between open-ended and closed survey methods.
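As a rough illustration of why judgement data of this kind is easy to quantify, the following Python sketch tallies, for each prompted pattern, the proportion of participants who judged it to be usual. The patterns and responses are invented, and the function name acceptance_rates is our own.

```python
from collections import defaultdict


def acceptance_rates(responses):
    """Return, for each pattern, the proportion of judgements that were 'usual'."""
    yes = defaultdict(int)
    total = defaultdict(int)
    for pattern, judged_usual in responses:
        total[pattern] += 1
        yes[pattern] += int(judged_usual)
    return {pattern: yes[pattern] / total[pattern] for pattern in total}


# Invented responses: (pattern shown to the participant, judged usual?)
responses = [
    ("run a business", True),
    ("run a business", True),
    ("run a business", False),
    ("run a risk", False),
    ("run a risk", True),
]
rates = acceptance_rates(responses)
```

The open-ended version of the questionnaire has no such fixed set of patterns to tally against, which is exactly what makes it harder to analyse.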

A8-2) Investigate the language acquisition data to be found on the CHILDES website.

The point of this exercise is not the particular things that you find out, but rather to give you a sense of the big gap that exists between the typical data formats and analysis software you are by now familiar with, and those that have emerged as central to developmental psycholinguistics, namely the CHAT format and CLAN software used within the CHILDES project.

You will almost certainly have found that most concordancers other than CLAN cannot do anything sensible with the CHAT format of annotations on parallel tiers represented by separate lines. General-use concordancers, if they are aware of annotation at all, typically assume that a word and its tags are adjacent in the file format. In CHAT, the tags that belong to a particular word have to be found by counting the right number of tokens on the line or lines below the word. True, it is computationally not terribly difficult to automatically convert CHAT tiers to a token-by-token representation, which could then be processed by annotation-aware software such as Corpus Workbench (if represented as columnar data) or Xaira (if represented as XML). But the original CHAT files themselves are not readily amenable to analysis using annotation-neutral software such as WordSmith and AntConc, which can normally process raw corpus files directly. Our experience of working with CHILDES data is that you have two basic options – either reformat the data completely, or use CLAN!
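The tier-to-token conversion mentioned above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a full CHAT parser: it assumes that every item on a speaker tier lines up one-to-one with an item on the %mor tier directly below it, the sample transcript and its tags are invented, and real CHILDES files contain continuation lines and further transcription codes that this sketch does not handle.

```python
def chat_to_columns(lines):
    """Pair each item on a speaker tier with the item at the same position on its %mor tier."""
    rows = []
    speaker, words = None, []
    for line in lines:
        if line.startswith("*"):
            # Speaker tier, e.g. "*CHI:\tmore cookie ."
            label, _, text = line.partition(":")
            speaker, words = label[1:], text.split()
        elif line.startswith("%mor:") and speaker is not None:
            # Morphological tier belonging to the speaker tier just read.
            tags = line.partition(":")[2].split()
            if len(tags) != len(words):
                raise ValueError(f"tiers do not align for speaker {speaker}")
            rows.extend((speaker, word, tag) for word, tag in zip(words, tags))
            speaker, words = None, []
        # Header lines (starting with @) and other dependent tiers are skipped.
    return rows


# Invented, simplified sample in the general shape of a CHAT transcript.
sample = [
    "@Begin",
    "*CHI:\tmore cookie .",
    "%mor:\tqn|more n|cookie .",
    "*MOT:\tyou want more ?",
    "%mor:\tpro|you v|want qn|more ?",
    "@End",
]
rows = chat_to_columns(sample)
columnar = "\n".join("\t".join(row) for row in rows)  # one token per line
```

Each row here holds speaker, word and tag, so the result is the kind of one-token-per-line columnar text that annotation-aware tools can read. The alignment check matters: wherever the two tiers have different numbers of items, simple position counting would silently attach tags to the wrong words.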

The specific exercise we suggested, looking at baby-talk words, will have different results depending on the file(s) you’ve chosen and especially the age of the child. Certainly baby-talk is characteristic of adult speech to young children. One interesting thing is that baby-talk can be found in the language of parents addressing pre-linguistic babies – so it is not simply an effect of caregivers imitating linguistic errors or simplifications by their charges!

Tony McEnery and Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom