by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

Answers to exercises: Chapter Four discussion questions

Q4-1) How important is it for learner corpora to be balanced across genre and/or text-type?

Essay-style composition is the preferred text-type for many learner corpora, and this preference is undoubtedly goal-oriented. That is, if the goal of analysing learner corpora is ultimately to help learners improve their performance at writing tasks, then it makes sense to build corpora of the type of writing that they are usually tasked with producing!

In its own terms, this argument is perfectly respectable, and without any doubt learner corpora based on this kind of sampling frame have been of very great practical utility. Yet ultimately, we will want to widen the horizons of learner corpus research. The first work to tackle the homogeneity of learner corpora focused on the lack of spoken data; see, for instance, the LINDSEI corpus (which we discuss on p. 82).

Beyond considering just the parameter of speech versus writing, what kind of text-type distribution might we be interested in seeing in an ideal learner corpus design?

A more ideal learner corpus would contain output from learners in a range of contexts, for example: classroom-based and non-classroom-based; formal and informal; assessed and non-assessed. It would also consider the range of different types of learner – not just university learners but also those picking up the language from a community around their own (as happens, for instance, in many immigrant communities in English-speaking countries). And it would additionally include directly-comparable samples of native speaker language.

Such a corpus would give us a much more powerful explanatory framework for approaching learner language. However, it would also be far harder to build! So perhaps, rather than anticipate a future perfect learner corpus, we should expect the different dimensions of this “ideal” corpus to be addressed piecemeal. The Louvain team's development of LINDSEI, LOCNESS and LOCNEC allows analysis of the contrast between writing and speech on the one hand, and of native speaker language versus learner language on the other. Future projects may address other aspects of the “ideal” model – for instance, writing produced as part of a high-stakes examination as compared to writing produced outside the context of assessment.

Q4-2) How might corpus linguistics have been different, if some language other than English had been its hatching-ground?

This question probably has as many answers as there are human languages. But let us consider a few areas that have attracted a great deal of research, particularly at the level of lexis and/or grammar, which would be far less salient if some other language had been central.

In languages such as those mentioned above (German, Latin, Russian, Arabic and Greenlandic), the contrasts with English arise because English has much less inflectional grammar and relies much more on word order. But there are languages which contrast with English by having even less inflectional grammar. The various dialects of Chinese are a good example here. Chinese raises another issue, which is the prominence (and difficulty) of different kinds of annotation. For English, tokenisation is easy because words are delimited by spaces; POS tagging requires many categories to cover different inflectional forms; and lemmatisation is necessary to level out inflection where it is not wanted in the analysis. In Chinese, there is no inter-word whitespace, and thus tokenisation (also known as word segmentation for Chinese) is both technically challenging and indispensable to the analysis; POS tagging does not need to deal with inflectional subcategories of the major word-classes, but only with functional subcategories; and there is so little inflection that lemmatisation is basically irrelevant.
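The tokenisation contrast can be made concrete with a small sketch. The code below is an illustration only, not a method described in the book: it splits English on whitespace and segments a Chinese string by greedy forward maximum matching, one simple dictionary-based approach among many. The function names and the four-word `TOY_LEXICON` are hypothetical; a real segmenter would need a far larger lexicon and some way of resolving ambiguity.

```python
def tokenise_whitespace(text):
    """Split on whitespace: a workable first pass for English."""
    return text.split()


def segment_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching (illustrative assumption, not the
    book's method): at each position take the longest lexicon entry that
    matches, falling back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, invented purely for this example.
TOY_LEXICON = {"我们", "研究", "语料库", "语言学"}

english = tokenise_whitespace("we study corpus linguistics")
chinese = segment_max_match("我们研究语料库语言学", TOY_LEXICON)
print(english)
print(chinese)
```

The English case needs no linguistic knowledge at all, whereas the Chinese case depends entirely on the lexicon supplied, which is one way of seeing why segmentation is both harder and more consequential for the analysis.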

Q4-3) How might you go about formulating a descriptive grammar of the spoken language?

You will probably discover a need to innovate quite quickly in thinking through this question. Spoken language is a challenge for linguists, and we would generally contend that the practice of basing an analysis solely or predominantly on written language is more prevalent than it should be in corpus linguistics.

Particularly at the grammatical level of analysis, what initially looks like dysfluency or disorganization in spoken language is often revealed, when looking at sufficient data, to be an unexpected regularity that demands a new label and description. It was this confrontation with speech that inspired researchers such as Brazil, Carter and McCarthy. We do not have scope here to give a more detailed review of these authors than that already given on pp. 84 to 88 of our book, but we encourage readers with an interest in grammar to read, in particular, McCarthy and Carter (2001) alongside another grammar with a more traditional focus on the written language. It will, we predict, be an enlightening experience!

In terms of particular procedures for developing a grammar of speech, we would recommend beginning by cataloguing grammatical processes observable in the spoken data that are not observed, or only rarely observed, in writing; as the catalogue of structures and processes grows, it will become possible to generalise and, ultimately, theorise about the status of the grammar of speech.
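As a rough sketch of how such a catalogue might be started, the code below compares normalised frequencies of already-annotated features in a spoken and a written sample and lists those that are absent from, or much rarer in, writing. Everything here is an assumption made for illustration: the function names, the `min_ratio` threshold, the feature labels and all the counts are invented, and identifying the features in the data in the first place remains the hard, manual part of the task.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def speech_candidates(spoken, written, spoken_size, written_size, min_ratio=5.0):
    """Return features attested in speech that are unattested in writing,
    or at least `min_ratio` times more frequent in speech (hypothetical
    threshold; any cut-off is a judgment call)."""
    candidates = []
    for feature, s_count in spoken.items():
        if s_count == 0:
            continue  # not actually observed in speech
        s_freq = per_million(s_count, spoken_size)
        w_freq = per_million(written.get(feature, 0), written_size)
        if w_freq == 0 or s_freq / w_freq >= min_ratio:
            candidates.append(feature)
    return sorted(candidates)


# Invented counts, for illustration only; not drawn from any real corpus.
spoken_counts = {"left dislocation": 40, "tail": 25, "relative clause": 900}
written_counts = {"left dislocation": 2, "relative clause": 1100}

catalogue = speech_candidates(spoken_counts, written_counts,
                              spoken_size=1_000_000, written_size=1_000_000)
print(catalogue)
```

A list of this kind is only a starting point: each candidate still has to be examined in context before it can be described and, eventually, generalised over.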

Tony McEnery and Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom