Answers to exercises: Chapter Four discussion questions
Q4-1) How important is it for learner corpora to be balanced across genre and/or text-type?
The selection of the essay-style composition as the text-type of choice for many learner corpora is undoubtedly goal-oriented. That is, if the goal of analysing learner corpora is ultimately to help learners improve their performance at writing tasks, then it makes sense to build corpora of the type of writing that they are usually tasked with producing!
On its own terms, this argument is perfectly respectable, and without any doubt learner corpora based on this kind of sampling frame have been of very great practical utility. Yet ultimately, we will want to widen the horizons of learner corpus research. The first work to tackle the homogeneity of learner corpora focused on the lack of spoken data; see, for instance, the LINDSEI corpus (which we discuss on p. 82).
Beyond considering just the parameter of speech versus writing, what kind of text-type distribution might we be interested in seeing in an ideal learner corpus design?
An ideal learner corpus would contain output from learners in a range of contexts, for example: classroom-based and non-classroom-based; formal and informal; assessed and non-assessed. It would also consider the range of different types of learner – not just university learners but also those picking up the language from the community around them (as happens, for instance, in many immigrant communities in English-speaking countries). And it would additionally include directly-comparable samples of native speaker language.
Such a corpus would give us a much more powerful explanatory framework for approaching learner language. However, it would also be far harder to build! So perhaps, rather than anticipate a future perfect learner corpus, we should expect the different dimensions of this “ideal” corpus to be addressed piecemeal. The Louvain team's development of LINDSEI, LOCNESS and LOCNEC allows analyses that contrast writing and speech on the one hand, and native speaker language and learner language on the other. Future projects may address other aspects of the “ideal” model – for instance, writing produced as part of a high-stakes examination as compared to writing produced outside the context of assessment.
Q4-2) How might corpus linguistics have been different, if some language other than English had been its hatching-ground?
This question probably has as many answers as there are human languages. But let us consider a few areas that have attracted a great deal of research, particularly at the level of lexis and/or grammar, which would be far less salient if some other language had been central.
- Modal verbs. As we outline in Chapter 5, there has been a lot of corpus-based work on modal verbs and how their use varies (a) across text types and (b) over historical time. But the existence of a class of nine modal verbs that are inflectionally completely distinct from all other verbs – both lexical and auxiliary – is an oddity of English. Even a closely related language such as German does not share it (the German cognates of the English modal verbs do not share the main structural peculiarities of English modals, i.e. that they cannot be non-finite and never change their inflection). More distantly related and unrelated languages organise modality completely differently, and often as a set of verbal inflections rather than as separate auxiliary elements.
- Verb complementation. English relies heavily on word order to indicate grammatical roles such as direct object, indirect object, etc.; and there has been much corpus-based research into, for instance, the variation between the double-object or ditransitive construction and the prepositional-object construction. Some of this research is reviewed in chapters 5 and 7. In a language which used other means of indicating grammatical relations (e.g. case marking or verb agreement marking) this kind of study would not have become so prominent.
- N-grams. Much research into the formulaicity of language – including Biber's Lexical Bundles, or that subset of approaches to collocation which operationalise collocation in terms of “clusters” (see Chapter 6) – has used automatically-extracted frequency lists of n-grams as its main method (see the sketch after this list). Yet this is only a productive approach to formulaicity because (a) English word order is rather fixed, as noted above; and (b) English words have rather few inflectional variants. In a language like Latin, Russian or Arabic, where a single lemma can have dozens of inflectional forms and word order is much more flexible, there would be correspondingly less reason to expect n-grams to be a good method for assessing formulaicity.
- Generally, the centrality of the word. Many techniques within corpus linguistics take the individual word as central – consider, for instance, the statistics that underlie the analysis of collocations and keywords. But the word is by no means conceptually identical across languages. In many languages of North America, for example, multiple lexical roots can be incorporated into a single word as a grammatical, rather than derivational, process. A hypothetical corpus linguistics that emerged from the study of Greenlandic, for instance, would be much less word-centric than corpus linguistics as it exists in fact!
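To make the n-gram method mentioned above concrete, here is a minimal Python sketch of the kind of automatic frequency-list extraction involved. The toy sentence and the choice of trigrams are ours, purely for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous n-token sequence in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A toy tokenised text; in real work this would be a whole corpus.
tokens = "on the other hand it is on the other hand".split()

# Count the trigrams (3-grams) and print the most frequent ones.
trigram_freq = Counter(ngrams(tokens, 3))
for gram, freq in trigram_freq.most_common(3):
    print(" ".join(gram), freq)
```

Because each inflectional variant counts as a distinct type in such a list, the method rewards a language like English, where most lemmas have only a handful of forms; in a highly inflected language, the same formulaic sequence would be scattered across many distinct n-grams.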
In languages such as those mentioned above (German, Latin, Russian, Arabic and Greenlandic), the contrasts with English arise because English has a lot less inflectional grammar and relies a lot more on word order. But there are languages which contrast with English by having even less inflectional grammar. The various dialects of Chinese are a good example here. Chinese raises another issue, which is the prominence (and difficulty) of different kinds of annotation. For English, tokenisation is easy because words are delimited by spaces; POS tagging requires many categories to cover different inflectional forms; and lemmatisation is necessary to level out inflection where it is not wanted in the analysis. In Chinese, there is no inter-word whitespace, and thus tokenisation (also known as word segmentation for Chinese) is both technically challenging and indispensable to the analysis; POS tagging does not need to deal with inflectional subcategories of the major word-classes, but only with functional subcategories; and there is so little inflection that lemmatisation is basically irrelevant.
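The tokenisation contrast can likewise be illustrated with a short Python sketch. For the Chinese side we assume the widely-used third-party jieba segmenter (one of several such tools); the English side needs nothing beyond whitespace splitting.

```python
# English: whitespace already delimits the word tokens.
english = "the cat sat on the mat"
print(english.split())  # ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Chinese: no inter-word whitespace, so a statistical segmenter is needed.
# We assume the third-party jieba library here (pip install jieba);
# any comparable segmenter would make the same point.
import jieba
chinese = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"
print(list(jieba.cut(chinese)))  # e.g. ['我', '来到', '北京', '清华大学']
```

The point is not the particular tool but the asymmetry: the English step is trivial, while the Chinese step requires a trained statistical model before any further analysis can begin.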
Q4-3) How might you go about formulating a descriptive grammar of the spoken language?
You will probably discover a need to innovate quite quickly in thinking through this question. Spoken language is a challenge for linguists, and we would generally contend that the practice of basing an analysis solely or predominantly on written language is more prevalent than it should be in corpus linguistics.
Particularly at the grammatical level of analysis, what initially looks like dysfluency or disorganisation in spoken language is often revealed, once sufficient data has been examined, to be an unexpected regularity that demands a new label and description. It was this confrontation with speech that inspired researchers such as Brazil, Carter and McCarthy. We do not have scope here to give a more detailed review of these authors than that already given on pp. 84 to 88 of our book, but we encourage readers with an interest in grammar to read, in particular, McCarthy and Carter (2001) alongside another grammar with a more traditional focus on the written language. It will, we predict, be an enlightening experience!
In terms of particular procedures for developing a grammar of speech, we would recommend beginning by cataloguing grammatical processes observable in the spoken data that are not observed, or only rarely observed, in writing; as the catalogue of structures and processes grows, it will become possible to generalise and, ultimately, theorise about the status of the grammar of speech.