by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012
 

Answers to exercises: Chapter Five discussion questions

Q5-1) The difficulties of text-sampling for balance and representativeness in long-period diachronic corpora

As the text of this discussion task notes, even over relatively short periods and relatively similar cultures, genres can be quite variable. Genres emerge, flourish and decline fairly quickly – often more quickly than the changes in the language which we wish to study in a corpus that is balanced for genre!

Consider, for instance, the online genres that have emerged in just the past twenty years – email, blogs, online fora, synchronous chat, and so on. Some of these genres have pre-internet parallels (for example, the email as a genre clearly emerges from the genre of the letter); but it would be a very great stretch indeed to ignore the differences between modern online genres and their pre-internet counterparts.

How then do we approach the issues of balance and representativeness? If we want to make our diachronic corpus identically balanced at every temporal stage, then we have no choice but to include only the most long-lived genres, and to ignore genres which either are relatively new (such as all the online genres) or which had relatively short lifespans. However, if we do that, our corpus is no longer representative of the language of each period as a whole. Rather, it is representative of a subset of genres chosen not because they are typical but because they are stable.

What is the way out of this dilemma? It depends on the purpose of the corpus. If it is intended to be used for the comparison of grammatical frequencies across genre and across time, then omitting the short-lived or recently-emerged genres clearly does less harm. (The ARCHER corpus is a good example of a corpus constructed with this kind of purpose in mind.) On the other hand, if our goal is to construct a corpus that represents English at each period as broadly as possible, then we would probably want to shift the sampling frame over time to account for shifts in what genres (a) exist and (b) are important.

To put it another way: where you decide to follow a sampling frame rigidly, and where you decide to employ leeway, is a function of your research goals.

Beyond these conceptual issues, a host of purely practical difficulties bedevil this kind of enterprise. We can only mention a few here. First of all, the further back in time you go, the more difficult it is to acquire already-existing machine-readable texts. While scans of pages may be available, making usable corpus data from page scans requires a great deal of effort. Optical character recognition software has a non-negligible error rate, which tends to be higher for older materials where the legibility of the print may have deteriorated over time; on the other hand, manual re-typing is very labour-intensive. Also, if you wish to seek copyright clearance on a text, the older the text is (while still in copyright) the more difficult it becomes, generally, to track down the owner of the rights to the text. All these issues make it much more challenging to collect balanced, representative corpora for specified periods in the past than it is to sample the present-day language.

Q5-2) Is it problematic that Biber's MD method is largely limited to features at the lexical, morphological and syntactic levels?

It's worth noting, first of all, that nothing in the MD method itself necessarily limits it to the use of lexicogrammatical features. It's certainly true that Biber's studies to date have all used a list of purely lexicogrammatical features, because lexicogrammar is where Biber's and his colleagues' interests lie. However, in principle you could easily compile a list of rhetorical and pragmatic features, extract frequency counts for these features across a corpus of texts from many genres, and then apply Biber's statistical methods to see what dimensions emerge. (Note however that, although easy in principle, this would be very difficult in practice because pragmatic and rhetorical features are by their nature very difficult to search for automatically; manual tagging of very large bodies of text would almost certainly be necessary.) The mathematical procedure would be the same regardless of the features from which the input frequencies have been collected.
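To make the last point concrete, here is a minimal sketch of the data preparation that would precede the factor analysis: raw counts are normalised to a rate per 1,000 words, and the features are then correlated with one another across texts. The text labels, feature names and counts are all invented for illustration, and the factor analysis itself (which a statistics package would run on a matrix like this one) is not shown. Nothing in the code depends on whether a feature is lexicogrammatical or pragmatic; "hedge" stands in here for a hypothetical manually tagged pragmatic feature.

```python
from math import sqrt

# Hypothetical input: text id -> (total words, {feature: raw count}).
# All names and numbers are invented purely for illustration.
texts = {
    "conv_01": (2000, {"past_tense": 40, "second_person": 90, "hedge": 30}),
    "conv_02": (1500, {"past_tense": 30, "second_person": 60, "hedge": 24}),
    "fict_01": (3000, {"past_tense": 240, "second_person": 30, "hedge": 9}),
    "fict_02": (2500, {"past_tense": 175, "second_person": 20, "hedge": 10}),
}
features = ["past_tense", "second_person", "hedge"]


def per_thousand(count, total_words):
    """Normalise a raw count to a rate per 1,000 words."""
    return 1000.0 * count / total_words


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Step 1: one list of normalised rates per feature, in a fixed text order.
rates = {
    f: [per_thousand(counts[f], total) for total, counts in texts.values()]
    for f in features
}

# Step 2: feature-by-feature correlation matrix across texts. This is the
# kind of input on which a factor analysis would then be run.
corr = {(a, b): pearson(rates[a], rates[b]) for a in features for b in features}

for a in features:
    print(a, [round(corr[(a, b)], 2) for b in features])
```

In this toy data the pragmatic feature simply patterns with the second-person feature and against the past-tense feature; whether anything similar would happen with real data is exactly the open empirical question.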

That said, to our knowledge no one has yet attempted such a study. This is a pity, as in fact many of the functional analyses that Biber assigns to the dimensions that emerge from his statistics can be seen as matters of rhetorical purpose (for example, the narrative or persuasive functions). In theory we would expect to see related pragmatic/rhetorical features appearing on the same dimensions as the low-level lexicogrammatical features that have the typical function of expressing those purposes or functions. Even so, it would be very nice to have this assumption empirically confirmed! In general, further empirical research into how the definition of the feature list affects the ultimate dimensions that emerge is still needed.

Q5-3) What kind of speaker-level metadata is desirable in a corpus to be used for variationist-sociolinguistic analysis?

In principle you would wish to collect as much information as possible – a wide range of metadata on any factor about the person which might possibly correlate with any aspect of linguistic variation. Speaker age, speaker sex, the social class of the speaker, their geographical origin and a list of places they have lived might be a sensible start. But in principle there is no limit to how detailed we might get.

For instance, social class is usually considered to be related to occupation. So, in how much detail do we need to record the occupation of each speaker? Do we describe someone as a clerical/managerial worker, or an accountant, or a senior accountant, or a tax accountant, or a senior tax accountant, or a senior tax accountant for an international corporation ... ? All of these are in principle relevant to this person's social status and power relations with other speakers. However, the more detailed the data, the less tractable it becomes. We typically want to use the metadata to group speakers together. But if our description of each person's occupation is maximally detailed, there will not be any meaningful groups. There will be many clerical/managerial workers in the corpus, and maybe even several accountants. However, there is probably only one senior tax accountant for an international corporation. So such a label in the metadata is perhaps not the most useful way to record occupational status. Broader-brush categories may be preferable. That said, for sociolinguistic purposes it is probably a bad idea to collapse all class variation into four categories (AB/C1/C2/DE), as the spoken section of the BNC does!
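The trade-off between detail and tractability can be illustrated with a small sketch. The speaker records, field names and category labels below are hypothetical, invented only to show the effect: grouping on the most detailed occupation label gives groups of one, whereas grouping on a broader label gives groups large enough to compare.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    speaker_id: str
    occupation_detail: str  # maximally detailed label
    occupation_broad: str   # broad-brush category


# Invented example records; not drawn from any real corpus.
speakers = [
    Speaker("S01", "senior tax accountant for an international corporation",
            "clerical/managerial"),
    Speaker("S02", "accountant", "clerical/managerial"),
    Speaker("S03", "office administrator", "clerical/managerial"),
    Speaker("S04", "bricklayer", "skilled manual"),
    Speaker("S05", "electrician", "skilled manual"),
]


def group_sizes(records, field):
    """Count how many speakers share each value of the given metadata field."""
    return Counter(getattr(s, field) for s in records)


detailed = group_sizes(speakers, "occupation_detail")
broad = group_sizes(speakers, "occupation_broad")

print("largest group, detailed labels:", max(detailed.values()))
print("group sizes, broad labels:", dict(broad))
```

One possible design choice, as in this sketch, is to store both levels of description, so that analysts can choose the granularity that suits their research question rather than having it fixed at collection time.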

Beyond characteristics of individual speakers, it would also be useful to examine the social relationships in the data – social networks are often an important part of explanations for language change in variationist sociolinguistics. So, it would be a good idea to include some representation of family relationships or friendship networks as part of the metadata for each speaker. Precisely how that is represented depends in large part on whether the friends and family in question are also speakers in the corpus or not!

Needless to say, the more data that we gather on individual speakers, the more acute ethical concerns become, because participants sacrifice more of their privacy when they agree to participate. Notably, when social metadata becomes extremely detailed, it might be possible for speakers in the corpus to be identified by people who know them in real life, even if their names, addresses, etc. have been thoroughly anonymised in the corpus data and metadata. Another ethical concern is that the more detailed the metadata, the more likely it is that one or more of the questions asked of speakers, for example those mapping the friendship networks, will also require them to reveal embarrassing facts, e.g. who they like and dislike. Avoiding such speaker discomfort where possible is a matter of good research ethics.
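One rough way to gauge the identification risk described above is to count how many speakers have a combination of metadata values that nobody else in the corpus shares. The sketch below does this for a handful of invented records; the field names and values are hypothetical, and a unique combination is only an indicator of possible risk, not a formal measure of anonymity.

```python
from collections import Counter

# Invented speaker metadata; names and addresses are already removed.
speakers = [
    {"id": "S01", "age_band": "45-59", "sex": "F", "region": "North West",
     "occupation": "tax accountant"},
    {"id": "S02", "age_band": "45-59", "sex": "F", "region": "North West",
     "occupation": "teacher"},
    {"id": "S03", "age_band": "25-44", "sex": "M", "region": "North West",
     "occupation": "teacher"},
    {"id": "S04", "age_band": "25-44", "sex": "M", "region": "North West",
     "occupation": "nurse"},
]


def unique_speakers(records, fields):
    """Return ids of speakers whose values for `fields` are shared by no one else."""
    def key(record):
        return tuple(record[f] for f in fields)

    counts = Counter(key(r) for r in records)
    return [r["id"] for r in records if counts[key(r)] == 1]


coarse = unique_speakers(speakers, ["age_band", "sex", "region"])
fine = unique_speakers(speakers, ["age_band", "sex", "region", "occupation"])

print("unique on coarse metadata:", coarse)
print("unique on detailed metadata:", fine)
```

With the coarse fields every speaker shares a profile with at least one other person; once occupation is added, every speaker in this toy example becomes unique.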

In addition to speaker-level metadata, we would also wish to gather information on other elements of the social context in which the speech is being produced. Is it a public or a private setting? Formal or informal? Is the interaction task-focused? In short, there is a very broad range of metadata that could help us begin to understand and account for the variables which might influence the linguistic forms used.

 
Tony McEnery Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom