Answers to exercises: Chapter Five practical activities

A5-1) Analysing texts relative to Biber's dimensions

The nine features we suggested you search for have varying “levels of difficulty”; many of them require POS tags for a reasonably effective search; lemmas also help in some cases. See the tasks following Chapter Two for some information on this.

We cannot provide search patterns for these nine features for every conceivable concordancer and POS tagset. Instead, here are search patterns that will work using the BNCweb interface (you will need a BNCweb account to use it). If you have selected texts from the BNC for your analysis, it is easy to create a subcorpus containing just those texts, activate that subcorpus on the main query screen, and then run each of the nine searches. Alternatively, if you have texts that you have loaded into AntConc or WordSmith, it should not be too difficult to reformulate these searches into wildcard-based searches or regular expressions.

Feature	Search term
1st person pronouns	(I|me|we|us)
Present tense verbs	_V?Z*
Contracted verb forms	('s_V*|'ve|'re|'ll|'d|'m)
3rd person pronouns other than it	(he|she|him|her_PNP)
Past tense verbs	_V?D*
Perfect aspect	{have} (_{ADV})? (_{ADV})? _V?N*
Passive voice	{be} (_{ADV})? (_{ADV})? _V?N*
Past participial clauses	[*] see note below
The adverbial subordinators since, while and whereas	(since_C|while_C|whereas_C*)

[*] This is a difficult search; a workable approximation is to search for _V?N* and then subtract the frequencies of the passive and the perfect (since a past participle only creates a participial clause when it is not used in the passive or perfect).
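As a rough illustration of this subtraction, here is a minimal Python sketch. The function name and the three frequencies are invented for the example; they are not results from any corpus.

```python
def estimate_participial_clauses(all_participles, passives, perfects):
    """Approximate the number of past participial clauses by subtraction.

    all_participles: hits for the bare past participle search (_V?N*)
    passives, perfects: hits for the passive and perfect searches
    """
    estimate = all_participles - passives - perfects
    # Tagging errors and noisy matches can push the difference below zero,
    # so the estimate is clamped at zero.
    return max(estimate, 0)


# Hypothetical frequencies for a single text (not real data).
print(estimate_participial_clauses(all_participles=120, passives=70, perfects=35))
```

The result is only as good as the three searches it is built from, so it is best treated as an estimate rather than an exact count.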

Note that these are not the only search terms that will produce the correct results; alternatives exist for nearly all of them (in some cases with greater or lesser amounts of noise in the results).
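If your texts are outside BNCweb, the tag wildcards in the table can be imitated with a few lines of code. The following Python sketch is one possible approach, not a description of how AntConc or WordSmith work. It assumes that the text is stored in word_TAG format with C5-style tags, and that ? stands for exactly one character and * for any number of characters; the sample sentence and the function name are invented.

```python
from fnmatch import fnmatchcase


def count_tag(tagged_text, tag_pattern):
    """Count tokens whose POS tag matches a wildcard pattern (? and *)."""
    count = 0
    for token in tagged_text.split():
        # Split on the last underscore, so the word itself may contain one.
        word, sep, tag = token.rpartition("_")
        if sep and fnmatchcase(tag, tag_pattern):
            count += 1
    return count


# Invented sample in word_TAG format with C5-style tags (an assumption).
sample = "She_PNP has_VHZ walked_VVN home_AV0 and_CJC he_PNP walks_VVZ too_AV0"

print(count_tag(sample, "V?Z*"))  # tag pattern from the "Present tense verbs" row
print(count_tag(sample, "V?N*"))  # tag pattern for past participles
```

Matching on the tag alone covers the simpler rows of the table; the multi-word patterns for the perfect and passive need a search over sequences of tokens.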

Regarding your results: as noted in the activity description, your precise findings depend on what texts you have selected. And you may well find some slight differences between your own results and what has been reported in the literature, especially since this analysis relies on only a subset of Biber's features. However, we would generally not expect results that radically contradict Biber's findings. Biber's dimensions have been subjected to widespread testing and use and it is highly likely that your findings will be roughly in line with those he would have predicted.

If this is not the case, look at the data again. Is it atypical data, e.g. a scientific report written for a general rather than scientific audience? (It would not be surprising if such a text showed strong differences to what Biber's methods show for scientific journal articles of the sort we find classed as J-texts in the Brown Family.) If so, the technique has not failed – rather it is directing you to consider more carefully the genre classification you have given to a text.

A5-2) How confident are you that these searches return all and only the examples of the structure you wanted?

For some features we can be pretty confident about all-and-only. For instance, there are four first person pronouns and none of those words has any function other than being a first person pronoun. However, any of the searches which involves a POS tag (which is most of them) will fall short of the all-and-only standard simply due to the error rate of the POS tagger.

Beyond this, there are other limits to how good the searches are. They may occasionally match things other than the desired structure, or they may miss cases of the desired structure. For instance, if you worked out a search pattern for the perfect yourself, then you probably searched for the verb HAVE followed by a past participle. However, this will only capture a subset of perfects. It is possible for adverbs to come between the auxiliary and the participle. The more complex pattern given in the table above allows for adverbs, but does not allow for interrogatives. In an interrogative perfect, the subject noun phrase comes between the auxiliary and the past participle. Unfortunately, there is no possible search pattern that can capture all-and-only noun phrases based on just POS tags (there are theoretical reasons for this: basically, search syntax based on regular expressions or wildcards can only describe regular languages, which can never capture a potentially recursive structure, and since a noun phrase can contain other noun phrases, it is potentially recursive). So we can never have a true all-and-only search for perfects (or, mutatis mutandis, passives) based on words, POS tags and regex.
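The gap left by interrogatives can be seen in a small Python sketch. It is an illustration under stated assumptions rather than an exact equivalent of the BNCweb pattern: the text is assumed to be in word_TAG format with C5-style tags, forms of HAVE are identified by tags beginning VH rather than by lemma, AV0 and XX0 stand in for the adverb class, and both sample sentences are invented.

```python
import re

# Assumptions: word_TAG text with C5-style tags; forms of HAVE carry tags
# starting with VH; AV0 and XX0 stand in for the adverb class.
PERFECT = re.compile(
    r"(?<!\S)\S+_VH\w"            # a form of HAVE
    r"(?: \S+_(?:AV0|XX0)){0,2}"  # up to two intervening adverbs
    r" \S+_V\wN\b"                # a past participle
)


def find_perfects(tagged_text):
    """Return every stretch of text that the PERFECT pattern matches."""
    return PERFECT.findall(tagged_text)


# Invented examples (not corpus data).
declarative = "She_PNP has_VHZ never_AV0 seen_VVN it_PNP"
interrogative = "Has_VHZ the_AT0 old_AJ0 man_NN1 seen_VVN it_PNP"

print(find_perfects(declarative))    # the adverb is allowed for
print(find_perfects(interrogative))  # the subject noun phrase blocks the match
```

Extending the pattern to skip over a determiner, an adjective and a noun would catch this particular question, but not one whose subject contains a further noun phrase, which is the recursion problem described above.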

We have not, however, considered corpora with full, manually-postedited syntactic tagging – whether dependency parsing or constituency parsing. In such a corpus, it is possible to use the parsing brackets and/or dependency relations to achieve all-and-only level searches for complex grammatical features such as the perfect or the passive. The prime example of such a corpus is the parsed version of the ICE-GB corpus, whose syntactic annotations are searched using the specialised ICECUP software.

A5-3) Choose a feature often said to differentiate two dialects, and see if the claimed distinction is actually in evidence in the frequency data from a spoken corpus.

There are many, many features that you could look at, and we cannot go through them all here. Let's consider a single example, the could(n't) care less idiom.

A simple search for could care less in the spoken section of the BNC produces zero results, whereas could (n't|not) care less has 17 hits. (Note that this is BNCweb simple query syntax, as above.) These 17 hits are equal to 1.63 per million words. There is also one case of don't care less, but it is surrounded in the transcription by stretches of unclear speech, so we should perhaps not read very much into this one instance.
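The normalisation to a frequency per million words is simple arithmetic, sketched here in Python. The corpus size used is an assumption: a rounded figure of about 10.4 million words for the spoken section of the BNC, chosen to be consistent with the numbers above rather than taken from BNCweb itself.

```python
def per_million(hits, corpus_size):
    """Normalise a raw hit count to a frequency per million words."""
    return hits / corpus_size * 1_000_000


# Assumed approximate size of the spoken BNC in words (not an exact count).
SPOKEN_BNC_WORDS = 10_400_000

print(round(per_million(17, SPOKEN_BNC_WORDS), 2))
```

Normalising in this way is what allows frequencies from corpora of different sizes to be compared directly.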

Using the Longman Spoken American Corpus as a point of comparison, we find one instance of couldn't care less, one of could not possibly care less, and 5 of could care less with no negator. The overall frequency of the idiom is 1.1 per million words.

So, as expected, we see no evidence in UK English for the form without not; but on the US English side, where the form without not is the more frequent one, we find a relatively sizable minority of cases (2 of the 7) using the version with not.

This instantiates the general pattern that we would expect you to find in your results for this exercise, for nearly any feature you pick – namely, these differences are ones of degree rather than absolutes. Dialects (especially the different national standard dialects of English) differ in terms of the frequency of various constructions more often than they differ in terms of the absolute presence or absence of some construction or feature.
