by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012
 

Answers to exercises: Chapter Seven practical activities

A7-1) How can we search a corpus for clauses appropriate for determining the Basic Word Order (BWO) of a language?

In typology the Basic Word Order (BWO) of a language is a key feature for classifying that language’s grammar. BWO is determined on the basis of main, positive, declarative, prototypical transitive clauses, where the subject and object are both full noun phrases (i.e. not pronouns or clauses) and the object is definite – clauses like The architect built the house, to give a fabricated example. The most frequent order in clauses of this specific type is deemed the BWO. Ideally this should be established on the basis of corpus data.

However, it is far from straightforward to search for clauses of this type. This is true even in English, where we can use POS tagging to help us. It is worth noting, though, that a field linguist or typologist actually trying to work out the BWO of an unfamiliar language would not be able to rely on POS tagging, as a POS tagging scheme presumes we already have a model of the grammar! Let us assume for the moment, however, that we are working on English and can use POS tags.

There are numerous ways to approach this task. Creating a search which narrows things down straight away is very difficult, so it is perhaps better to start with as general a search as possible, and to narrow it from there using a series of heuristics. Here is one way to approach the analysis.

  1. Initial query. Use a POS tag to search for “any verb” (since all clauses must contain a verb). We do not need to worry about specifying a finite form, in English anyway, since non-finite forms are often used as part of a complex verb construction that is finite overall.
  2. Filter for negatives. Use a collocation function (or something similar) to filter out any examples where not or n’t occurs close to the node verb (say, within two tokens).
  3. Filter for pronominal subjects and objects. If the verb is immediately preceded or followed by a personal pronoun, or preceded by a determiner, discard it.

The preceding steps can be done automatically, assuming a fairly sophisticated concordancer. The resulting set of clauses would have a lower percentage of unwanted examples than the original set of results for the query. At this point, however, it would be necessary to resort to manual filtering – for instance, by sorting the concordance randomly, and working through the sorted concordance, discarding examples that do not meet the criteria, until enough eligible examples have been collected to make a reasonable estimate of word order frequencies. The resulting sample would not be entirely random, but it would be defensibly close enough to a random sample to pass methodological scrutiny.
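To make the automatic part of this procedure concrete, here is a minimal sketch in Python of how steps 1–3 might be implemented over a POS-tagged corpus. The simplified tag labels (VERB, PRON, DET, NOUN) and the data layout are our own illustrative assumptions, not the output of any particular tagger or concordancer.

    # Sketch of the automatic filtering heuristics (steps 1-3 above), assuming
    # each sentence is available as a list of (word, tag) pairs using a
    # simplified tagset: "VERB", "PRON", "DET", "NOUN", and so on.

    NEGATORS = {"not", "n't"}

    def candidate_verbs(sentence):
        """Yield indices of verb tokens that survive the automatic filters."""
        for i, (word, tag) in enumerate(sentence):
            if tag != "VERB":                       # step 1: any verb
                continue
            # Step 2: discard if not/n't occurs within two tokens of the verb.
            window = sentence[max(0, i - 2):i + 3]
            if any(w.lower() in NEGATORS for w, _ in window):
                continue
            # Step 3: discard if the verb is immediately preceded or followed by
            # a personal pronoun, or immediately preceded by a determiner.
            prev_tag = sentence[i - 1][1] if i > 0 else None
            next_tag = sentence[i + 1][1] if i + 1 < len(sentence) else None
            if prev_tag in ("PRON", "DET") or next_tag == "PRON":
                continue
            yield i

    # The surviving hits would then be randomised and checked manually.
    tagged = [("The", "DET"), ("architect", "NOUN"), ("built", "VERB"),
              ("the", "DET"), ("house", "NOUN")]
    print(list(candidate_verbs(tagged)))            # [2], i.e. "built"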

Of course, it would be equally possible to skip any automatic filtering, and to go straight to manual analysis to collect a random sample. But this increases the number of manual decisions that could just as easily have been made by a machine. (“This one’s followed by not – discard. This one’s preceded by a pronoun – discard...”)

In a parsed corpus, many aspects of the filtering become easier, as you no longer have to infer, for instance, that a verb is negated from the proximity of not – rather, the syntactic relations are present explicitly in the annotation. Precisely how easy the filtering becomes depends on the type of parsing. Dependency parsing should make it very easy, assuming mastery of the parsing scheme and an appropriate concordancer – you would need to specify a search for a verb with both a subject and an object relation, with a noun at the end of each of those dependencies, where the verb itself is not dependent on another verb (i.e. it heads a main clause). Constituency parsing would help, but would also present challenges. You could not just search for [NP] [VP V [NP]], as such a search presupposes SVO order! And for either form of parsing, the accuracy of the analysis would have to be considered. Is it solely the output of an automated tool? If so, is the error rate acceptable? Or has it been manually corrected? If so, to what extent?
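As an illustration of what such a dependency-based search amounts to, here is a rough Python sketch using an off-the-shelf dependency parser (spaCy) rather than a parsed corpus and concordancer; the dependency labels and the very crude test for definiteness are simplifying assumptions on our part.

    # Sketch: pull out main, positive, transitive clauses whose subject and
    # object are full (non-pronominal) noun phrases and whose object is definite.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def bwo_candidates(text):
        for sent in nlp(text).sents:
            for tok in sent:
                # A main-clause verb: not dependent on another verb.
                if tok.pos_ != "VERB" or tok.dep_ != "ROOT":
                    continue
                # Positive: no negation attached to the verb.
                if any(child.dep_ == "neg" for child in tok.children):
                    continue
                subj = next((c for c in tok.children if c.dep_ == "nsubj"), None)
                obj = next((c for c in tok.children if c.dep_ == "dobj"), None)
                # Subject and object must both be full nouns, not pronouns.
                if subj is None or obj is None:
                    continue
                if subj.pos_ not in ("NOUN", "PROPN") or obj.pos_ not in ("NOUN", "PROPN"):
                    continue
                # Crude definiteness test: object introduced by "the"/"this"/"that".
                if not any(d.dep_ == "det" and d.lower_ in ("the", "this", "that")
                           for d in obj.children):
                    continue
                yield sent.text, subj.text, tok.text, obj.text

    for clause in bwo_candidates("The architect built the house. She saw it."):
        print(clause)

Questions and imperatives would still slip through this net, and the parser’s own error rate applies, so manual checking of the output would remain necessary.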

To assess the individual-text approach, we took a single short story (A Colder War by Charles Stross) and extracted all the clauses meeting the syntactic criteria. In this 12,400-word story, we found 31 clauses which are main, positive, declarative clauses with noun phrase subjects and objects where the object is definite. Our general impression was that most clauses are not main, declarative clauses; most main, declarative clauses are not transitive; most transitive clauses have a pronoun as either subject or object; and of those clauses with full noun phrases for subject and object, most have an indefinite object. So the 31 clauses represent a tiny minority.

The next question is how many of these 31 are prototypical, that is, have an agent as the subject and a patient as the object. Obviously, this depends to some extent on what one “counts” as an agent or patient. Taking a fairly broad definition, we consider 19 out of 31 to be prototypical. Nineteen such clauses in 12,400 words works out at roughly one BWO-relevant clause for every 650 words. This may not sound terribly inefficient, but remember that we had to do a grammatical analysis of all 650 words of text in order to pick out that single relevant clause. To us, this does not seem a terribly good means of finding BWO-relevant clauses – especially since many of the clauses we found were clearly of a type that would be especially common in this specific genre, namely fiction, for instance clauses of the form (character) (locates) (character’s body part) such as “Roger shakes his head” or “The colonel ... stretches his feet out”. Such slanted data, located with so much effort, does not seem a very good alternative to an (admittedly very tricky) corpus analysis.

For what it is worth, all 19 of the prototypical clauses we found had subject – verb – object word order!

A7-2) Undertake a basic collostruction-style analysis of the (verb) (someone) in the (body-part) construction.

Here is a description of a suggested search term for this exercise, which can be written in BNCweb/CQPweb simple syntax (as per the answers to the exercises for chapters one and two): a verb, followed by a noun phrase, followed by in the, where a simplified set of noun phrase patterns is allowed – either a pronoun, or a noun preceded by an optional article, adjectives and other nouns. This does not capture every noun phrase but, as we have explained, capturing every noun phrase is in principle not possible. The search pattern does not capture the second noun slot, but this is not necessary – we can look at the R1 collocates instead.

This search pattern is not at all precise, however: there are very many clauses where the object or subject complement of a verb is followed by a prepositional phrase with in without this being an example of the construction in question. In fact, the overwhelming majority of hits for the search term will not be instances of the construction, and you would have to do a very large amount of manual filtering if you worked straight from the concordance. Fortunately, the R1 collocates come to the rescue: if you look through these for body-part nouns, and use those nouns to filter the results, then you are left with a set of concordance lines with a much higher proportion of “hits” to “misses”.
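A sketch of that filtering step, in Python, might look something like the following; the data layout (each concordance line stored with its R1 token) and the list of body-part nouns are illustrative assumptions only.

    # Sketch: count R1 collocates of the search pattern and keep only the
    # concordance lines whose R1 token is a body-part noun.
    from collections import Counter

    BODY_PARTS = {"face", "eye", "eyes", "hands", "chest",
                  "stomach", "mouth", "ribs", "groin"}

    def filter_by_r1(concordance_lines):
        """Each line is assumed to be a dict with an 'r1' key holding the token
        immediately to the right of 'in the', plus the full line under 'text'."""
        r1_counts = Counter(line["r1"].lower() for line in concordance_lines)
        hits = [line for line in concordance_lines
                if line["r1"].lower() in BODY_PARTS]
        return r1_counts, hits

    lines = [{"r1": "face", "text": "slapped him in the face"},
             {"r1": "morning", "text": "arrived in the morning"}]
    counts, hits = filter_by_r1(lines)
    print(counts)                          # Counter({'face': 1, 'morning': 1})
    print([h["text"] for h in hits])       # ['slapped him in the face']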

Among the R1 collocates of the suggested search term in the BNC, the body-part nouns which stand out include face, eye, hands, chest, stomach, mouth, ribs and groin; we discuss each in turn below.

Looking at the concordances, we observe the following. A lot of the instances with face are actually an adverbial in the face of (something). But about half the examples seem to be the construction we were looking for. Verbs used include: bite, elbow, hit, kick, look, punch, stare, shoot, slap, smack, strike, thump, touch. Look and stare, near-synonyms, are especially common, as are punch and strike.

For eye, many of the same verbs co-occur including punch and stare. But overwhelmingly common is look. One additional verb seen with eye is poke. The concordance of hands contains no instances of the construction of interest: all examples are of the form (put) (something) in the hands of (someone) – where verbs of “putting” other than put are often used. Verbs with chest include butt, hit, jab, kick, poke, punch, prod, shoot, stab, strike – i.e. similar to face but without look/stare. Stomach is very similar (though we add thump). Mouth co-occurs with slap as well as a similar range of hitting-verbs to chest and stomach. Ribs also has many of the general hitting-verbs, but specific to it are dig and nudge. Finally, two characteristic verbs with groin are kick and knee.

So, a lot of the verbs involved – especially punch, strike, hit and kick, but also poke – can be used with many body parts, whereas some verbs are specific to particular nouns: dig (someone) in the ribs, nudge (someone) in the ribs, knee (someone) in the groin. Which of these you guessed in advance will vary, but we would be very surprised if any reader had come up with all of them!
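To take this beyond eyeballing and closer to a full collostruction-style analysis, the attraction between a verb and a body-part noun within the construction could be quantified with a Fisher exact test, in the manner of covarying-collexeme analysis. The sketch below assumes scipy is available; the counts in the example call are placeholders, not real BNC figures.

    # Sketch: association between one verb and one body-part noun, measured
    # over all instances of the construction with a 2x2 Fisher exact test.
    from scipy.stats import fisher_exact

    def collexeme_strength(verb_with_noun, verb_total, noun_total, construction_total):
        """Build the 2x2 table (this verb vs. other verbs, crossed with this
        noun vs. other nouns) and test for attraction (odds ratio > 1)."""
        a = verb_with_noun                      # e.g. dig ... in the ribs
        b = verb_total - a                      # this verb with other body parts
        c = noun_total - a                      # other verbs with this body part
        d = construction_total - a - b - c      # all remaining instances
        odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
        return odds_ratio, p_value

    # Placeholder counts for illustration only - replace with counts from the corpus.
    print(collexeme_strength(verb_with_noun=20, verb_total=22,
                             noun_total=60, construction_total=1000))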

To what extent are the verb-noun links explicable, as opposed to being arbitrary facts of phraseology? Some are clearly explicable. There are no instances of anyone getting kneed in anything other than the groin, but knee (someone) in the groin is not an arbitrary combination – the groin is low enough in the body to be a much more likely target of kneeing than, say, the chest. Likewise, it is not arbitrary that the use of this construction with verbs of looking (i.e. look/stare) is linked to face and eye: the face, and specifically the eyes, are where you establish eye contact with someone. However, the more specific sense of look (someone) in the eye – roughly, “approach (someone) with courage” – is more clearly a phraseological fact, as this semantic extension is not wholly predictable.

A7-3) How often is literally used literally as opposed to metaphorically? Could searches for the word literally in a corpus be used to locate examples of metaphorical usage?

When we undertook this exercise, we found just over half of the examples of literally to be used to mark something which is not in fact literal but is actually metaphorical. This will, however, vary somewhat depending on your corpus. In our data (from the BNC), we saw examples of both highly-conventionalised and fairly novel metaphors. For instance, in the blood was literally running down my arms, the use of run to describe the motion of liquid across a surface is conventional to the extent that some people might not even classify it as a metaphor in the contemporary language. On the other hand, in the sentence Jack could feel the waves of hatred washing over him – a real force which literally buffeted him, we are dealing with something rather more like a creative, literary type of metaphor.

Given the approximately half-and-half rate of metaphorical and non-metaphorical meanings expressed, we could only make use of literally as a search term for metaphoricity if we were prepared to spend the necessary time to filter the results. We would also be left with a set of examples that would not represent metaphorical language in general – we have no way of knowing whether or not metaphors used alongside literally are qualitatively different from metaphors used elsewhere, but it doesn’t seem unlikely. Note, for instance, that since literally is an adverb, it is often (if not typically) used in contexts where the metaphorical expression is a verb.

 
Tony McEnery and Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom