by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012
 

Answers to exercises: Chapter Two practical activities

A2-1) How do you search for words ending in -ness?

As we noted, in BNCweb/CQPweb, the command for this search is:

*ness

The * character here is what is called a “wildcard”. In the simple query language used by BNCweb/CQPweb, * stands for “any sequence of zero or more characters”, and you can also use ? for “any single character”. Most concordancers allow searching with wildcards; however, precisely what the wildcard characters mean can vary. AntConc allows * and ? as well, as does WordSmith; but in WordSmith * means specifically “disregard the end of the word” (or the beginning).
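
The behaviour of these two wildcards can be tried out in Python. This is only a sketch: it uses the shell-style patterns of Python's standard fnmatch module as a stand-in, on the assumption that * and ? behave as described above, and the word list is invented for illustration. It is not how any particular concordancer is implemented.

```python
from fnmatch import fnmatchcase

# Hypothetical word list standing in for a corpus vocabulary.
words = ["happiness", "ness", "nest", "brown", "crown", "drown", "frown", "rown"]

# "*" matches any sequence of zero or more characters,
# so the bare string "ness" is matched as well.
ends_in_ness = [w for w in words if fnmatchcase(w, "*ness")]

# "?" matches exactly one character, so "rown" on its own is excluded.
one_char_then_rown = [w for w in words if fnmatchcase(w, "?rown")]

print(ends_in_ness)
print(one_char_then_rown)
```

Note that fnmatchcase compares the whole string against the pattern, which corresponds to a concordancer that always searches for whole words.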

Wildcard searches are quite limited, so many concordancers, including AntConc and CQP, also give you access to a more sophisticated system called regular expressions. A regular expression, or regex, is a search string expressed in a special, fairly standardised language that has many uses in computer programming as well as in text searching. Using regex we can specify much more precise query terms. For instance, a regex can specify a set of characters, any one of which may be matched, by putting the letters into square brackets. This is more precise than using a single-character wildcard. For example, to search for frown or brown using wildcards we would have to specify ?rown, which would also catch crown and drown. With regex, we would be able to search for the following:

[fb]rown

Note that we are assuming here (a) that the concordancer already knows about word boundaries and (b) that the concordancer will always search for a whole word. These assumptions often hold, for example in CQP, but they do not always apply in concordancers that search the underlying text directly. In the latter case, we would have to specify the location of the word boundaries that we want in the regular expression, using the \b code:

\b[fb]rown\b
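
The difference between the two situations can be sketched in Python with the standard re module. The token list and the sentence are invented for illustration; re.fullmatch stands in for a concordancer that matches whole words, and re.findall over a raw string stands in for one that searches the underlying text directly.

```python
import re

# Case 1: the tool already works word by word (whole-token matching).
tokens = ["brown", "crown", "drown", "frown", "frowned"]
whole = [t for t in tokens if re.fullmatch(r"[fb]rown", t)]

# Case 2: the tool searches raw text, so word boundaries must be stated.
text = "She frowned at the brown crown."
unbounded = re.findall(r"[fb]rown", text)    # also hits inside "frowned"
bounded = re.findall(r"\b[fb]rown\b", text)  # whole words only

print(whole)
print(unbounded)
print(bounded)
```

Without \b, the raw-text search picks up the frown inside frowned; with \b, only brown is returned from this sentence.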

That means that using regex, the search term for words ending in -ness will be one of the following:

\w+ness

\b\w+ness\b

... where \w means “any letter or number” and + means “one or more of the preceding thing”.
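
As a small Python sketch of a search for words ending in -ness, here are a version with and a version without explicit word boundaries, applied to an invented sentence. Note that in Python's re module \w also matches the underscore, which may differ from what a given concordancer does.

```python
import re

text = "Kindness and happiness are rarely businesslike."

# Without word boundaries, the match can stop in the middle of a longer word.
unbounded = re.findall(r"\w+ness", text)

# With word boundaries, "ness" must be the end of the word.
bounded = re.findall(r"\b\w+ness\b", text)

print(unbounded)
print(bounded)
```

The first search also returns business, taken from inside businesslike; the second does not.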

We don't have space here to provide a full account of regular expressions, but the site we linked above, regular-expressions.info, while not written for linguists, does provide a good overview of the syntax. One important thing to be aware of is that, while basic search syntax usually assumes that searches should be case-insensitive, in regex this is not assumed, and you have to explicitly specify case-insensitivity if you want it.
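
The point about case can be illustrated in Python, where regex matching is case-sensitive unless a flag says otherwise. How case-insensitivity is requested varies from tool to tool; the re.IGNORECASE flag and the inline (?i) marker shown here are Python's way of doing it, and the sentence is invented.

```python
import re

text = "Brown paper, brown sugar, BROWN bread."

# Default behaviour: only the exact lower-case form matches.
case_sensitive = re.findall(r"\bbrown\b", text)

# Case-insensitivity requested explicitly, via a flag...
case_insensitive = re.findall(r"\bbrown\b", text, flags=re.IGNORECASE)

# ...or via an inline marker at the start of the pattern.
inline = re.findall(r"(?i)\bbrown\b", text)

print(case_sensitive)
print(case_insensitive)
print(inline)
```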

A2-2) How would you create a search to retrieve all examples of both colour and color?

The answer is similar to that in A2-1. If you have a concordancer with a simple search system, you will need to manipulate a very limited number of search functions to achieve this. A system using regular expressions, while more complex to master, will allow you to craft a much more precise search term to explore this spelling variation.

Using the BNCweb/CQPweb simple query wildcards, either of these will work:

Using regular expressions, there are various options:

... to which you might need to add \b to enforce word boundaries, as mentioned above.
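
A few equivalent regex formulations can be compared in Python. The three patterns below are illustrative choices of our own (an optional u, an empty alternative, and two full alternatives), each wrapped in \b to enforce word boundaries; the sentence is invented.

```python
import re

text = "The colour of the color chart; colours and colors; discolour."

# Three ways of saying "color, optionally with a u before the r".
patterns = [r"colou?r", r"colo(?:u|)r", r"(?:color|colour)"]

# Wrap each pattern in \b...\b so that only the whole words are found.
results = {p: re.findall(rf"\b{p}\b", text) for p in patterns}

for pattern, hits in results.items():
    print(pattern, hits)
```

All three return the same two hits. The plural forms and discolour are excluded by the word boundaries; to include the plurals, the pattern itself would need extending, for instance with an optional s.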

A2-3) How can you control whether you find nouns, verbs, or both?

There is no single answer here, as it depends a lot on (a) the format of your tagged data and (b) what kind of concordancer you have: one that searches the underlying files directly, such as AntConc, or one that is aware of annotation and indexes the files according to the structure of the annotation, such as Xaira or CQP.

Let's assume we are working with a concordancer that searches through the underlying files. Let's further assume that we have data in the following, SGML-based format (which is equivalent to the format used in the original edition of the BNC):

<w AV0>Today <w PNP>I <w VM0>will <w VVI>record <w AT0>a <w NN1>song <w PRP>onto <w AT0>a <w NN1>record

(The tags are BNC-style C5 tags.) A search for “record as (any sort of) noun” would then look something like the following:
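
One way to sketch such a search in Python is shown below. The patterns are illustrations of our own, not necessarily the syntax your concordancer expects, and they assume plain three-character C5 tags (so ambiguity tags are not handled).

```python
import re

line = ("<w AV0>Today <w PNP>I <w VM0>will <w VVI>record <w AT0>a "
        "<w NN1>song <w PRP>onto <w AT0>a <w NN1>record")

# "record" preceded by any tag starting with N (the C5 noun tags).
noun_record = re.findall(r"<w (N\w\w)>record\b", line)

# "record" preceded by any tag starting with V (the C5 verb tags).
verb_record = re.findall(r"<w (V\w\w)>record\b", line)

# Both at once: the tag may start with either N or V.
either = re.findall(r"<w ([NV]\w\w)>record\b", line)

print(noun_record, verb_record, either)
```

The parentheses capture the tag, so each result lists the tags under which record was found.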

An alternative, non-standard but widely used, format for tags is to use an underscore to join words to tags, thus:

Today_AV0 I_PNP will_VM0 record_VVI a_AT0 song_NN1 onto_PRP a_AT0 record_NN1

In that case the following will do the trick:
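
For the underscore format, a regex search might be sketched in Python as follows. The patterns are illustrative assumptions rather than the syntax of any particular tool. One pitfall is worth noting: in Python's re module the underscore counts as a word character, so \b does not fall between a word and its tag.

```python
import re

line = "Today_AV0 I_PNP will_VM0 record_VVI a_AT0 song_NN1 onto_PRP a_AT0 record_NN1"

# "record" joined to any tag starting with N.
noun_record = re.findall(r"\brecord_N\w+", line)

# "record" joined to any tag starting with V.
verb_record = re.findall(r"\brecord_V\w+", line)

# Pitfall: there is no word boundary between "record" and "_",
# so a bare whole-word search finds nothing in this format.
plain = re.findall(r"\brecord\b", line)

print(noun_record, verb_record, plain)
```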

In software that is aware of the markup, the underlying format of the data is not important. Rather, what you have to do is separately work out what the search strings are for the word and for the tag, and then work out how to tell the concordancer which search string to apply to the tags. This might be via a graphical interface (using menus or other means to link words to tags) or via a textual query syntax. For example, in CQP, the search term would be this:
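
The idea of applying one search string to the word and another to the tag can be sketched in Python. This is a conceptual illustration only, not CQP syntax and not how CQP is implemented; the Token type and the search function are hypothetical names introduced here.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """One corpus position with its word form and its part-of-speech tag."""
    word: str
    pos: str


tokens = [
    Token("Today", "AV0"), Token("I", "PNP"), Token("will", "VM0"),
    Token("record", "VVI"), Token("a", "AT0"), Token("song", "NN1"),
    Token("onto", "PRP"), Token("a", "AT0"), Token("record", "NN1"),
]


def search(tokens: List[Token], word_pattern: str, pos_pattern: str) -> List[int]:
    """Return positions whose word AND tag each fully match their own regex."""
    word_re = re.compile(word_pattern)
    pos_re = re.compile(pos_pattern)
    return [i for i, t in enumerate(tokens)
            if word_re.fullmatch(t.word) and pos_re.fullmatch(t.pos)]


noun_hits = search(tokens, r"record", r"N.*")
verb_hits = search(tokens, r"record", r"V.*")
print(noun_hits, verb_hits)
```

Because the word and the tag are stored separately, the search does not depend on how they were written in the original file.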

BNCweb and CQPweb actually hide away this complexity by allowing tag searches to be specified with the old-fashioned underscore notation, even though the underlying corpus text does not really have any underscores in it. So record_N* will work in those tools too.

 
Tony McEnery Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom