by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

Monolingual versus multilingual corpora

Many corpora are monolingual – they contain data in only one language. Multilingual corpora, however, come in two types: comparable corpora and parallel corpora.

Comparable corpora

A comparable corpus contains components in two or more languages that have been collected using the same sampling method: for example, the same proportions of texts from the same genres and domains, collected over the same sampling period, in each of the languages. The subcorpora of a comparable corpus are not translations of each other. Rather, their comparability lies in the similarity of their sampling frames. For example, the Lancaster Corpus of Mandarin Chinese (McEnery et al. 2003) was built using the sampling frame of the LOB corpus, which makes the two corpora comparable.
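
As a rough illustration of what sharing a sampling frame means in practice, the Python sketch below checks whether two subcorpora contain the same genres in the same proportions. The function names, the genre labels, the tolerance and the text counts are all invented for this example; they are not the actual figures for LOB or the Lancaster Corpus of Mandarin Chinese.

```python
def genre_proportions(text_counts):
    """Convert raw text counts per genre into proportions of the whole subcorpus."""
    total = sum(text_counts.values())
    if total == 0:
        raise ValueError("subcorpus contains no texts")
    return {genre: count / total for genre, count in text_counts.items()}


def same_sampling_frame(counts_a, counts_b, tolerance=0.01):
    """Return True if both subcorpora cover the same genres in similar proportions.

    `tolerance` is an arbitrary threshold chosen for this illustration.
    """
    if set(counts_a) != set(counts_b):
        return False
    props_a = genre_proportions(counts_a)
    props_b = genre_proportions(counts_b)
    return all(abs(props_a[g] - props_b[g]) <= tolerance for g in props_a)


# Hypothetical text counts: the second subcorpus is half the size of the
# first, but the genres are represented in the same proportions.
english_counts = {"press": 88, "fiction": 126, "academic": 80}
chinese_counts = {"press": 44, "fiction": 63, "academic": 40}
skewed_counts = {"press": 100, "fiction": 20, "academic": 30}

comparable = same_sampling_frame(english_counts, chinese_counts)
not_comparable = same_sampling_frame(english_counts, skewed_counts)
```

Note that comparability here depends on proportions rather than on absolute size: two subcorpora of different sizes can still follow the same sampling frame.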

Parallel corpora

By contrast, a parallel corpus contains source texts in their original language (L1) together with their translations into one or more other languages (L2). In this case, the sampling frame is automatically the same for all the languages in the corpus, because every translation corresponds to one of the source texts. Examples include the Canadian Hansard corpus (Brown et al. 1991) and the CRATER corpus (McEnery and Oakes 1995).

For a parallel corpus to be useful, an essential step is to align the source texts and their translations, annotating the correspondences between the two at the sentence or word level (see Oakes and McEnery 2000 for an overview). Automatic alignment of parallel corpora is possible for some language pairs, but for others it remains a considerable challenge.
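
To give a feel for what sentence-level alignment involves, here is a minimal Python sketch of a toy length-based aligner. It assumes that a sentence and its translation tend to have similar character lengths, and it only allows one-to-one, one-to-two and two-to-one links. This is a deliberate simplification for illustration, not the method used to align any of the corpora mentioned here; the function name and the example sentences are invented.

```python
def align_sentences(source, target):
    """Toy length-based sentence aligner using dynamic programming.

    Returns a list of (source_indices, target_indices) pairs. The cost of a
    link is the absolute difference in character length between its two sides.
    """
    link_types = [(1, 1), (1, 2), (2, 1)]  # sentences consumed on each side
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue  # this state cannot be reached
            for di, dj in link_types:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in source[i:ni])
                tgt_len = sum(len(t) for t in target[j:nj])
                new_cost = cost[i][j] + abs(src_len - tgt_len)
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)

    if cost[n][m] == inf:
        raise ValueError("no alignment possible with the allowed link types")

    # Walk back from the end of both texts to recover the cheapest path.
    alignment = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        alignment.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    alignment.reverse()
    return alignment


# Invented example: the second English sentence is split in two in French.
english = ["The house is small.", "It is old and it has a red door."]
french = ["La maison est petite.", "Elle est vieille.", "Elle a une porte rouge."]
links = align_sentences(english, french)
```

For these toy sentences, the cheapest alignment links the first sentences one-to-one and links the second English sentence to the last two French sentences. Real parallel texts are much less tidy, which is part of why alignment is harder for some language pairs than for others.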

Trying it out!

The best way to see the benefits of an aligned corpus is to run some searches in one! The EUROPARL corpus, which contains European Union documents in English, German, French, Italian, Spanish and Dutch, is aligned throughout at the sentence level. An online interface to this corpus is available. Before running a search, make sure to click on the Simple Query option – otherwise the website will expect you to use CQP, a much more complex query language.

When you run a search on EUROPARL, you will see that every concordance line is followed by a table of the equivalent sentences in all the other languages. This is very useful, for example, if you are interested in a particular word or concept and want to find out whether it is always translated in the same way or not.
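
The idea behind this kind of display can be sketched in a few lines of Python. The example below assumes a sentence-aligned corpus stored as a list of dictionaries, one per aligned unit, keyed by language code. The data structure, the function name and the three example sentences are invented for illustration; they are not taken from EUROPARL and do not reflect how the online interface is implemented.

```python
import re

# Each item is one aligned unit: the "same" sentence in every language.
aligned_units = [
    {"en": "The committee approved the budget.",
     "fr": "La commission a approuvé le budget."},
    {"en": "The budget was rejected by Parliament.",
     "fr": "Le budget a été rejeté par le Parlement."},
    {"en": "The debate is closed.",
     "fr": "Le débat est clos."},
]


def search_aligned(units, query, lang="en"):
    """Find units whose `lang` sentence contains `query` as a whole word.

    Returns a list of (matching_sentence, equivalents) pairs, where
    `equivalents` maps every other language code to its aligned sentence.
    """
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    hits = []
    for unit in units:
        if pattern.search(unit.get(lang, "")):
            equivalents = {code: text for code, text in unit.items() if code != lang}
            hits.append((unit[lang], equivalents))
    return hits


hits = search_aligned(aligned_units, "budget")
for sentence, equivalents in hits:
    print(sentence)
    for code, text in sorted(equivalents.items()):
        print("   ", code, text)
```

Reading down the equivalents for each hit is what lets you check whether a word is translated consistently across the corpus.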


