by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012

The web and legal issues

The most fundamental issue in corpus construction is whether or not you have the legal right to gather and distribute the data you intend to include in your corpus. This issue has become especially pressing due to the increasing importance of the web in corpus building.

Before the age of the web, to collect a text in electronic form it was necessary either to get the original file from the publisher, or to rely on re-typing (time-consuming and expensive) or optical character recognition software (error-prone). With the web, it has become extremely straightforward to download and save large quantities of text to create a corpus. Being able to collect the text, however, does not necessarily mean you have the legal right to reproduce and distribute it. Copyright applies to the web, just as to print.

There are several ways to address this issue.

  1. treat text from the web the same as any other text. That is, the corpus builder contacts the copyright holder and requests permission to redistribute the text within a corpus under the terms of some specified licence.
  2. collect data only from sites which explicitly allow the re-use and redistribution of text. For example, a website may declare that its content is in the public domain, or may place its content under a licence which permits copying and redistribution.
  3. collect data without seeking permission. Then, instead of distributing the texts themselves, make the data available through a tool that does not allow copyright to be breached. Many web corpora are made available through fourth-generation, web-based concordancers (see part two) where only a few words of context around the node word are visible. Since it is impossible to reconstruct the original texts from the snippets in the concordance, this ‘redistribution’ is not a dangerous copyright violation.
  4. redistribute not the downloaded data files, but rather a list of the web addresses from which the corpus has been collected. This does not breach copyright at all, yet any researcher can reconstruct a copy of the corpus from the address list quickly and easily. However, links break over time and pages do not remain available forever, so while this approach has some advantages, it is not a complete solution.
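
The restricted-context display described in the third approach can be illustrated with a minimal sketch. The function name `kwic`, the `window` parameter and the simple regular-expression tokenisation are all assumptions made for this illustration; they are not taken from any particular concordancer, and real web-based tools work over indexed corpora rather than raw strings.

```python
import re


def kwic(text, node, window=4):
    """Return (left context, node, right context) for each hit of `node`.

    Hypothetical helper: only `window` words on either side of the node
    word are exposed, never the full text.
    """
    # Crude tokenisation: lowercase the text and keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    node = node.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    sample = "The cat sat on the mat because the cat was tired."
    for left, word, right in kwic(sample, "cat", window=2):
        # Right-align the left context so the node words line up in a column.
        print(f"{left:>15}  [{word}]  {right}")
```

The choice of `window` is the design decision that matters here: the smaller it is, the less of the original text any single concordance line reveals. What counts as an acceptably small window is a legal judgement that this sketch does not attempt to settle.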


This page was last modified on Monday 16 April 2012 at 10:12 am.

Tony McEnery Andrew Hardie

Department of Linguistics and English Language, Lancaster University, United Kingdom