British National Corpus 2014

A new resource for research and teaching on the contemporary English language


Frequently Asked Questions

General BNC2014 FAQs

Who built the BNC2014? Who funded the project?

The BNC2014 is being compiled by a partnership of linguists at the ESRC Centre for Corpus Approaches to Social Science (CASS) at Lancaster University and ELT experts at Cambridge University Press (CUP). Robbie Love is lead researcher for the Spoken BNC2014 and Abi Hawtin is lead researcher for the Written BNC2014. The team also includes Tony McEnery, Vaclav Brezina, Andrew Hardie and Claire Dembry (CUP).

The construction of the Spoken BNC2014 was jointly funded by CASS and CUP. The construction of the Written BNC2014 is being funded by CASS.

Why is it called the BNC2014?

We used the year 2014 in the name of the corpus for three reasons:

  1. It is 20 years on from the release of the original British National Corpus (1994).
  2. 2014 is the year in which CASS and CUP launched the project.
  3. For the spoken corpus, 2014 is the median year of the data, which was collected between 2012 and 2016.

How should I distinguish the BNC corpora when writing about them?

We recommend the following conventions for writing about the BNC corpora:

  • Original BNC = ‘the BNC1994’
  • New BNC = ‘the BNC2014’
  • Spoken components = ‘the Spoken BNC1994’ and ‘the Spoken BNC2014’
  • Written components = ‘the Written BNC1994’ and ‘the Written BNC2014’

Spoken BNC2014 FAQs

How do I cite the Spoken BNC2014 in my work?

The primary publication for the Spoken BNC2014, which all research using the corpus should cite, is:

  • Love, R., Dembry, C., Hardie, A., Brezina, V. and McEnery, T. (2017). The Spoken BNC2014: designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), pp. 319-344.

Why was the corpus initially accessible only through CQPweb?

For the first 12 months of its release, the Spoken BNC2014 was available exclusively through Lancaster University’s CQPweb server. This allowed us to monitor uptake of the resource.

Have the full text files of the corpus been released?

Yes. As of Autumn 2018, the full corpus is publicly available for download as XML files, along with the associated metadata. The release includes both tagged (POS, lemma, semantic tag) and untagged versions of the XML files. They can be downloaded via the same licence-signup interface that is used for CQPweb access.

What about ‘context-governed’ data?

A key decision we made early in the creation of the Spoken BNC2014 was to collect data only from informal contexts – i.e. data which would be broadly comparable to the ‘demographically-sampled’ component of the Spoken BNC1994. The rationale for gathering recordings from this single type of situational context is simply that there is greater use of, and demand for, conversational data. Researchers who want to look at British English in specific contexts, especially relatively public contexts, tend to collect their own specialized corpora. Moreover, some such specialized corpora have been released publicly by their creators and are available to researchers with an interest in the context in question. These include:

  • the British Academic Spoken English Corpus (BASE), which contains university lectures and seminars (Thompson and Nesi 2001);
  • the Cambridge and Nottingham Business English Corpus (CANBEC; Handford 2007);
  • the Characterizing Individual Speakers (CHAINS) corpus, which represents a variety of speech styles (Cummins et al. 2006);
  • the Nottingham Health Communication Corpus (Adolphs et al. 2004); and
  • the Vienna-Oxford International Corpus of English (VOICE), which comprises face-to-face interactions between speakers of English as a lingua franca (Seidlhofer et al. 2013).

So, researchers with an interest in context-governed English speech already have options open to them. However, a general corpus of informal speech in private contexts is harder to collect, owing to the requirements of size and demographic spread and the difficulty of gaining access to such contexts. It is therefore in much greater demand in the research community.

Why aren’t you making the audio recordings available too?

We understand that there is great research potential associated with the audio files from which the Spoken BNC2014 transcripts were derived. However, the goal of the first phase of the Spoken BNC2014 project was to produce and make available the transcripts, as a corpus, as quickly as possible. Preparing the audio files for release will require a great deal of additional work – the main challenge being to de-identify the 1,000 hours’ worth of audio files, i.e. to ensure that details such as names and addresses are ‘bleeped out’. We do plan to do this in the future, but it was not possible to take on this task on top of the work required to prepare the corpus itself.

Written BNC2014 FAQs

When will the Written BNC2014 be made available to the public?

Updates on our progress compiling this component of the BNC2014, and ultimately an announcement regarding its release date, will be published on the Written BNC2014 page on the CASS website.


This page was last modified on Monday 12 November 2018 at 9:22 am.