British National Corpus 2014

A new resource for research and teaching on the contemporary English language

Frequently Asked Questions

General BNC2014 FAQs

Who built the BNC2014? Who funded the project?

The BNC2014 is being compiled by a partnership of linguists at the ESRC Centre for Corpus Approaches to Social Science (CASS) at Lancaster University and ELT experts at Cambridge University Press (CUP). Robbie Love is lead researcher for the Spoken BNC2014 and Abi Hawtin is lead researcher for the Written BNC2014. The team also includes Tony McEnery, Andrew Hardie and Vaclav Brezina (Lancaster) and Claire Dembry (CUP).

The construction of the Spoken BNC2014 was jointly funded by CASS and CUP. The construction of the Written BNC2014 is being funded by CASS.

Why is it called the BNC2014?

We used the year 2014 in the name of the corpus for three reasons:

  1. It is 20 years on from the release of the original British National Corpus (1994).
  2. 2014 is the year in which CASS and CUP launched the project.
  3. For the spoken corpus, 2014 is the median year of the data, which was collected between 2012 and 2016.

How should I distinguish the BNC corpora when writing about them?

We recommend the following conventions for writing about the BNC corpora:

  • Original BNC = ‘the BNC1994’
  • New BNC = ‘the BNC2014’
  • Spoken components = ‘the Spoken BNC1994’ and ‘the Spoken BNC2014’
  • Written components = ‘the Written BNC1994’ and ‘the Written BNC2014’

Spoken BNC2014 FAQs

How do I cite the Spoken BNC2014 in my work?

The primary publication for the Spoken BNC2014, which all research using the corpus should cite, is:

  • Love, R., Dembry, C., Hardie, A., Brezina, V. and McEnery, T. (2017). The Spoken BNC2014: designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), pp. 319-344.

Why do I need to access the corpus through CQPweb?

For the first 12 months of its release, the Spoken BNC2014 is available exclusively through Lancaster University’s CQPweb server. This allows us to monitor uptake of the resource.

Will the full text files of the corpus be released? When?

Yes. The full corpus will be made publicly available for download as XML files, along with the associated metadata, in September 2018. We will release both tagged (POS, lemma, semantic tag) and untagged versions of the XML files.

What about ‘context-governed’ data?

A key decision we made early in the creation of the Spoken BNC2014 was to collect only data from informal contexts – i.e. data which would be broadly comparable to the ‘demographically-sampled’ component of the Spoken BNC1994. The rationale for gathering recordings from this single type of situational context is simply that conversational data is more widely used and more in demand. Researchers who want to look at British English in specific contexts, especially relatively public contexts, tend to collect their own specialized corpora. Moreover, some of these specialized corpora have been released publicly by their creators and are available to researchers with an interest in the context in question. These include:

  • the British Academic Spoken English Corpus (BASE), which contains university lectures and seminars (Thompson and Nesi 2001);
  • the Cambridge and Nottingham Business English Corpus (CANBEC; Handford 2007);
  • the Characterizing Individual Speakers (CHAINS) corpus, which represents a variety of speech styles (Cummins et al. 2006);
  • the Nottingham Health Communication Corpus (Adolphs et al. 2004); and
  • the Vienna-Oxford International Corpus of English (VOICE), which comprises face-to-face interactions between speakers of English as a lingua franca (Seidlhofer et al. 2013).

So, researchers with an interest in context-governed English speech already have options open to them. However, a general corpus of informal speech in private contexts is harder to collect, both because of the requirements of size and demographic spread and because such contexts are difficult to access. It is therefore much more in demand in the research community.

Why aren’t you making the audio recordings available too?

We understand that there is great research potential associated with the audio files from which the Spoken BNC2014 transcripts were derived. However, the goal of the first phase of the Spoken BNC2014 project was to produce the transcripts and make them available, as a corpus, as quickly as possible. Preparing the audio files for release will require a great deal of work, the main challenge being to de-identify the 1,000 hours’ worth of audio so that details such as names and addresses are ‘bleeped out’. We do plan to do this in the future, but it was not possible to take on this task on top of the work required to prepare the corpus itself.

Written BNC2014 FAQs

When will the Written BNC2014 be made available to the public?

The Written BNC2014 will be made available from autumn 2018. Updates on this component of the BNC2014 project will be published as this date approaches.

This page was last modified on Monday 25 September 2017 at 2:53 am.