BNCweb (CQP-Edition)

A web-based interface to the
British National Corpus

 

What is BNCweb?

BNCweb is a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). It relies on the Corpus Query Processor (CQP) of the IMS Open Corpus Workbench to provide a convenient interface between the user and the rich variety of annotated text in the 100-million-word BNC in its most recent incarnation, the XML version.

Main advantages of BNCweb:

  • It is very user-friendly.
  • It is a web-based application - end users do not need to install any extra software on their computers. Any web browser (on any platform) will do.
  • It is fast! Have you ever wanted to calculate collocations for the noun lemma TIME? A collocation analysis of its 180,243 instances in the BNC takes just under 30 seconds on a BNCweb installation running on an entry-level Apple MacBook.
  • It is powerful and flexible: in addition to basic queries (available via an intuitive query syntax), the interface allows more complex searches using full-fledged CQP-syntax.
  • It is optimized for use by larger groups of users: a cache system minimizes CPU-load and disk space usage when different users perform the same queries.
  • It is absolutely free (its components are released under the GNU General Public License).

Features of BNCweb

BNCweb offers a whole range of features for corpus analysis (concordance display, sort, collocations, distribution analysis, etc.). The following list is a basic overview of the different functions.

Name: Description
Query result:
  • Standard concordance display (KWIC-view and sentence view) with one-click access to the larger context of query results and the relevant speaker and file information (such as 'Age of speaker', 'Domicile of author', 'Date of creation' etc.). Restrictions for written and spoken texts can be applied to your lexical search.
  • Concordance lines can be displayed in random order.
  • Immediate access to the n-th instance of your query result. No upper restrictions apply - you can for example look at the 5,456,345th instance of 'the'.
  • Query results always include normalised frequency information (instances per million words). This also applies to queries which are based on metatextual restrictions or subcorpus searches.
Thin: Reduces the number of lines in the concordance window, e.g. by creating a random subset of 1000 solutions.
Distribution: Creates descriptive statistics for your query result, e.g. its distribution over the 9 text domains in the written component of the BNC. Includes normalised frequency counts and information about the total number of words in each category. Crosstabulation over two categories is also possible. BNCweb also lists the files in which the query result occurs most/least frequently.
Sort: Sorts the query result alphabetically on any of five positions to the left or right of your node. POS-tag restrictions can be applied to a sorted result and a frequency list of the chosen position can be compiled (with absolute frequencies and percentage information).
Collocations: Compiles a ranked list of collocates (both word forms and lemma forms) for your query result according to several statistical methods: Mutual information, MI3, Z-score, T-score, Chi-squared with Yates' correction, Log-likelihood, and Modified Dice coefficient. Various parameters can be set to allow users optimal flexibility in their analysis (e.g. flexible window-size, including asymmetrical windows, etc.).
Save queries: Query results can be saved for future access. This is useful for queries which were manually post-processed by the user (e.g. using the 'delete hits' function).
Categorize query result: Individual hits of a query result can be manually categorized according to a user-defined set of values. Once categorization is complete, all post-processing features of BNCweb (such as distribution analysis, collocations, etc.) can be applied to categorized data separately.
Download: Query results can be downloaded to your hard-disk in a tab-delimited format (including metatextual information, if required). This data can be imported into the spreadsheet program of your choice for manual annotation. By default, the corpus positions (i.e. internal references to corpus tokens) of all matches will also be downloaded. This information is required to re-import a query result (or parts of a query result) back to BNCweb (see below).
Upload external data file: This feature can be used to upload a file with corpus positions to the BNCweb server to create a new saved query. This function can be used to re-import a manually postprocessed (e.g. cleaned) query result to make it available for use with any of the automated post-processing features offered by BNCweb (collocations, distribution, etc.).
Scan keywords/titles: Retrieves a list of BNC text files on the basis of the classification contained in the <title> and <keyword> elements of the file headers. This list can be used to define a subcorpus.
Explore genre labels: Retrieves a list of BNC text files on the basis of David Lee's genre classification scheme encoded in the file headers. This list can be used to define a subcorpus.
Create/edit subcorpora: Offers the user several options to create (and edit) user-definable subcorpora: by manually entering a list of filenames or via the keyword/title/genre search features.
Browse a file: Provides access to the larger context of any <s>-unit in the BNC via its file ID and <s>-unit number. The user has a choice between POS-tagged and untagged output. In addition, word-class colouring is available.
Word lookup: Produces alphabetically ordered lists of lexical items (<w>-units) and lemma forms - e.g. all words (or lemmata) starting with 'help' (help, help-desk, helplessness, etc.).
Frequency list: Displays a frequency list of items (<w>-units or lemmata) in the whole BNC or any user-defined subcorpus. Users can define restrictions for frequency lists in various ways (POS-tags, regular expression patterns, frequency range, etc.).
Keywords: Compares frequency lists for two (user-definable) subcorpora in the BNC and identifies items that are particularly frequent or infrequent in one of the two lists.
Query history: Lists all queries performed by the user - including the date and time of the query and the number of retrieved hits. Queries can be re-executed by clicking on a link.
User settings: Influences standard output formats as desired by the user (e.g. size of context display in number of <s>-units, KWIC or standard sentence display, etc.).
Recursive use of features: All features of BNCweb are available recursively, i.e. you can for example first get a distribution of your query result over 'Age of author', choose only those hits written by authors aged 25-34, sort them, get a frequency list on the second word to the right of the node, click on the most frequent item in the list and calculate collocations of your query result on the basis of the remaining sentences. There is no upper limit for the number of steps that can be combined.
Cached queries: All queries are cached (up to a sysadmin-definable maximum of disk space) and can be re-executed instantaneously. Cached queries (and MySQL tables created for the post-processing features of BNCweb) are available to all users on the same server - this radically reduces disk space usage when whole groups of users perform the same query.
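
To make the normalised frequencies and some of the collocation statistics listed above more concrete, the following Python sketch computes instances per million words and four of the simpler association measures (Mutual information, MI3, Z-score, T-score) from their common textbook definitions. It is only an illustration: the function names are hypothetical, the expected frequency is estimated in a very simple way, and BNCweb's own implementation may differ in detail.

```python
import math


def per_million(hits, corpus_size):
    """Normalised frequency: number of instances per million words."""
    return hits * 1_000_000 / corpus_size


def expected_frequency(node_freq, collocate_freq, corpus_size, window_size):
    """Simple estimate of how often the collocate would fall inside the
    collocation window around the node if words were distributed at random.
    (Assumption: a basic textbook estimate, not necessarily BNCweb's.)"""
    return node_freq * collocate_freq * window_size / corpus_size


def association_scores(observed, expected):
    """Score a collocate from its observed (O) and expected (E) frequency."""
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected frequencies must be positive")
    return {
        "MI": math.log2(observed / expected),             # log2(O / E)
        "MI3": math.log2(observed ** 3 / expected),       # log2(O^3 / E)
        "Z-score": (observed - expected) / math.sqrt(expected),
        "T-score": (observed - expected) / math.sqrt(observed),
    }
```

For example, per_million(180243, 100_000_000) gives roughly 1,802 instances per million words for the 180,243 hits of TIME mentioned earlier, assuming a round corpus size of 100 million words.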

Corpus Linguistics with BNCweb

Hoffmann, Sebastian, Evert, Stefan, Smith, Nicholas, Lee, David and Ylva Berglund Prytz. 2008. Corpus Linguistics with BNCweb - a Practical Guide. Frankfurt am Main: Peter Lang.

Abstract:
This book presents a richly illustrated, hands-on discussion of one of the fastest growing fields in linguistics today. The authors address key methodological issues in corpus linguistics, such as collocations, keywords and the categorization of concordance lines. They show how these topics can be explored step-by-step with BNCweb, a user-friendly web-based tool that supports sophisticated analyses of the 100-million-word British National Corpus. Indeed, the BNC and BNCweb have been described by Geoffrey Leech as "an unparalleled combination of facilities for finding out about the English language of the present day" (Foreword). The book contains tasks and exercises, and is suitable for undergraduates, postgraduates and experienced corpus users alike.

This book was short-listed for the BAAL Book Prize 2009.

Sample Chapter: Table of Contents and Chapter 1

Buying the book: The book is now available through amazon.co.uk, amazon.de and amazon.com. It can also be ordered directly online through Peter Lang. You can also fill in this order form and send it to Peter Lang. Availability on amazon.co.jp is currently restricted to resellers who charge a ridiculously high price. We are in contact with Peter Lang about this and hope that this situation will improve soon.

Access to the BNC via BNCweb at Lancaster University

The BNC can be accessed via a service hosted at Lancaster University. The service is free of charge and available to anybody who registers with a valid e-mail address.

Since the BNC is a licensed product, certain access restrictions are implemented:

  • Standard users do not have access to the larger context of individual concordance lines.
  • Query results with more than 5000 matches are automatically downsampled to 5000 concordance lines. The selection is random but reproducible.
  • The "Browse a Text" feature is disabled.
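
A selection that is "random but reproducible" is typically obtained by seeding the random number generator with a fixed value, so that the same query result always yields the same subset. The Python sketch below illustrates the idea; the function name, the default seed and the choice to keep the selected hits in their original order are assumptions, not a description of how the Lancaster server actually works.

```python
import random


def downsample(hits, limit=5000, seed=0):
    """Return at most `limit` hits, chosen at random but reproducibly.

    A private, seeded generator is used, so calling the function again
    with the same hits and the same seed gives the same selection.
    """
    if len(hits) <= limit:
        return list(hits)
    rng = random.Random(seed)  # fixed seed -> reproducible selection
    chosen = sorted(rng.sample(range(len(hits)), limit))
    return [hits[i] for i in chosen]  # keep the original order of the hits
```

Using a separate random.Random instance rather than the module-level functions keeps the selection independent of any other random numbers drawn elsewhere in the program.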

Please note that these restrictions may change at any time without prior notice.

Current license holders of the BNC (XML-version) may in the near future be able to get full access to the corpus via the Lancaster server. Registered users of BNCweb will be notified of this option by e-mail once details are available.

URL for signing up: http://bncweb.lancs.ac.uk/bncwebSignup/

What's the story behind BNCweb? Who created it?

The original BNCweb was created by Hans Martin Lehmann, Sebastian Hoffmann and Peter Schneider at the University of Zurich. It was mainly written because the English Department in Zurich had no Windows computers to run the SARA-client provided as part of the BNC distribution. What started as a quick hack soon became a fully-fledged corpus tool. The functionality of BNCweb initially relied on the SARA server software to access the information contained in the BNC, but the integration of the powerful relational database system MySQL meant that the range of available features could be greatly extended. BNCweb was first publicly released in May 2002. An evaluation of the features offered by this earlier version of BNCweb can be found in a review that was published on the Linguist List.
The CQP-version of BNCweb no longer relies on the SARA server but instead uses the powerful Corpus Query Processor (CQP) of the Corpus Workbench. The current version of the interface was created by Stefan Evert (University of Osnabrück) and Sebastian Hoffmann (Lancaster University). Further information on the CQP-version of BNCweb can also be found in Hoffmann & Evert (2006).

Installation requirements

Users may wish to install BNCweb on their own server. It has a client-server architecture: it is designed to give a (potentially large) number of concurrent users access to a server-side BNCweb installation via their standard web browsers.

As a result, BNCweb requires no special installation procedure or client program on the part of the end user - any web browser (under any operating system) can be used to access the BNC via the Internet or a local area network.
Please note that it is not possible to use BNCweb in conjunction with the BNC Online service hosted by the British Library.

On the server side, BNCweb consists of a set of Perl scripts that require installation on a UNIX system such as Linux, Mac OS X, Sun Solaris, etc. While the installation procedure has been much improved in comparison with earlier versions of BNCweb, some basic knowledge of UNIX system administration is still required. An installation manual (in text-only format) is available here.

The following tools and libraries need to be installed on your server; many of them may already be pre-installed on your system:
  • xsltproc (from the Gnome LibXSLT package)
  • MySQL 4.1 or higher - 5.0 recommended (http://www.mysql.com)
  • Perl 5.8 (version 5.6 may work, too, but might require installation of additional modules)
  • Perl modules:
    • DBI
    • DBD::mysql
    • HTML::Entities
    • Parse::RecDescent
A full installation of BNCweb requires about 3 GB of disk space. The indexing process temporarily needs considerably more: approx. 8-10 GB of free disk space should be more than sufficient.
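
Before starting the installation, it can be useful to check which of the tools and Perl modules listed above are already present. The following Python sketch looks for the executables on the PATH and asks Perl to load each module (perl -MModule -e 1 exits with an error if the module cannot be loaded). The executable names (in particular mysql for the MySQL client) and the function names are assumptions; the script does not check version numbers and is not part of the BNCweb distribution.

```python
import shutil
import subprocess

# Assumed executable names for the tools listed above.
REQUIRED_TOOLS = ["xsltproc", "mysql", "perl"]
REQUIRED_PERL_MODULES = ["DBI", "DBD::mysql", "HTML::Entities", "Parse::RecDescent"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def perl_module_available(module, perl="perl"):
    """Return True if `perl -M<module> -e 1` runs without error."""
    if shutil.which(perl) is None:
        return False
    result = subprocess.run(
        [perl, f"-M{module}", "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def missing_perl_modules(modules=REQUIRED_PERL_MODULES, perl="perl"):
    """Return the Perl modules that could not be loaded."""
    return [m for m in modules if not perl_module_available(m, perl)]
```

Calling missing_tools() and missing_perl_modules() on the server returns two lists; if both are empty, the basic prerequisites appear to be in place.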

Download

BNCweb is now freely available under the terms of the GNU General Public License, version 3.

Download: BNCweb-distribution.zip (version 4.0, 23/11/2007)
An updated version of the scripts will be available for download soon.
Update (20.11.2008): Feedback from beta-testers has meant that some changes had to be made. A new round of testing is currently being carried out. Sorry for the delay!

Please note: BNCweb requires a version of the Corpus Workbench that is not yet available via SourceForge. Please use the source code provided as part of the BNCweb distribution. Also, please check this page from time to time for updates to BNCweb.

Errors and inconsistencies in BNC-XML

No corpus of the size of the BNC can be without errors. Many of the problems of the first release of the corpus have in the meantime been corrected. However, a number of issues (e.g. duplicate stretches of text) remain even in the XML version of the corpus. More importantly, a range of new errors were introduced as a result of the automated conversion routines from SGML to XML. This is particularly the case for the spoken component of the corpus, where thousands of tags indicating speaker overlap, unclear passages etc. have been lost. Since this may have an impact on at least some of the linguistic findings that can be made with the corpus, users of BNCweb may wish to look at a detailed list of these errors.

For questions about BNCweb, please write to bncweb@mac.com.


Last updated: 15.10.2008