
    CWB encoder for the British National Corpus (BNC), XML edition
    (C) 2007-2008 by Stefan Evert


This package contains a set of Perl scripts for indexing the British National
Corpus (BNC), XML edition, with the IMS Open Corpus Workbench
[http://cwb.sourceforge.net/]. The encoding procedure preserves as much of the
original annotation as possible. Different versions of the encoding script are
available for interactive use of the corpus, and for use with the BNCweb
interface [http://www.bncweb.info/].


PREREQUISITES

 - a licensed copy of the British National Corpus (BNC), XML edition
   [http://www.natcorp.ox.ac.uk/getting/] (the encoding scripts require access
   to the original XML files, but the resulting CWB-indexed version is
   self-contained and no longer needs them)

 - the IMS Open Corpus Workbench (CWB), version 3.0 [http://cwb.sf.net/]

 - the CWB/Perl interface, version 3.0 [http://cwb.sf.net/]

 - the XSLT processor "xsltproc" [http://xmlsoft.org/XSLT/] (xsltproc is
   shipped with Mac OS X and most Linux distributions, or can easily be
   installed as an add-on package)

 - approximately 5 GB of disk space during installation (the resulting
   CWB-indexed corpus requires less than 2.5 GB)
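
The disk space requirement can be checked before starting. The following is a
minimal sketch, not part of this package: the function names are illustrative,
and "." stands in for the parent of the directory where the CWB index files
will be created.

```shell
# Print the free space, in kilobytes, on the file system holding a directory.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if at least the given number of kilobytes is free there.
has_free_kb() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Roughly 5 GB, the amount needed during installation.
required_kb=$((5 * 1024 * 1024))

# Assumption: replace "." with the parent of your CWB data directory.
target=.
if has_free_kb "$target" "$required_kb"; then
    echo "enough free space below $target"
else
    echo "less than 5 GB free below $target"
fi
```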

Make sure that an appropriate version of the CWB has been installed, and that
the CWB/Perl modules are available in the standard Perl search path (you may
need to set the environment variable PERL5LIB if they have been installed in a
non-standard location). The "xsltproc" program must be available in the
standard search path (type "xsltproc -V" to check); if it is installed in a
non-standard location, you may need to adjust the environment variable PATH.
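
These checks can be bundled into a small shell function. This is only a
sketch: the function names are made up for illustration, and "CWB" is assumed
to be the name of the top-level module installed by the CWB/Perl interface
(adjust it if your installation differs).

```shell
# Succeed if a command is reachable through PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Succeed if Perl can load a module with the current PERL5LIB setting.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Report each prerequisite; return non-zero if anything is missing.
check_prerequisites() {
    status=0
    for cmd in perl xsltproc; do
        if have_command "$cmd"; then
            echo "ok:      $cmd"
        else
            echo "missing: $cmd (adjust PATH?)"
            status=1
        fi
    done
    # Assumption: the CWB/Perl interface provides a module named "CWB".
    if have_perl_module CWB; then
        echo "ok:      Perl module CWB"
    else
        echo "missing: Perl module CWB (set PERL5LIB?)"
        status=1
    fi
    return "$status"
}

check_prerequisites || echo "Please fix the problems listed above."
```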

NB: The encoding script creates a large number of data files at the same time.
On some operating systems, you will need to raise the limit on simultaneously
open files by issuing the command "ulimit -n 512" in the same shell before
starting the encoding process.
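
Issuing "ulimit -n 512" unconditionally would lower the limit on systems where
it is already higher. The sketch below (an illustration only; the helper name
is made up) raises the limit only when the current value is too small. It must
be run in the shell session that later starts the encoder.

```shell
# Number of simultaneously open files suggested above.
required_files=512

# Succeed if a limit (a number or "unlimited") is at least the required value.
limit_is_sufficient() {
    [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

current=$(ulimit -n)
if limit_is_sufficient "$current" "$required_files"; then
    echo "open-file limit is $current, no change needed"
else
    # Raising the soft limit fails if it would exceed the hard limit.
    ulimit -n "$required_files" ||
        echo "could not raise the limit from $current to $required_files"
fi
```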


INDEXING FOR INTERACTIVE USE

Use the script "EncodeBNC.perl" to encode the BNC for interactive use with the
CQP query processor. The basic usage is

    perl EncodeBNC.perl /cwb/data/directory /path/to/bnc/files/

where "/cwb/data/directory" refers to the name of a new directory for the CWB
index files (which will automatically be created by the script), and
"/path/to/bnc/files/" is the full path to the directory tree containing the
XML source files of the BNC (with extension ".xml", optionally compressed as
".xml.gz"). Type

    perldoc EncodeBNC.perl

for information about program options. You should specify at least the
--encoding (-e) and --memory (-M) options when you run the encoder; many users
also find the progress messages printed by --verbose (-v) reassuring.

The encoding procedure will take between 4 and 10 hours, depending on CPU
speed and available memory.
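
Since a run takes several hours, it may be worth confirming first that the
source directory really contains the XML files. The following sketch (the
function name is hypothetical, not part of this package) counts all files with
extension ".xml" or ".xml.gz" below a directory.

```shell
# Count BNC source files (.xml or .xml.gz) anywhere below a directory.
count_bnc_sources() {
    find "$1" -type f \( -name '*.xml' -o -name '*.xml.gz' \) | wc -l | tr -d ' '
}
```

After defining the function, call it on the same path that will be passed to
the encoder; a result of 0 suggests that the path is wrong:

    count_bnc_sources /path/to/bnc/files/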


INDEXING FOR BNCWEB

The scripts in the "BNCweb/" subdirectory encode the BNC for use with the
BNCweb interface. Please follow the instructions given in the BNCweb
distribution. If you need more information about available program options,
type

    perldoc BNCweb/EncodeBNC.perl
    perldoc BNCweb/MakeFreqTables.perl

In particular, you should specify an appropriate value for --memory (-M) and
may wish to see progress messages (--verbose, -v). If you change any of the
other default options, you will probably have to adjust the BNCweb
configuration and the further setup procedure.


CORPUS FORMAT

    ** TODO **

A detailed description of the corpus format for interactive use
("EncodeBNC.perl") will be added here, listing all positional and structural
attributes, as well as tag sets and metadata categories. This description will
also be available as an INFO file within CQP (type "info BNC;"). Some
information can already be found in the internal manpages:

    perldoc lib/BNC/Doc.pm

The CWB corpus generated by the BNCweb encoder ("BNCweb/EncodeBNC.perl") is
intended for exclusive use by the BNCweb interface. Therefore, no user
documentation is made available for this corpus.


COPYRIGHT AND LICENSE

Copyright (C) 2007-2008 by Stefan Evert

This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself, either Perl version 5.8.6 or, at your option,
any later version of Perl 5 you may have available.
