BNCweb Installation Guide (Version 5.0, January 2010)

Written by Sebastian Hoffmann and Stefan Evert.
Last updated: 01/01/2010


Please report errors and problems to: bncweb@mac.com


PREREQUISITES

xsltproc (from the Gnome LibXSLT package)
MySQL 5.1 or higher (http://www.mysql.com)
Perl 5.8 or higher
Apache 2.x Web server

Perl modules:

- DBI
- DBD::mysql
- HTML::Entities

You will also need a working C compiler that is compatible with your version
of Perl.  On Mac OS X, install the XCode tools available from Apple; on Linux
(especially Ubuntu), make sure that the "gcc" package is installed (some other
packages may also be needed).
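
A quick way to see which of these prerequisites are already present is a
small shell check such as the sketch below.  It is only an illustration: the
helper name "have_cmd" is made up for this example, the Apache binary is not
checked because its name differs between systems (e.g. "httpd" or "apache2"),
and a tool that is installed outside your PATH will be reported as missing.

```shell
#!/bin/sh
# Sketch: report which BNCweb prerequisites are visible from this shell.

# have_cmd NAME: succeeds if NAME can be found on the PATH (hypothetical helper)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

for tool in xsltproc mysql perl gcc; do
    if have_cmd "$tool"; then
        echo "found:   $tool"
    else
        echo "MISSING: $tool"
    fi
done

# A Perl module counts as installed if Perl can load it.
for mod in DBI DBD::mysql HTML::Entities; do
    if perl -M"$mod" -e 1 >/dev/null 2>&1; then
        echo "found:   Perl module $mod"
    else
        echo "MISSING: Perl module $mod"
    fi
done
```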


IMPORTANT NOTES

 * BNCweb is NOT compatible with versions of MySQL earlier than 5.1!

 * On Mac OS X 10.5 (Leopard), you need to install a 32-bit version of MySQL
   5.1, as the 64-bit version is not compatible with Apple's Perl interpreter
   and the installation of DBD::mysql will fail in this case.  Mac OS X 10.6
   (Snow Leopard) works both with 32-bit and 64-bit versions of MySQL.

 * The necessary Perl modules can easily be installed with the command-line
   utility "cpan".  If you're using it for the first time, follow the
   configuration instructions on screen.  Then type "install DBI" etc.

 * In order to install Perl modules -- as well as for many of the other
   installation steps -- you will have to log in as an administrative user
   ("root", or possibly a special admin user on a server computer).
   Alternatively, you can run individual commands that lead to permission
   errors with "sudo" (e.g. "sudo cpan" instead of "cpan"). 

 * When installing the DBD::mysql module, the MySQL server must be running
   on your computer, otherwise the tests will fail.  You can safely ignore
   one or two test errors (due to permissions settings of the MySQL test
   account), but do not proceed if there is a substantial number of test
   failures!

 * If you're upgrading an existing BNCweb installation and want to preserve
   user accounts and saved queries, please refer to the "UPGRADING" section
   below.


STEP-BY-STEP INSTALLATION GUIDE

1. Unpack the BNCweb distribution in a directory on your Web server.  Consult
   the documentation of your operating system to locate the appropriate
   directory tree.  On Mac OS X, it is "/Library/WebServer/"; on Linux, it is
   often located in "/var/www/" or a similar directory.

   In the following, we will assume that you are installing BNCweb on Mac OS X
   in the directory "/Library/WebServer/bncweb/".  The file you're reading now
   should be accessible under the name "/Library/WebServer/bncweb/readme.txt".

2. Install the IMS Corpus Workbench and CWB/Perl libraries (CWB and CWB-CL),
   following the instructions on http://cwb.sf.net/.  BNCweb requires
   VERSION 2.2.101 or newer in order to work.  It is recommended that you
   install both the CWB and the Perl modules with standard settings.

   If you're installing the CWB for the first time, you will also have to create
   the registry directory "/usr/local/share/cwb/registry".  It is recommended
   to create a second directory "/usr/local/share/cwb/data" for CWB index
   files, unless you have reserved a separate hard disk / partition for corpus
   data.  In the following, we assume that you have created these directories
   and want to use them -- otherwise, adjust the commands as needed.
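
   Creating the two directories can be sketched as follows.  The helper name
   "make_cwb_dirs" is made up for this example; writing below "/usr/local/"
   normally requires administrator rights.

```shell
# Sketch: create the standard CWB registry and data directories.
# make_cwb_dirs BASE is a hypothetical helper; BASE is the CWB base directory.
make_cwb_dirs() {
    mkdir -p "$1/registry" "$1/data"
}

# Typical call (run as root or prefix the commands with sudo):
#   make_cwb_dirs /usr/local/share/cwb
```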
   
   You may also want to install one of the CWB example corpora and check that
   you can access it with CQP (see the CQP Query Tutorial for usage
   information and a few example queries).

3. Copy the original texts of the British National Corpus (XML Edition) from
   the "Texts/" directory of the DVD.  You do not need to install anything
   else (index, documentation, etc.) unless you also wish to access the BNC
   with Xaira or similar corpus tools (the CWB requires a separate indexing
   process).  In the following, we assume that you have put these files in the
   directory "/usr/local/share/corpora/BNC/Texts/".

4. Change to the directory "BNC_encoder/" and run the scripts
   "BNCweb/EncodeBNC.perl" and "BNCweb/MakeFreqTables.perl" (in this order),
   in order to index the BNC for use with CQP and to generate several
   frequency tables needed by BNCweb.

   IMPORTANT NOTE: The CWB indexer writes a large number of disk files, which
   may exceed limits set by your operating system (notably Mac OS X).  In this
   case, type "ulimit -n 512" before running "BNCweb/EncodeBNC.perl".  (You
   can check the default limit with "ulimit -a"; the entry for "open files"
   should be at least 400.)
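
   The check described above could be scripted roughly like this (a sketch
   only).  The new limit applies to the current shell and to programs started
   from it, so the encoding script has to be run from the same shell.

```shell
# Sketch: raise the open-files limit for the current shell if it is below 512.
limit=$(ulimit -n)
if [ "$limit" != "unlimited" ] && [ "$limit" -lt 512 ]; then
    # This can fail if the hard limit set by the system is lower than 512.
    ulimit -n 512 || echo "could not raise the open-files limit" >&2
fi
echo "open files limit is now: $(ulimit -n)"
```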
   
   In the simplest case, you can run the scripts with the following commands
   (assuming that you're using the standard CWB directories as explained
   above):

       perl BNCweb/EncodeBNC.perl -v /usr/local/share/cwb/data/BNC/ /usr/local/share/corpora/BNC/Texts/

       perl BNCweb/MakeFreqTables.perl -v

   Here is what this step could look like if you need some more specific
   choices: 

       perl BNCweb/EncodeBNC.perl -n BNC-XML -t /Corpora/Tables/bncweb -M 1000 -f -v /Corpora/Data/BNCweb/ /Corpora/Sources/BNC/Texts/

       perl BNCweb/MakeFreqTables.perl -n BNC-XML -t /Corpora/Tables/bncweb -M 1000 -f -v

   In this example, the following changes have been made:

       - the corpus will be named "BNC-XML" rather than "BNCWEB" in CQP (-n)

       - the frequency tables will be named "bncweb_*" and stored in directory
         "/Corpora/Tables/" (-t)

       - the CWB indexer uses about 1 GB of RAM to speed up indexing (-M)

       - if the data directory, registry entry, etc. already exist, they will
         be overwritten without warning (-f)

       - the corpus data files will be stored in (a subdirectory of)
         "/Corpora/Data/BNCweb/", but the registry entry will still be written
         to the standard directory "/usr/local/share/cwb/registry/"

       - the original BNC data are kept in directory "/Corpora/Sources/BNC/"
   
   Please consult "perldoc BNCweb/EncodeBNC.perl" and "perldoc
   BNCweb/MakeFreqTables.perl" for further information about the available
   options.

   Now fetch a cup of coffee: it will take several hours for the script
   "EncodeBNC.perl" to complete, because it has to perform a complex analysis
   and transformation of the BNC annotation.

   After encoding, return to the root directory of the BNCweb distribution.

5. Create a data storage directory for BNCweb, which will be used for saved
   queries as well as for temporary and cache files.  A recommended standard
   location is "/usr/local/share/bncweb/" (assumed in the following steps),
   but keep in mind that this directory may need to hold several gigabytes of
   cached query results.

   It is important that both the Web server and MySQL have read/write access
   to this directory.  An easy, but somewhat dangerous solution is to make the
   data directory world-writable:

       chmod 777 /usr/local/share/bncweb/

   A better solution is to create a new user group, say "bncweb", to which
   both the Web server and MySQL belong.  It is then sufficient to give full
   access to group members, while shutting out all other users:

       chgrp bncweb /usr/local/share/bncweb/
       chmod 770 /usr/local/share/bncweb/
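
   The group has to exist, and the accounts under which the Web server and
   MySQL run have to be members of it, before these commands have the desired
   effect.  The following sketch shows one possible way to do this.  The
   helper name "prepare_data_dir" is made up, the "groupadd" and "usermod"
   commands are the usual Linux tools (Mac OS X manages groups differently),
   and the account names "www-data" and "mysql" are assumptions that vary
   between systems.

```shell
# Sketch: prepare the BNCweb data directory for a shared group.
# On many Linux systems the group could be set up first like this (as root);
# the account names "www-data" and "mysql" are assumptions, check your system:
#   groupadd bncweb
#   usermod -a -G bncweb www-data
#   usermod -a -G bncweb mysql

# prepare_data_dir DIR GROUP (hypothetical helper): create DIR, hand it to
# GROUP and allow full access for the owner and group members only.
prepare_data_dir() {
    mkdir -p "$1" &&
    chgrp "$2" "$1" &&
    chmod 770 "$1" &&
    ls -ld "$1"
}

# Typical call (as root):
#   prepare_data_dir /usr/local/share/bncweb bncweb
```

   Running services usually pick up a new group membership only after they
   have been restarted.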

   If you want to run the MySQL server on a different computer (e.g. if your
   university department or computing centre has a dedicated database server),
   the data directory has to be shared between the two computers, e.g. by
   mounting it as an NFS network drive on the database server.  You still have
   to ensure that both the Web server and the MySQL server have read/write
   access, which can be tricky (because the two computers may have different
   users and groups).  If you cannot mount the data directory under the same
   name on both computers, set the variable $bwMySQLTempPath (from the
   perspective of MySQL) differently from $bwTempPath (from the perspective of
   the Web server) when you configure BNCweb in the following step.

6. Edit the BNCweb configuration file "lib_files/bncConfigXML.pm" according to
   instructions in the inline comments.  Make sure that you set at least the
   following variables correctly:

       $bwServer         IP address of the Web server on which BNCweb runs
       $bwWebServer      public URL of this Web server
       $bwCGIalias       relative URL of main BNCweb interface
       $bwHTMLalias      relative URL of static BNC data files

   For instance, if you set $bwWebServer to "http://bncweb.lancs.ac.uk/" and
   $bwCGIalias to "cgi-binbncXML", you will later be able to start up BNCweb
   at the URL "http://bncweb.lancs.ac.uk/cgi-binbncXML/BNCweb.pl".  You are
   free in your choice of relative URLs here: they will later be inserted into
   the Web server configuration.

       $bwTempPath       BNCweb data directory ("/usr/local/share/bncweb/" above)
       $bwCorpus         CWB name of indexed BNC corpus (default: "BNCWEB")

   You need to change $bwMYSQLhost only if MySQL runs on a different
   computer.  The following configuration variables specify names for BNCweb's
   databases and only need to be changed if there is a collision or naming
   inconsistency with other databases.

       $bwMysqlUser      username for MySQL login (recommended: bncweb)
       $bwMysqlPwd       password of this user

   $bwMysqlUser specifies the account which BNCweb uses to log in to the MySQL
   database server.  If this is an existing account, enter its password here.
   Otherwise, the account will automatically be created by the setup script in
   the following step.  In this case, set $bwMysqlPwd to a "random" string
   that cannot easily be guessed by unauthorised persons.

       $bwSuperuser      master BNCweb account (for BNCweb administrator)
   
   The following options specify various default limits so that users cannot
   put excessive load on the server.  These limits can be overridden for
   individual users by the BNCweb administrator (via the admin Web interface).
   Finally, $bwFileSize, $bwMaxMySQLSize and $bwMaxFreqlistMySQL determine the
   amount of disk space BNCweb uses to cache query results and frequency
   tables.  For best performance, make these as large as you can afford -- the
   public BNCweb server at Lancaster uses several gigabytes of disk space for
   each cache.
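
   After editing, it can be useful to review the most important settings at a
   glance.  The sketch below assumes that the configuration file contains
   plain assignments of the form "$bwCGIalias = '...';"; the helper name
   "show_bncweb_settings" is made up for this example.

```shell
# Sketch: list the main BNCweb settings from the configuration file.
# show_bncweb_settings FILE is a hypothetical helper; $bwMysqlPwd is
# deliberately left out so that the password is not printed.
show_bncweb_settings() {
    grep -E '\$bw(Server|WebServer|CGIalias|HTMLalias|TempPath|Corpus|MysqlUser|Superuser)[[:space:]]*=' "$1"
}

# Typical call from the root directory of the BNCweb distribution:
#   show_bncweb_settings lib_files/bncConfigXML.pm
```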
       
7. Run the script "setup/make_MySQL_tables.pl" to set up the databases used by
   BNCweb, configure the MySQL account $bwMysqlUser, and import the metadata
   and frequency tables created in Step 4.

       perl setup/make_MySQL_tables.pl

   You will be asked to enter the username and password of a MySQL user with
   administrator privileges ("CREATE DATABASE..." and "GRANT ALL/FILE..."), so
   that the MySQL account and database tables can be configured automatically.
   You will also be asked for the location and prefix of the metadata and
   frequency tables created during the BNC indexing process.  Unless you have
   used the -t option in Step 4, you can just accept the default setting.

   If you do not have admin access to the MySQL server, please ask your
   sysadmin to set up a MySQL account $bwMysqlUser with password $bwMysqlPwd
   (or insert the information for an existing account in the BNCweb
   configuration file), and then execute the SQL commands below.  The
   variables in these commands (such as $bwMYSQLtable) must be replaced by the
   corresponding settings in "lib_files/bncConfigXML.pm".

       DROP DATABASE if exists $bwMYSQLtable;
       DROP DATABASE if exists $bwMYSQLusertable;
       DROP DATABASE if exists $bwMYSQLcategorizetable;
       DROP DATABASE if exists $bwMYSQLfrequency;
       
       CREATE DATABASE $bwMYSQLtable CHARACTER SET latin1 COLLATE latin1_general_ci;
       CREATE DATABASE $bwMYSQLusertable CHARACTER SET latin1 COLLATE latin1_general_ci;
       CREATE DATABASE $bwMYSQLcategorizetable CHARACTER SET latin1 COLLATE latin1_general_ci;
       CREATE DATABASE $bwMYSQLfrequency CHARACTER SET latin1 COLLATE latin1_general_ci;
       
       GRANT ALL ON $bwMYSQLtable.* TO '$bwMysqlUser'@'$bwServer' IDENTIFIED BY '$bwMysqlPwd';
       GRANT ALL ON $bwMYSQLusertable.* TO '$bwMysqlUser'@'$bwServer' IDENTIFIED BY '$bwMysqlPwd';
       GRANT ALL ON $bwMYSQLcategorizetable.* TO '$bwMysqlUser'@'$bwServer' IDENTIFIED BY '$bwMysqlPwd';
       GRANT ALL ON $bwMYSQLfrequency.* TO '$bwMysqlUser'@'$bwServer' IDENTIFIED BY '$bwMysqlPwd';
       GRANT FILE ON *.* TO '$bwMysqlUser'@'$bwServer' IDENTIFIED BY '$bwMysqlPwd';
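
   Replacing the variables by hand is error-prone, so the statements above
   can also be generated with a short shell script.  The following is a
   sketch only: all values are hypothetical placeholders that have to be
   replaced by the settings from "lib_files/bncConfigXML.pm" (DB_HOST
   corresponds to $bwServer), and the helper name "gen_bncweb_sql" is made
   up.  A password containing a single quote would need extra escaping.

```shell
# Sketch: print the SQL setup statements with the variables filled in.
# All values below are hypothetical placeholders; copy the real ones from
# lib_files/bncConfigXML.pm.
DBS="bnc_example bnc_users_example bnc_categorize_example bnc_frequency_example"
DB_USER="bncweb"
DB_HOST="localhost"
DB_PWD="change-me"

gen_bncweb_sql() {
    for db in $DBS; do
        printf 'DROP DATABASE IF EXISTS %s;\n' "$db"
        printf 'CREATE DATABASE %s CHARACTER SET latin1 COLLATE latin1_general_ci;\n' "$db"
        printf "GRANT ALL ON %s.* TO '%s'@'%s' IDENTIFIED BY '%s';\n" \
            "$db" "$DB_USER" "$DB_HOST" "$DB_PWD"
    done
    printf "GRANT FILE ON *.* TO '%s'@'%s' IDENTIFIED BY '%s';\n" \
        "$DB_USER" "$DB_HOST" "$DB_PWD"
}

# Print the statements for review ...
gen_bncweb_sql
# ... and, once they look right, a MySQL administrator can execute them:
#   gen_bncweb_sql | mysql -u root -p
```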

   After this has been done, run the script "setup/make_MySQL_tables.pl" again.

8. Configure your Apache Web server by editing the file "httpd.conf".  You can
   usually find this file in "/etc/apache2/httpd.conf", but your operating
   system may also have put it in a different location.  If there is a
   subdirectory "/etc/apache2/other/" (e.g. on Mac OS X), it should be
   sufficient to create a new file "bncweb.conf" there and insert the
   configuration settings below (so that you do not have to fiddle with the
   master "httpd.conf" file).

   On Mac OS X with the default directories and configuration settings, the 
   Apache configuration for BNCweb might look as follows:

       Alias /bncweb/ /Library/WebServer/bncweb/
       Alias /bncweb /Library/WebServer/bncweb/
       ScriptAlias /bncweb-cgi/ /Library/WebServer/bncweb/cgi-bin/
       
       <Directory /Library/WebServer/bncweb>
               Options Indexes FollowSymLinks ExecCGI
               Order deny,allow
               Allow from all
               AuthType Basic
               AuthName bncweb
               AuthUserFile /etc/bncpass
               require valid-user
       </Directory>

   The relative URLs in the "Alias" lines must be the same as given in the
   configuration variable $bwHTMLalias, and the one in "ScriptAlias" must be
   the same as $bwCGIalias.  The directory in these lines
   ("/Library/WebServer/bncweb/" above) is the root directory of your BNCweb
   installation on the Web server.

   You will also have to create a file with usernames and passwords for all
   users of your BNCweb server under the name "/etc/bncpass" (or change the
   filename in the configuration example above if you want to put it in a
   different location).  Use the "htpasswd" tool, which should be installed on
   your computer, to generate entries in an appropriate format.  IMPORTANT:
   you must at least insert an entry for $bwSuperuser in this file!
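
   "htpasswd" needs the option "-c" to create a new password file and must be
   called without it to add a user to an existing file (otherwise the file is
   overwritten).  The sketch below wraps this distinction in a small helper;
   the name "add_bncweb_user" and the user name "admin" are made up for this
   example.  "htpasswd" prompts for the new password interactively.

```shell
# Sketch: add a user to the BNCweb password file, creating the file if needed.
# add_bncweb_user FILE USER is a hypothetical helper.
add_bncweb_user() {
    if [ -e "$1" ]; then
        htpasswd "$1" "$2"       # file exists: add or update the entry
    else
        htpasswd -c "$1" "$2"    # file is missing: create it first
    fi
}

# Typical call (as root); replace "admin" by the value of $bwSuperuser:
#   add_bncweb_user /etc/bncpass admin
```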

   *****************************************************************************
   It is important that the Web server setup for your installation of BNCweb
   requires user authentication (as our sample configuration for Apache
   does).  Apart from licensing issues if your server is publicly accessible
   via the Internet, this is also necessary for a stand-alone BNCweb server on
   a private computer.  BNCweb expects you to be logged in with a username and
   some of its functionality may not work properly without authentication.
   *****************************************************************************

   If you use a non-standard registry directory (say, "/Corpora/Registry/")
   and/or Perl modules (e.g. the Perl/CWB libraries) installed in a
   non-standard location (say, "/Corpora/Perl5Lib/"), you will have to add the
   following lines to the configuration:

       <Directory /Library/WebServer/bncweb>
               SetEnv PERL5LIB /Corpora/Perl5Lib
               SetEnv CORPUS_REGISTRY /Corpora/Registry
       </Directory>

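   In order to be able to set environment variables with "SetEnv", you may
   have to uncomment the following lines in your Apache configuration (the
   "AddModule" line exists only in Apache 1.3 configurations; Apache 2.x
   needs just the "LoadModule" line, with a module path that depends on your
   system):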

       LoadModule env_module libexec/httpd/mod_env.so
       AddModule mod_env.c

   The BNCweb directory on the Web server MUST contain all items of the BNCweb
   distribution except for the directories BNC_encoder/ and setup/ as well as
   the file "readme.txt".

   When the configuration is complete, you must RESTART your Web server in
   order for the changes to take effect.

9. Unless you require them for a different purpose, you can now delete the
   original BNC source files.  You can also delete the temporary files in
   "BNC_encoder/tables" (or the entire "BNC_encoder/" subdirectory) once the
   installation has been successfully tested.  Keep in mind that these files
   can only be re-created by repeating the entire time-consuming indexing
   process, so you may want to keep a copy of the "tables/" directory.

You can now access BNCweb at http://your.server.name/cgi-alias/BNCweb.pl


UPGRADING

If you want to upgrade an existing installation of BNCweb (CQP edition), the
installation process is somewhat different from the step-by-step guide above.
Please follow the steps below.  AN UPGRADE IS NOT POSSIBLE FROM VERSIONS
BEFORE THE FIRST PUBLIC RELEASE OF BNCweb (CQP) IN NOVEMBER 2007 (V4.0)!

U1. Make sure that all prerequisites are installed.  In particular, you may
    need to upgrade MySQL to version 5.1 or newer.  In this case, make sure to
    copy all BNCweb databases and set up the BNCweb user with the same
    password and access permissions (consult the MySQL documentation or your
    local sysadmin if you need help with this).

U2. Install the latest version of the IMS Corpus Workbench and CWB/Perl
    interface, as described in Step 2 above.  Make sure that CQP can locate
    the CWB-indexed version of the BNC that your current BNCweb installation 
    is using.  If you encounter any difficulties with registry paths, a simple
    solution is to create the directory "/usr/local/share/cwb/registry" and
    put symbolic links to the correct registry entries (for the BNC and the
    two frequency database corpora used by BNCweb) there.

U3. Make a backup of your current BNCweb installation, then replace it
    completely with the new distribution.  If you did not change the standard
    directory layout of BNCweb, this should be very easy to accomplish, and no
    changes to the Apache Web server configuration should be necessary.

U4. Edit the configuration file "lib_files/bncConfigXML.pm" so that it has the
    same settings as your previous BNCweb installation.  See Step 6 above for
    additional comments and instructions.

U5. Run the script "setup/upgrade_MySQL_tables.pl" in order to update the
    existing BNCweb database tables for compatibility with the current
    release.  IF POSSIBLE, MAKE A BACKUP OF THE DATABASE BEFORE THE UPGRADE.

        perl setup/upgrade_MySQL_tables.pl

    If you want to make sure that the automatic upgrade works correctly, you
    can send us information about your current database by running the script

        perl setup/show_database_info.pl

    This script WILL NOT MAKE ANY CHANGES, so there is no need to worry about
    possible damage.  Send us the output of this script as a text file, which
    allows us to ensure that the format of your database looks as expected by
    the upgrade script.


TROUBLESHOOTING

If you see a server error message when trying to access your new BNCweb
installation, check that file permissions in the "cgi-bin/" subdirectory are
set in such a way that all script files can be read and executed by the Web
server.  The files in "lib_files/" also have to be readable by the Web server.
Make sure that all paths and directories in the Web server configuration are
correct, and remember to restart the Web server after any changes to the
configuration.  
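
The permission check can be sketched with "find".  The example assumes that
the Web server accesses the files as an "other" user (neither owner nor group
member), so every script needs the read and execute bits for others; the
helper name "list_restricted_scripts" is made up for this example.

```shell
# Sketch: list CGI scripts that "other" users cannot both read and execute.
# list_restricted_scripts DIR is a hypothetical helper.
list_restricted_scripts() {
    find "$1" -type f -name '*.pl' ! -perm -005
}

# Typical call from the BNCweb directory on the Web server:
#   list_restricted_scripts cgi-bin
# Files reported here could then be opened up with: chmod 755 FILE
```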

Check the Web server's error log for additional information about errors from
BNCweb scripts, especially if you only get a server error message and no
meaningful report in the Web interface.  The Apache log files are typically
found in "/var/log/apache2/error_log", "/var/log/httpd/error_log", or a
similar location.

The BNCweb scripts assume that the Perl interpreter is located in
"/usr/bin/perl".  If this is not the case, you will have to change the first
line of each CGI script (starting in #!) to reflect your local setup
(e.g. #!/usr/local/bin/perl or #!/sw/bin/perl).
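
To see which scripts are affected, the first line of each CGI script can be
compared with the location of the local Perl interpreter, for instance with
the sketch below (the helper name "list_shebangs" is made up for this
example).

```shell
# Sketch: show the first line of every CGI script and the local Perl path.
# list_shebangs DIR is a hypothetical helper; DIR is the script directory.
list_shebangs() {
    for script in "$1"/*.pl; do
        [ -f "$script" ] || continue
        printf '%s: %s\n' "$script" "$(head -n 1 "$script")"
    done
}

echo "perl found at: $(command -v perl || echo 'not on PATH')"
# Typical call from the BNCweb directory on the Web server:
#   list_shebangs cgi-bin
```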

If you see an error message saying that your corpus is "not defined", it is
likely that the path to the corpus registry is not correctly set.  If your
user account has a shell variable $CORPUS_REGISTRY (type "echo
$CORPUS_REGISTRY" to find out), the Web server will need the same path setting
in order for BNCweb to work correctly.  Step 8 above explains how to set
$CORPUS_REGISTRY in the Apache configuration.  Another easy solution is to
create the standard registry directory "/usr/local/share/cwb/registry" if it
does not exist already and put symbolic links to the correct registry entries
there.  If you have not changed the default settings during BNC encoding, the
registry files will be called "bncweb", "bncweb_freq_text" and
"bncweb_freq_spkr".
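
Setting up the symbolic links can be sketched as follows.  The helper name
"link_registry" and the source directory "/Corpora/Registry" are only
examples; the source directory should be given as an absolute path, since a
relative link target would not resolve correctly.

```shell
# Sketch: link BNCweb's registry entries into the standard registry directory.
# link_registry FROM TO is a hypothetical helper; the three file names are
# the defaults used during BNC encoding.
link_registry() {
    mkdir -p "$2" || return 1
    for entry in bncweb bncweb_freq_text bncweb_freq_spkr; do
        if [ -f "$1/$entry" ]; then
            ln -s "$1/$entry" "$2/$entry"
        else
            echo "no registry entry found: $1/$entry" >&2
        fi
    done
}

# Typical call (as root), with /Corpora/Registry as an example source:
#   link_registry /Corpora/Registry /usr/local/share/cwb/registry
```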

If you see an error message saying that MySQL cannot execute an SQL command
containing the string "INTO OUTFILE", check that the data storage directory
from Step 5 has the correct read and write permissions (it must be readable
and writable by the MySQL server).  Also check that $bwMySQLTempPath is set
correctly in the BNCweb configuration (usually, it should be the same as
$bwTempPath).


Some notes on using BNCweb with Web servers other than Apache:

 - Read the Web server documentation to find out about the location of
   configuration files, their syntax and the available options

 - If you cannot set the required aliases, distribute the BNCweb files into
   the HTML and CGI directory trees as follows:

    - put "Simple_query_language.pdf", "genres.html", "wz_tooltip.js", and
      "FileMaker_template.zip" in a subdirectory "bncweb/" of your HTML
      document tree (typically a directory named "Documents/", "http/",
      "html/", "htdocs/" or similar found in your Web server's data directory)

    - copy the entire directory "cgi-bin/" of the BNCweb distribution to a
      subdirectory "bncweb/" of your CGI script tree (typically a directory
      named "CGI-Executables/" or "cgi-bin/" found in your Web server's data
      directory)

    - copy the entire directory "lib_files/" to your CGI script tree, i.e.
      it must be a sister directory of the "bncweb/" directory created in the
      previous step (and must be named "lib_files/")

 - In "lib_files/bncConfigXML.pm", make the following settings:
   $bwCGIalias = 'cgi-bin/bncweb'; and $bwHTMLalias = 'bncweb';

    - if you cannot put the "lib_files/" directory in the CGI script tree as
      required above, set the PERL5LIB environment variable in the Web server
      configuration accordingly; consult your Web server's documentation on
      how to set environment variables (e.g., with LightTPD you will have to
      load "mod_setenv" first)

    - on LightTPD, PERL5LIB can't be set because of an internal bug; in this
      case, use PERLLIB instead

    - if it turns out to be absolutely impossible to set environment
      variables, you can also edit _every_ CGI script (named "*.pl")
      distributed with BNCweb and set the correct directory in the line
      use lib "..."; if you need multiple directories, you can duplicate
      this line

 - If you encounter any problems, check your Web server's documentation and FAQ
   carefully.  Googling also helps and often brings up complete recipes.

 - Depending on your CWB installation, you may also have to set the
   CORPUS_REGISTRY environment variable

    - for a well-configured default installation, this shouldn't be necessary
      at all; see previous comments on how to put symbolic links to corpus
      registry files in the standard registry directory
      "/usr/local/share/cwb/registry"; also check that corpus indexing and
      installation of BNCweb was carried out correctly before you spend a lot
      of time trying to set the CORPUS_REGISTRY variable

 - A general remark: even if a Web server release is marked as "stable", it
   doesn't really have to be and may crash unexpectedly.  In that case, try
   upgrading to the latest version (even if this is earmarked as "unstable")
   ... and vice versa, of course.  (This advice is based on first-hand
   experience with LightTPD on Debian Linux.)

 - It is also crucial that you find out how to force users to authenticate
   themselves before accessing BNCweb scripts, as BNCweb needs the login data
   for its user management.  ALL Web servers support authentication, which can
   be activated for individual directories.  Just read the documentation until
   you find the required information.

    - authentication has to be required for the CGI script directory of your
      BNCweb installation; if you followed the recommendations above, this
      will be a subdirectory "bncweb/" of your CGI script tree
