Statistics in corpus-based sociolinguistics: A practical workshop

18th August 2014, Linguistics Summer School in Dacice

Workshop convenor: Vaclav Brezina, ESRC Centre for Corpus Approaches to Social Science, Lancaster University


  • Step-by-step introduction to the most important statistical techniques
  • New free software tools
  • Hands-on exercises

This workshop will discuss the statistical procedures available for analysing sociolinguistic data in large language corpora. I will demonstrate that the traditional approach of applying the log-likelihood statistic to aggregated data is unreliable in principle, because aggregated counts hide the variation between individual speakers. The workshop will instead suggest alternative methodologies and statistical procedures that take within-group differences into account and therefore produce more meaningful results.

As part of the workshop, a new research tool, BNC64 Search & Compare, will be introduced. BNC64 Search & Compare can carry out detailed analyses based on BNC64, a socially balanced spoken corpus of 1.5 million words. BNC64 represents the speech of 64 speakers (32 men and 32 women) extracted from the British National Corpus (BNC). BNC64 Search & Compare is a web-based environment that creates simple visualisations, calculates statistics and produces concordances. The website was created to make it easy to visualise complex corpus data and to test a range of sociolinguistic hypotheses.

The workshop will be structured around a series of practical exercises that guide participants through different types of corpus analysis and statistical procedures.

Statistics covered: log-likelihood, Mann-Whitney U test, Spearman's rank correlation, confidence intervals, robust mean difference, robust Cohen's d

The workshop does not require any prior knowledge of statistics. It will be of interest to anyone who wants to explore sociolinguistic data using language corpora.


Workshop materials


BNC64 Search & Compare


Stats tools

Calculator: manual calculations  
Mean, trimmed mean & robust mean difference: simple comparisons  
Robust Cohen's d: effect size  
Mann-Whitney U test: non-parametric test  
Log-likelihood: general corpus comparison  
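
For readers who want to see what the simple comparisons and the non-parametric test listed above involve, here is a minimal sketch in plain Python. It assumes a 20% trimmed mean and the textbook rank-sum form of the Mann-Whitney U statistic; the online tools may use different settings. The per-speaker frequencies are invented, and all function names are hypothetical.

```python
def trimmed_mean(values, proportion=0.2):
    """Mean after dropping the given proportion of values from each end."""
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """U statistic (the smaller of the two U values) for two independent groups."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)

# Hypothetical per-speaker frequencies per 10,000 tokens; NOT real BNC64 data
men = [4.0, 5.0, 5.0, 6.0, 5.0]
women = [5.0, 4.0, 6.0, 5.0, 65.0]

mean_diff = sum(women) / len(women) - sum(men) / len(men)
robust_diff = trimmed_mean(women) - trimmed_mean(men)
u = mann_whitney_u(men, women)
print(f"mean difference:        {mean_diff:.2f}")
print(f"robust mean difference: {robust_diff:.2f}")
print(f"Mann-Whitney U:         {u}")
```

With these invented figures the ordinary mean difference is inflated by one outlying speaker, while the trimmed means differ only slightly and U stays close to the value of 12.5 expected when the two groups do not differ (half of 5 × 5).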



(c) Vaclav Brezina 2017