Workshop convenor: Vaclav Brezina, ESRC Centre for Corpus Approaches to Social Science, Lancaster University


This workshop will discuss the statistical procedures available for analysing sociolinguistic data in large language corpora. I will demonstrate that the traditional approach of applying the log-likelihood statistic to aggregated data is unreliable in principle. The workshop will instead suggest alternative methodologies and statistical procedures that take within-group differences into account and therefore produce more meaningful results.

As part of the workshop, a new research tool, BNC64 Search & Compare, will be introduced. It carries out detailed analyses based on BNC64, a socially balanced spoken corpus of 1.5 million words. BNC64 represents the speech of 64 speakers (32 men and 32 women) extracted from the British National Corpus (BNC). BNC64 Search & Compare is a web-based environment that creates simple visualisations, calculates statistics and produces concordances. The website was created to make it easy to visualise complex corpus data and to test a range of sociolinguistic hypotheses.

The workshop will be structured around a series of practical exercises that guide participants through different types of corpus analysis and statistical procedures. The following areas will be covered:
The workshop does not require any prior knowledge of statistics. It will be of interest to anyone who wants to explore sociolinguistic data using language corpora.
Workshop materials
BNC64 Search & Compare
Stats tools
Calculator: manual calculations  
Mean, trimmed mean & robust mean difference: simple comparisons  
Robust Cohen's d: effect size  
Mann-Whitney U test: non-parametric test  
Log-likelihood: general corpus comparison
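
The contrast between log-likelihood on aggregated counts and a per-speaker test such as Mann-Whitney U can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used by BNC64 Search & Compare: the function names are hypothetical, and the speaker counts are invented toy values (four speakers per group, each assumed to contribute 10,000 tokens) chosen so that one speaker dominates the group total.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood (G2) for one word's aggregated frequency in two corpora."""
    total_freq = freq1 + freq2
    total_size = size1 + size2
    # Expected frequencies if the word were equally common in both corpora.
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def mann_whitney_u(sample1, sample2):
    """Mann-Whitney U statistic (the smaller of U1 and U2); ties count 0.5."""
    u1 = 0.0
    for x in sample1:
        for y in sample2:
            if x > y:
                u1 += 1.0
            elif x == y:
                u1 += 0.5
    u2 = len(sample1) * len(sample2) - u1
    return min(u1, u2)


# Invented toy data: occurrences of one word per speaker.
# Each speaker is assumed to contribute 10,000 tokens.
group_a = [2, 3, 1, 94]    # one speaker accounts for almost all occurrences
group_b = [10, 12, 9, 11]
tokens_per_speaker = 10_000

# Aggregated view: pool all speakers in each group, then compare totals.
ll = log_likelihood(
    sum(group_a), tokens_per_speaker * len(group_a),
    sum(group_b), tokens_per_speaker * len(group_b),
)

# Per-speaker view: compare the distributions of individual speakers.
u = mann_whitney_u(group_a, group_b)

print(f"Log-likelihood on aggregated counts: {ll:.2f}")
print(f"Mann-Whitney U on per-speaker counts: {u}")
```

With these toy values the aggregated log-likelihood lies well above 3.84, the conventional cut-off for p < 0.05 with one degree of freedom, which would suggest that group A uses the word more. The per-speaker view tells a different story: three of the four speakers in group A use the word less often than every speaker in group B, and the group total is driven by a single individual.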