Statistics in corpus-based sociolinguistics: A practical workshop

18th August 2014, Linguistics Summer School in Dacice

Workshop convenor: Vaclav Brezina, ESRC Centre for Corpus Approaches to Social Science, Lancaster University

Step-by-step introduction to the most important statistical techniques
New free software tools
Hands-on exercises

This workshop will discuss the statistical procedures available for analysing sociolinguistic data in large language corpora. I will demonstrate that the traditional approach of using aggregated data with the log-likelihood statistic is in principle unreliable. Instead, the workshop will suggest alternative methodologies and statistical procedures that take within-group differences into account and therefore produce more meaningful results.

As part of the workshop, a new research tool, BNC64 Search & Compare, will be introduced. It carries out detailed analyses based on BNC64, a socially balanced spoken corpus of 1.5 million words. BNC64 represents the speech of 64 speakers (32 men and 32 women) extracted from the British National Corpus (BNC). BNC64 Search & Compare is a web-based environment that creates simple visualisations, calculates statistics and produces concordances. It was created to make it easy to visualise complex corpus data and to test a range of sociolinguistic hypotheses.

The workshop will be structured around a series of practical exercises guiding participants through different types of corpus analysis and statistical procedures. The following areas will be covered:

Sociolinguistic data in language corpora
Descriptive and inferential statistics
Individual and social variation
The null-hypothesis testing paradigm and the "new" statistics

Statistics covered: log-likelihood, Mann-Whitney U test, Spearman's rank correlation, confidence intervals, robust mean difference, robust Cohen's d
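
To make the contrast between aggregated and per-speaker analysis concrete, here is a minimal Python sketch. It is not the workshop's own code or the BNC64 tool. The per-speaker counts are invented for illustration, every speaker is assumed to contribute 10,000 tokens, and the function names are hypothetical. The log-likelihood function follows the usual two-corpus formula based on observed and expected frequencies. The Mann-Whitney p-value uses a plain normal approximation without tie or continuity correction, so it is only a rough guide.

```python
import math

def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood (G2) for one word, computed from aggregated counts."""
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

def average_ranks(values):
    """Rank values from 1 upwards; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def mann_whitney_u(group1, group2):
    """Return (U, approximate two-sided p) for two independent groups."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Normal approximation, no tie or continuity correction (rough only).
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Invented data: occurrences of one word per speaker, 8 speakers per group.
TOKENS_PER_SPEAKER = 10_000  # assumption: equal-sized speaker samples
men_counts = [2, 3, 1, 2, 60, 3, 2, 1]   # one speaker uses the word heavily
women_counts = [3, 2, 4, 3, 2, 3, 4, 3]

# Aggregated view: pool all speakers of a group into one subcorpus.
ll = log_likelihood(
    sum(men_counts), TOKENS_PER_SPEAKER * len(men_counts),
    sum(women_counts), TOKENS_PER_SPEAKER * len(women_counts),
)

# Per-speaker view: compare relative frequencies (per 10,000 tokens).
men_rel = [c / TOKENS_PER_SPEAKER * 10_000 for c in men_counts]
women_rel = [c / TOKENS_PER_SPEAKER * 10_000 for c in women_counts]
u, p = mann_whitney_u(men_rel, women_rel)

print(f"Log-likelihood on aggregated data: {ll:.2f}")
print(f"Mann-Whitney U on per-speaker data: U = {u}, approximate p = {p:.3f}")
```

With these invented numbers, the log-likelihood value lies far above 3.84, the usual cut-off for p < 0.05, while the Mann-Whitney U test on per-speaker frequencies finds no significant difference. The apparent group effect comes from a single speaker, which is the kind of within-group variation that aggregated counts hide.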

The workshop does not require any prior knowledge of statistics. It will be of interest to anyone who wants to explore sociolinguistic data using language corpora.

Workshop materials

BNC64 Search & Compare

BNC64 search tool

Stats tools

Calculator: manual calculations

Mean, trimmed mean & robust mean difference: simple comparisons

Robust Cohen's d: effect size

Mann-Whitney U test: non-parametric test

Log-likelihood: general corpus comparison
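
As a rough illustration of the robust measures listed above, the following Python sketch computes a trimmed mean, a robust mean difference and a robust Cohen's d. It is not the code behind the workshop tools. It assumes 20% trimming, takes the robust mean difference to be the difference between the two trimmed means, and uses the common robust effect-size formulation that scales this difference by the pooled Winsorized standard deviation and the constant 0.642. The per-speaker frequencies are invented, and all function names are hypothetical.

```python
import math
import statistics

TRIM = 0.2  # assumption: 20% trimmed from each tail

def trimmed_mean(values, trim=TRIM):
    """Mean after dropping the lowest and highest `trim` share of values."""
    data = sorted(values)
    g = math.floor(trim * len(data))  # number of values cut from each end
    return statistics.mean(data[g:len(data) - g])

def winsorized_variance(values, trim=TRIM):
    """Sample variance after replacing the tails with the nearest kept value."""
    data = sorted(values)
    n = len(data)
    g = math.floor(trim * n)
    winsorized = [data[g]] * g + data[g:n - g] + [data[n - g - 1]] * g
    return statistics.variance(winsorized)

def robust_cohens_d(group1, group2, trim=TRIM):
    """Trimmed-mean difference over the pooled Winsorized SD, times 0.642.

    The constant 0.642 applies to 20% trimming only; it rescales the result
    so that it is comparable to ordinary Cohen's d under normality.
    """
    n1, n2 = len(group1), len(group2)
    pooled_variance = (
        (n1 - 1) * winsorized_variance(group1, trim)
        + (n2 - 1) * winsorized_variance(group2, trim)
    ) / (n1 + n2 - 2)
    difference = trimmed_mean(group1, trim) - trimmed_mean(group2, trim)
    return 0.642 * difference / math.sqrt(pooled_variance)

# Invented data: frequency of one word per 10,000 tokens for each speaker.
men = [2, 3, 1, 2, 60, 3, 2, 1]    # one speaker is an extreme outlier
women = [3, 2, 4, 3, 2, 3, 4, 3]

plain_diff = statistics.mean(men) - statistics.mean(women)
robust_diff = trimmed_mean(men) - trimmed_mean(women)
d_r = robust_cohens_d(men, women)

print(f"Mean difference:        {plain_diff:.2f}")
print(f"Robust mean difference: {robust_diff:.2f}")
print(f"Robust Cohen's d:       {d_r:.2f}")
```

In this invented example the ordinary mean difference is positive because one speaker inflates the men's mean, whereas the robust mean difference is negative once the extreme values are trimmed. Reporting the robust measures alongside the plain ones shows how strongly a result depends on individual speakers.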