Lancaster Stats Tools online

Toolbox


1
Introduction Statistics meets corpus linguistics

1. Type in a mathematical expression to be calculated. For help click here.


1. Paste tab delimited data including header row and id column. For help click here.


2. Select parameters.

One linguistic variable Multiple linguistic variables (relationship)

Description Inference




R code #histogram
hist(x, breaks="Sturges", col="gray", xlab="linguistic variable", main="Histogram")

#boxplot with points and mean overlay
boxplot(myData, ylab = "linguistic variable",xlab="(sub)corpora", outline = FALSE, ylim=c(0, max(myData, na.rm=TRUE)*1.05)); i = 1;while(i <= ncol(myData)) { for(v in myData[,i]){points(jitter(i,3/i),v, col = "blue", pch=1, cex = 1)};
points(i, mean(myData[,i],trim = 0, na.rm = TRUE), col = "red", pch="_", cex = 4); i= i+1; }

#scatter plot with regression line
plot(myData); fitline <- lm(myData[,2] ~ myData[,1]); abline(fitline,col="red")

#error bars (error.bars() comes from the psych package)
library(psych)
error.bars(myData,stats=NULL, ylab = "linguistic variable",xlab="(sub)corpora", main=NULL,eyes=FALSE, ylim = NULL, xlim=NULL,alpha=.05,sd=FALSE, labels = NULL, pos = NULL, arrow.len = 0.05,arrow.col="red", add = FALSE,bars=FALSE,within=FALSE, col="red")

1. Select what you want to randomize. For help click here.


2. Paste data in the text area and choose what you want to randomize.


Lines Words Sentences
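The randomisation step can be approximated in plain R. The following is a minimal sketch, assuming the tool simply shuffles the chosen units; the helper name `shuffle_units` and its splitting rules are illustrative assumptions, not the tool's actual implementation.

```r
# Hypothetical helper: shuffle the lines, words or sentences of a text.
shuffle_units <- function(text, unit = c("lines", "words", "sentences")) {
  unit <- match.arg(unit)
  # Assumed splitting rules: newline, white space, or white space after . ! ?
  pattern <- switch(unit,
    lines = "\n",
    words = "[[:space:]]+",
    sentences = "(?<=[.!?])[[:space:]]+")
  units <- strsplit(text, pattern, perl = TRUE)[[1]]
  units <- units[nzchar(units)]   # drop empty strings
  sample(units)                   # random permutation of the units
}

set.seed(1)                       # makes the shuffle reproducible
shuffled <- shuffle_units("one two three four", unit = "words")
print(shuffled)
```

Setting a seed before calling `sample()` makes the randomised order reproducible, which is useful when the shuffled data are reported in a study.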


Stats calculator

The video can be downloaded here. Stats calculator data [txt].

Graph tool

The video can be downloaded here. Graph tool data [csv] [xlsx].

Randomiser

The video can be downloaded here. Randomiser data [csv].

2
Vocabulary Frequency and dispersion

1. Paste the text you want to analyse into the text box below.


2. Choose language.


3. Change parameters or leave default options.

a) Case sensitive types

b) TTR normalization basis

c) Word delimiters (in addition to white space)




1. Paste the text you want to analyse into the text box below.

2. Choose language.

3. Define a word.

case sensitive (types)

4. Choose the basis for normalisation.
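The effect of the word definition and the normalisation basis on the type/token ratio (TTR) can be illustrated with a short sketch. The function `ttr` and its arguments are hypothetical, and averaging the TTR over consecutive chunks of `basis` tokens is one common way of standardising the measure; it is an assumption that the word calculator works along these lines.

```r
# Hypothetical helper: type/token ratio with an optional normalisation basis.
ttr <- function(text, case_sensitive = FALSE, basis = NULL) {
  # Assumed word definition: split on white space and punctuation
  tokens <- strsplit(text, "[[:space:][:punct:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (!case_sensitive) tokens <- tolower(tokens)
  if (is.null(basis)) {
    return(length(unique(tokens)) / length(tokens))   # plain TTR
  }
  # Standardised TTR: mean TTR over consecutive chunks of `basis` tokens
  n_chunks <- length(tokens) %/% basis
  stopifnot(n_chunks >= 1)
  chunk_ttrs <- sapply(seq_len(n_chunks), function(k) {
    chunk <- tokens[((k - 1) * basis + 1):(k * basis)]
    length(unique(chunk)) / basis
  })
  mean(chunk_ttrs)
}

txt <- "The cat sat on the mat. The dog sat too."
ttr(txt)                          # 7 types / 10 tokens
ttr(txt, case_sensitive = TRUE)   # "The" and "the" counted as separate types
ttr(txt, basis = 5)               # mean TTR of two 5-token chunks
```

Because plain TTR falls as texts get longer, a fixed basis makes values from texts of different lengths more comparable.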










Word calculator

The video can be downloaded here.

3
Semantics and discourse Collocations, keywords and lockwords

1. Enter parameters for collocate calculation. For help click here.

A) Tokens in the corpus
B) Frequency of
C) Frequency of
D) Frequency of the collocation (node + collocate)
E) Window size L R
F) Correction for window size
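How parameters A) to F) combine can be sketched with the mutual information (MI) score, one widely used association measure. The function name `mi_score` and its argument names are hypothetical, and whether the calculator uses exactly this form of the window-size correction is an assumption.

```r
# Hypothetical helper; the arguments mirror parameters A) to F) above.
mi_score <- function(tokens, freq_node, freq_collocate, freq_collocation,
                     window_left = 1, window_right = 1, correct_window = TRUE) {
  # Assumed correction: multiply the expected frequency by the window span
  window <- if (correct_window) window_left + window_right else 1
  expected <- freq_node * freq_collocate * window / tokens
  log2(freq_collocation / expected)   # MI = log2(observed / expected)
}

# Made-up frequencies: the expected frequency is 1000 * 500 * 2 / 1e6 = 1
mi <- mi_score(tokens = 1e6, freq_node = 1000, freq_collocate = 500,
               freq_collocation = 64, window_left = 1, window_right = 1)
print(mi)
```

With these made-up values the collocation is observed 64 times against an expected frequency of 1, so the score is log2(64) = 6.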

#LancsBox is a free multi-platform tool for the analysis of language. #LancsBox, among other things, identifies collocations and keywords. You need to download #LancsBox to your computer.

1. Paste tab delimited data including header row and id column. For help click here.


2. Select the type of judgement variable.

Nominal variable (categories) Ordinal variable (ranks) Interval/ratio variable (scale)




R code # R functions: http://www.agreestat.com/r_functions.html
#nominal
gwet.ac1.raw(myData)
myData1<-table(myData); kappa2.table(myData1)
#nominal 3 + raters
fleiss.kappa.raw(myData)
#ordinal
gwet.ac1.raw(myData, weights="ordinal")
#scale
ICC(myData)

#LancsBox

The video can be downloaded here.

4
Lexico-grammar From simple counts to complex models

1. Copy and paste your data in the box below. For help click here.


1. Paste tab delimited data including header row and id column. For help click here.


2. Select options.

Input format of the data: Cross-tab Data set

Test: Chi-squared Chi-squared (Yates's correction) Log likelihood Fisher exact test

Visualize relationship




R code source("http://corpora.lancs.ac.uk/stats/r_functions/loglik.r");
source("http://corpora.lancs.ac.uk/stats/r_functions/CramerV.r");
source("http://corpora.lancs.ac.uk/stats/r_functions/riskratio.r");
#cross-tabulate
data<- table(data)
#statistical tests
chisq.test(data, correct = FALSE);
chisq.test(data, correct = TRUE);
g.test(data, correct = "none");
fisher.test(data);
#effect sizes
CramerV(data, conf.level = 0.95);
riskratio(data);
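A small worked example may help to show what the test and the effect size measure. It uses base R only, with made-up counts; Cramer's V is computed by hand from the chi-squared statistic, on the assumption that the sourced `CramerV()` function reports the same quantity.

```r
# Illustrative 2 x 2 cross-tab (made-up counts)
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(variant = c("A", "B"), corpus = c("C1", "C2")))

test <- chisq.test(tab, correct = FALSE)   # chi-squared without Yates's correction
print(test$expected)                       # expected frequencies under independence

# Cramer's V = sqrt(chi-squared / (N * (min(rows, columns) - 1)))
cramers_v <- sqrt(unname(test$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
print(cramers_v)
```

For a 2 x 2 table Cramer's V equals the absolute value of the phi coefficient, so it can be read on the same 0 to 1 scale as a correlation.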

1. Select what you want to do. For help click here.


2. Paste data in the text area.

3. Type in the exact name of the outcome variable.

4. Type in the exact name(s) of the predictor(s) [use ; as separator].

5. Decide if you want to include predictor interactions.

Yes, include all Yes, include some No
6. Type in the exact names of the predictors with interactions [use ; as separator].


5
Register variation Correlation, clusters and factors

1. Paste tab delimited data including header row and id column. For help click here.

2. Select options.

Parametric Non-parametric

Visualize correlation


R code library(Hmisc); library(corrplot); library(stats); #libraries used
cor.test(mydata1, mydata2, method="pearson") #Pearson's correlation
cor.test(mydata1, mydata2, method="spearman") #Spearman's correlation
rcorr(as.matrix(mydata), type="pearson") #correlation matrix (rcorr expects a numeric matrix)
plot(mydata, col ="blue"); fitline <- lm(mydata1 ~ mydata2); abline(fitline,col="red") #scatter plot
corrplot(m, method ="color", type = "full", diag = TRUE, addCoef.col="black", addCoefasPercent=FALSE, addgrid.col="grey", tl.pos = NULL, tl.cex = 1, tl.srt = 45, tl.col = "black") #correlation matrix plot; m is the correlation matrix, e.g. cor(mydata)

1. Paste tab delimited data including header row and id column. For help click here.


2. Select parameters.

Transform data to z-scores

3. Select highlight.




R code mydata <- scale(mydata) # optional z-score transformation
d <- dist(mydata, method = "manhattan") # distance matrix
fit <- hclust(d, method="ward.D") #Cluster analysis
plot(fit, xlab="", ylab="Height", main="")#plot dendrogram
rect.hclust(fit, k=5, border="red") #draw cluster groups
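The same steps can be tried on a small made-up data set. This sketch uses base R only and adds `cutree()` to read off cluster membership as a vector rather than from the dendrogram; the data and the choice of two clusters are illustrative assumptions.

```r
set.seed(42)
# Made-up data: 6 texts x 2 linguistic variables, forming two obvious groups
mydata <- rbind(matrix(rnorm(6, mean = 0, sd = 0.1), ncol = 2),
                matrix(rnorm(6, mean = 5, sd = 0.1), ncol = 2))
rownames(mydata) <- paste0("text", 1:6)

d <- dist(scale(mydata), method = "manhattan")  # distances on z-scores
fit <- hclust(d, method = "ward.D")             # hierarchical clustering
groups <- cutree(fit, k = 2)                    # cluster membership per text
print(groups)
```

The value of `k` passed to `cutree()` plays the same role as `k` in `rect.hclust()`: it sets how many groups the tree is cut into.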

1. Paste tab delimited data including header row and id column. For help click here.


2. Select the type of analysis you want to carry out.

Full MD Comparison with Biber's (1988) dimensions




R code library(psych); #library used
cortest.bartlett(mydata); det(cor(mydata)) # Bartlett's test and multicollinearity test
fa.parallel(mydata, fa="fa", main = "Scree Plot", show.legend=FALSE) #screeplot
factanal(mydata, number, rotation="promax") #factor analysis

Multidimensional analysis [data]

6
Sociolinguistics Individual and social variation

1. Paste tab delimited data including header row and id column. For help click here.


2. Select data options.

Different groups Same group different conditions

3. Select type of test.

Parametric test Non-parametric test




R code #t-test
t.test(data[ ,1], data[ ,2], paired=FALSE)
#t-test: repeated measures
t.test(data[ ,1], data[ ,2], paired=TRUE)
#Mann-Whitney-Wilcoxon rank sum test
wilcox.test(data[ ,1], data[ ,2], paired=FALSE)
#Wilcoxon signed-rank test: repeated measures
wilcox.test(data[ ,1], data[ ,2], paired=TRUE)
#One-way ANOVA
aov(measurement ~ group, data = data)
#Kruskal-Wallis test
kruskal.test(data)
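The t-test calls above take one column per group, whereas the ANOVA formula `measurement ~ group` expects one row per observation. The sketch below shows one way to move between the two layouts with `stack()`; the data are made up and the column names are chosen only to match the formula above.

```r
# Made-up wide data: one column per group
wide <- data.frame(groupA = c(10, 12, 11, 13),
                   groupB = c(15, 17, 16, 18),
                   groupC = c(20, 19, 21, 22))

long <- stack(wide)                       # one row per observation
names(long) <- c("measurement", "group")  # rename to match the formula

fit <- aov(measurement ~ group, data = long)            # one-way ANOVA
print(summary(fit))
kw <- kruskal.test(measurement ~ group, data = long)    # non-parametric alternative
print(kw)
```

Using the formula interface for both tests keeps the parametric and non-parametric analyses on exactly the same data layout.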

1. Paste tab delimited data including header row and id column. For help click here.



R code library(languageR);
x = corres.fnc(data);
plot(x, ccex = 0.6, rcex = 0.6);

1. Paste data in the text area. For help click here.

2. Type in the exact name of the outcome variable.

3. Type in the exact name(s) of the fixed effect predictor(s) [use ; as separator].

4. Type in the exact name(s) of the random effect predictor.

5. Decide if you want to include predictor interactions.

Yes, include all Yes, include some No
6. Type in the exact names of the predictors with interactions [use ; as separator].


R code library(lme4);
glmer(outcome~predictor+(1|randeffect), family = binomial, data = mydata);

7
Change over time Working with diachronic data

1. Paste tab delimited data including header row and id column. For help click here.


2. Select parameters.


Difference between two corpora (two-tailed)
Increase between corpus 1 and corpus 2 (one-tailed)
Decrease between corpus 1 and corpus 2 (one-tailed)




R code library(boot);
source("http://users.ics.aalto.fi/lijffijt/bootstraptest/bootstraptest.R");
bootstraptest(period1, period2,samples,'p2');
percid(b);
boot(data=b, statistic=percid, R=samples);

1. Paste tab delimited data including header row and id column. For help click here.


2. Select parameters.




R code source("http://corpora.lancs.ac.uk/stats/r_functions/VNC.r");
abc(data,"sd");

1. Paste tab delimited data including header row and id column. For help click here.


2. Select parameters.

No transformation Log transformation




R code library(ggplot2)
library(mgcv)
p<-ggplot(data, aes(x = data[,1], y =data[,2])) + geom_point() + xlab("Time") + ylab("Linguistic variable"); p + stat_smooth(method = "gam", formula = y ~ s(x, bs = "cr", fx=FALSE, k =15), size = 1, fill="#707070", level = 0.95 )+ stat_smooth(method = "gam", formula = y ~ s(x, bs = "cr", fx=FALSE, k =15), size = 1, fill="#FFFF00", level = 0.99);

1. Indicate historical period. For help click here.


2. Upload a zip file with collocation files.


3. Provide info about data.

Regex for identifying collocates:

Column delimiter:


4. Define a collocate.

sampling points

% sampling points


5. Decide if you want to run the analysis with frequency cut-off point.

Yes, absolute cut-off Yes, relative cut-off No

6. Provide additional info.

Regex for identifying node frequency in header (relative cut-off):


R code #Calculate Gwet's AC1; b...input data frame
i = 1; v <- c(); while(i+1 < ncol(b)) {n=(gwet.ac1.raw(b[,i:(i+2)])[3]);v<- c(v, n); i= i+1; }
#Prepare data frame
h<-seq(from, to, by = 1); g<-data.frame(h,v)
#Produce graph
p<-ggplot(g, aes(x = g[,1], y =g[,2])) + xlim(from, to)+ scale_x_continuous(breaks = seq(from, to, by = 10)) + geom_point() + xlab("Time") + ylab("AC1"); p + stat_smooth(method = "gam", formula = y ~ s(x, bs = "cr", fx=FALSE, k =10), size = 1, fill="#707070", level = 0.95 )+ stat_smooth(method = "gam", formula = y ~ s(x, bs = "cr", fx=FALSE, k =10), size = 1, fill="#FFFF00", level = 0.99)

8
Bringing everything together Ten principles of statistical thinking, meta-analysis and effect sizes

1. Choose input type for the calculation of effect size. For help click here.


2. Insert required value or values. Separate multiple values by a semi-colon (;).




R code library(compute.es)
pes(p,n1,n2) #based on p-value
mes(m1,m2,sd1,sd2,n1,n2) #based on means
tes(t,n1,n2) #based on t-value (t-test)
fes(F,n1,n2) #based on F (ANOVA)
res(r,NULL,n) #based on r (e.g. correlation)
des(d,n1,n2) #based on Cohen's d
lores(lor,var,n1,n2) #based on Log Odds Ratio
d=(2*r)/sqrt(1-(r*r)) #based on r only
r=d/sqrt((d*d)+4) #based on Cohen's d only
d=(2*sqrt(e))/sqrt(1-e) #based on eta2
d=(lor*sqrt(3))/pi #based on Log Odds Ratio only
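The two conversion formulas between r and Cohen's d are inverses of each other, which can be checked with a short sketch. The function names `r_to_d` and `d_to_r` are illustrative and not part of compute.es.

```r
# Illustrative wrappers around the conversion formulas listed above
r_to_d <- function(r) (2 * r) / sqrt(1 - r^2)   # r -> Cohen's d
d_to_r <- function(d) d / sqrt(d^2 + 4)         # Cohen's d -> r

d <- r_to_d(0.6)      # 1.2 / sqrt(0.64) = 1.5
r_back <- d_to_r(d)   # 1.5 / sqrt(6.25) = 0.6
print(c(d = d, r = r_back))
```

Note that the d-to-r formula with the constant 4 assumes two groups of equal size.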

Paste a list of studies and their standardised results (d, n1, n2). For help click here.




R code library(meta)
#Calculate Variance ES
es.d.v <-(((n1+n2)/(n1*n2))+(es.d^2/(2*(n1+n2))))
#Calculate Standard Errors ES
d.se<-sqrt(es.d.v)
meta1<-metagen(es.d, d.se)
forest(meta1, studlab=c("Study1","Study2","Study3","Study4","Study5"), xlab="Cohen’s d", col.square="black",xlim=c(-3,3), col.diamond="black", fontsize=14, squaresize=0.5, leftcols=c("studlab"), rightcols=c("effect", "ci"), hetstat=FALSE, comb.fixed=FALSE, text.random="Overall ES", print.tau2=FALSE,print.I2=FALSE,TE.random=FALSE, seTE.random=FALSE)
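The variance and standard error formulas can be followed through to a pooled estimate by hand. The sketch below uses made-up study values and computes a simple inverse-variance (fixed-effect) average in base R. This is a simplification for illustration: the forest plot above reports the random-effects estimate from `metagen()`, which also models between-study variance.

```r
# Made-up standardised results for three studies
es.d <- c(0.2, 0.5, 0.8)   # Cohen's d
n1   <- c(20, 30, 40)      # size of group 1
n2   <- c(20, 30, 40)      # size of group 2

# Variance and standard error of d (same formulas as above)
es.d.v <- ((n1 + n2) / (n1 * n2)) + (es.d^2 / (2 * (n1 + n2)))
d.se   <- sqrt(es.d.v)

# Inverse-variance weights: more precise studies count for more
w <- 1 / es.d.v
pooled    <- sum(w * es.d) / sum(w)   # fixed-effect pooled d
pooled_se <- sqrt(1 / sum(w))
ci <- pooled + c(-1, 1) * qnorm(0.975) * pooled_se   # 95% confidence interval
print(c(pooled = pooled, lower = ci[1], upper = ci[2]))
```

Because larger studies have smaller variances, the pooled value is pulled towards the effect sizes of the larger studies.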

The video can be downloaded here.
