Welcome to the UCREL Significance Test System
Enter your data into the table below. Click here to read the instructions on using this system.
Here are the basic instructions for using this tool.
And one word of warning: when interpreting the p-values, do not rely on them being precise to the very last decimal place; the smallest digits can be slightly altered by floating-point rounding as the numbers are converted and passed between programs. The higher digits are reliable, though!
For more detailed information on significance tests, click here to read more.
A statistical significance test allows you to look at two variables and ask, “are these variables related or unrelated?” – or to put it another way, are they independent of one another or not?
A typical example would be a medical study comparing two groups of people – a group who were given a particular treatment, and a group who weren't. The first variable is thus the group each person was in – group A or group B. The second variable is what we measure – in this case, the outcome (did the person recover from their illness, or not?). If the variables are independent, then what group the person was in makes no difference to whether they recovered or not. On the other hand, if the variables are not independent, then the treatment did make people more (or less!) likely to recover.
Often with real-life data, it can be hard to be sure whether the numbers we've observed give us enough evidence to be confident that two variables are not independent. For that reason, we use statistical tests to find out whether differences between groups or categories in the data are significant.
When we do a significance test, we work out how probable our observed data would be on the assumption that the variables are independent. If our observed data is very unlikely under that scenario, we can rule out that working assumption, and conclude that the variables are not independent: the first variable has an effect on the value of the second variable (or vice versa).
In linguistics, we often count linguistic units such as instances of a particular word or grammatical construction or usage pattern, and test the significance of the use of that unit versus other units, across two or more groups of speakers, or across two or more types of text, and so on.
This online tool uses three kinds of tests that are common in corpus linguistics: chi-squared, log-likelihood, and the Fisher exact test. However, many other statistical significance tests do exist, e.g. the Z-test or the T-test. All significance tests look at a contingency table with one variable represented as rows, and the other as columns.
The chi-squared and log-likelihood tests generate a “statistic” which can be translated into a p-value (the translation procedure is the same for both). The Fisher exact test generates a p-value directly. But what is a p-value?
The p-value is the probability of getting the observed values in the contingency table if the two variables are independent. Like all probability scores, it is between zero (no chance at all of getting these observed values) and one (absolute certainty of getting these observed values). If the p-value is sufficiently low, it is very unlikely that we would see the observed data unless the variables were related; therefore, we conclude that the variables are related.
To put it another way, the significance test is passed if the p-value is below a certain cut-off, and failed if it is above that cut-off.
When one of the variables represents a group or category, we can say that there is a significant difference between the groups if the p-value is below the cut-off.
What cut-off point we use is a matter of convention. The usual cut-off is p < 0.05, i.e., a less than five percent chance of observing this data if there is no connection between variables. In this case, we say that the result is significant at the 5% level.
However, in some research, much more stringent cut-offs are used, such as 1% or 0.1%.
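The translation from a test statistic to a p-value can be sketched in a few lines. This is an illustrative Python version, not the system's own code (which runs in R), and it covers only the two-by-two case, where the chi-squared and log-likelihood tests both have one degree of freedom:

```python
import math

def chi2_sf_df1(stat):
    """Survival function of the chi-squared distribution with 1 degree
    of freedom: the p-value corresponding to a test statistic computed
    from a 2x2 contingency table."""
    return math.erfc(math.sqrt(stat / 2.0))
```

The conventional 5% critical value for one degree of freedom is about 3.84: a statistic above that translates to a p-value below 0.05, i.e. a result significant at the 5% level.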
The chi-squared test is traditionally the first significance test that people learn when studying statistics. It is very widely used in a variety of fields. However, it has a number of known flaws, especially for use in corpus linguistics. First, it is not reliable if the contingency table contains any small numbers (rule of thumb: any cell with an expected frequency of less than 5); this is especially the case when the sum of all the cells in the table is very large, which it usually is for corpus data. Second, the chi-squared statistic only approximates the results that the other two tests given here calculate more precisely.
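To make the computation concrete, here is a minimal Python sketch of the Pearson chi-squared statistic for a two-by-two table (the system itself does this in R; the function name here is invented for the example):

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for the
    2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Expected count per cell: (row total * column total) / grand total,
    # i.e. what we would see if the two variables were exactly independent.
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

The statistic sums, over the four cells, the squared gap between the observed and expected counts, scaled by the expected count, so it grows as the table departs from independence.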
The log-likelihood test, also known as the G-squared or G² test, is not as widely used as the chi-squared test, although it has been fairly common in corpus linguistics and computational linguistics ever since 1993, when Dunning wrote a paper arguing that the log-likelihood test is better than the chi-squared test for corpus data. In general, it can be said that log-likelihood is never a worse test than chi-squared, and is sometimes better, because the drawbacks associated with the chi-squared test do not apply to log-likelihood.
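The log-likelihood statistic can be sketched in the same style – again an illustrative Python version rather than the system's own R code. It uses the same expected counts as chi-squared and differs only in how each cell's contribution is measured:

```python
import math

def log_likelihood_2x2(a, b, c, d):
    """Log-likelihood (G-squared) statistic for the 2x2 contingency
    table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    # G2 = 2 * sum of O * ln(O / E) over cells; empty cells contribute zero.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)
```

Note that a cell with an observed count of zero simply drops out of the sum, which is one reason the statistic behaves better than chi-squared on tables with small counts.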
The Fisher exact test is a test which calculates a p-value, but does it directly by calculating the exact probability – thus the name – rather than producing a test statistic which can be translated into a p-value. In the past, the Fisher exact test was rarely used because it requires much more computing power than either of the other two tests. However, the onward march of technology means that lack of computer power is no longer a problem. Importantly, the Fisher exact test still works even if the contingency table contains one or more very low numbers.
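As an illustration, a hypothetical Python version of the two-sided Fisher exact test for a two-by-two table might look like the sketch below. It assumes the common two-sided convention (also used by R's fisher.test) of summing the probabilities of all tables with the same row and column totals that are no more likely than the observed one:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def p_table(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given the fixed row and column totals.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = p_table(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum over every possible table at least as extreme as the observed one
    # (a small tolerance guards against floating-point ties).
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))
```

Enumerating every possible table is exactly why this test was historically considered expensive: the work grows with the size of the counts, whereas the other two tests are a fixed handful of arithmetic operations.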
The advantages and disadvantages of each test are outlined above. However, the following two principles should normally be followed:
Alongside the test results, the system supplies you with one or more effect size measures, which should also be taken into account when interpreting the data. While the significance test tells you how confident you can be that the two variables are in fact related, the effect size measure tells you how strongly they are related.
For this reason effect size measures can also be called association measures (they measure the strength of the association between the two variables).
There are many different effect size measures; this system uses just a handful. If your table has two columns and two rows (the most common case), four different measures are calculated. For any other size of table, just one measure is calculated.
Phi coefficient: This is a measure of how strong the correlation is between your two variables. It is always between 0 (no correlation at all) and 1 (total correlation, the two variables are completely linked). The normal interpretation is that values around 0.1 indicate a small correlation, values around 0.3 indicate a medium correlation, and values around 0.5 indicate a large correlation.
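For a two-by-two table the phi coefficient has a simple closed form, equivalent to the square root of chi-squared divided by the table total; here is a minimal Python sketch (the function name is invented for the example):

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient for the 2x2 table [[a, b], [c, d]]:
    |ad - bc| / sqrt of the product of the four marginal totals,
    which equals sqrt(chi-squared / N)."""
    num = abs(a * d - b * c)
    den = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return num / den
```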
Yule’s Q coefficient: This measure ranges from –1 (complete negative correlation) through 0 (no correlation at all) to +1 (complete positive correlation). The strength of the correlation can be interpreted in the same way as for the phi coefficient. The difference between the phi coefficient and Yule’s Q is that the phi coefficient is based on the chi-squared test result, whereas Yule’s Q is not, although both phi and Q can be interpreted as measuring correlation.
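Yule’s Q also has a simple closed form for a two-by-two table [[a, b], [c, d]]; a hypothetical Python sketch:

```python
def yules_q(a, b, c, d):
    """Yule's Q for the 2x2 table [[a, b], [c, d]]: the difference of the
    diagonal cross-products over their sum, ranging from -1 to +1."""
    return (a * d - b * c) / (a * d + b * c)
```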
Odds ratio: This measure and the risk ratio discussed below are used more commonly in medicine than in linguistics. Nevertheless, we do use them in linguistics sometimes. The odds ratio is 1 when there is no effect and the two variables are not related to one another at all. If there is a positive link (both variables go up together), then the odds ratio is greater than 1. If there is a negative link (one variable goes up as the other goes down), then the odds ratio is a fraction less than 1.
The size of the odds ratio can be interpreted as follows: a small effect is about 1.5 (or about 0.66 if it's a negative link); a medium effect is about 3.5 (or about 0.28 if it's a negative link); a large effect is about 9 (or about 0.11 if it's a negative link). As you can see, a higher odds ratio is a stronger association for positive links, but a lower odds ratio is a stronger association for negative links.
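The odds ratio for a two-by-two table [[a, b], [c, d]] is just the odds of being in column 1 for row 1, divided by the same odds for row 2, i.e. (a/b)/(c/d); sketched in Python (function name invented for the example):

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]: the odds of
    column 1 within row 1, divided by the odds of column 1 within row 2
    (equivalently, ad / bc)."""
    return (a / b) / (c / d)
```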
Risk ratio: This is the least common effect measure; its other name is relative risk, and in corpus linguistics it is sometimes called the ratio of relative frequencies (RRF) (these all mean the same thing). The risk ratio indicates how many times more likely an observation is to fall in column 1 if it is in row 1 than if it is in row 2 (in medicine, for instance, how many times more likely is a patient to recover if they have taken the drug than if they have not taken the drug?).
The interpretation is similar to the odds ratio: 1 means no effect, a fraction less than 1 is a negative effect, and a number above 1 is a positive effect. However, there is an important difference from the odds ratio: with the risk ratio, whether the effect is positive or negative depends on which variable you have used for the columns and which one you have used for the rows (for most of the other statistics, this makes no difference). The risk ratio is the basis for the Log Ratio statistic for keywords and collocations in corpus linguistics. Otherwise it is not much used in linguistics.
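The risk ratio compares proportions rather than odds; a minimal Python sketch for a two-by-two table [[a, b], [c, d]] (function name invented for the example):

```python
def risk_ratio(a, b, c, d):
    """Risk ratio (relative risk) for the 2x2 table [[a, b], [c, d]]:
    the proportion of row 1 falling in column 1, divided by the same
    proportion for row 2."""
    return (a / (a + b)) / (c / (c + d))
```

The Log Ratio statistic mentioned above is the base-2 logarithm of this value, so a Log Ratio of 0 corresponds to a risk ratio of 1 (no effect).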
Usually, all four measures will tell you the same story – they are all based on the same numbers, just combined in slightly different ways.
Cramér’s V: Cramér’s V is the equivalent of the phi coefficient for tables bigger than two-by-two. It should be interpreted in broadly the same way.
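Cramér’s V generalises the phi formula by dividing chi-squared by N times one less than the smaller table dimension; a hypothetical Python sketch that works for any table size:

```python
import math

def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows:
    sqrt(chi-squared / (N * (min(r, c) - 1)))."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    n = sum(rows)
    # Pearson chi-squared over all cells of the r x c table.
    chi2 = sum(
        (table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(len(rows)) for j in range(len(cols))
    )
    k = min(len(rows), len(cols)) - 1
    return math.sqrt(chi2 / (n * k))
```

For a two-by-two table, min(r, c) − 1 is 1, so V reduces exactly to the phi coefficient.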
All the actual statistical calculation is done using the R environment for statistical computing. The RFace library is used to control R from within the PHP script. RFace (“R interface”) was created as part of CQPweb, but can be used to interface any PHP application to R. Its main job is to translate data from PHP to R and back again.
The chi-squared test and the Fisher exact test are performed using built-in R functions. The log-likelihood test is calculated using a custom-made R function, which you can see by clicking here.