
Part 2: Accessing and analysing corpus data

“Corpus linguistics doesn't mean anything. [...] Maybe the sciences should just collect lots and lots of data and try to develop the results from them. Well if someone wants to try that, fine. They're not going to get much support in the chemistry or physics or biology department. But if they feel like trying it, well, it's a free country, try that. We'll judge it by the results that come out.” (Chomsky interviewed by Andor 2004: 97).

Before the mid-twentieth century, linguists used a mix of observed data and invented examples. Some areas, such as field linguistics (e.g. Boas 1940) or the study of child language acquisition (e.g. Stern and Stern 1907, Templin 1957), relied almost exclusively on observed language data in this period.

From the mid-twentieth century, however, Noam Chomsky's views on data in linguistics (see the quotation above) promoted introspection as the main source of linguistic data, at the expense of observed data. More recently, and in particular since about 1980, objections such as Chomsky's have been reassessed and to some extent rejected, and corpora have come into wide-scale use in linguistics.

In this section, we'll be looking at three important issues that arise when we access and analyse corpus data.

How can we make use of corpus metadata, markup and annotation?
What kinds of corpus analysis software are available, and what can they do?
What do we need to know about statistics in corpus linguistics?

This page was last modified on Sunday 30 October 2011 at 9:18 pm.