What is a corpus?

In the EAGLES recommendations on corpus typology (EAGLES, 1996e), a corpus is defined as:

Corpus: A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.

Words such as `collection' and `archive' refer to sets of texts that need not be selected or ordered, or whose selection and/or ordering need not be based on linguistic criteria. They are therefore quite unlike corpora.

Linguistic criteria to be applied to the selection and ordering may be:

External: in that they concern the participants, the occasion, the social setting or the communicative function of the pieces of language;
Internal: in that they concern the recurrence of language patterns within the pieces of language.

These criteria are reviewed in more detail in the recommendations on corpus typology (EAGLES, 1996e), which also give a classification of different types of corpora.

Since this document is devoted to computer corpora, it is appropriate to start with the definition proposed in the same recommendations:

Computer corpus: a corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks.
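
As a rough illustration only, the two properties in this definition (one homogeneous encoding, and retrieval tasks that are not fixed in advance) might be sketched as follows. The names Sample, ComputerCorpus, setting, function and retrieve are hypothetical and do not come from the EAGLES recommendations; the two sample texts are invented.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    """One piece of language plus external criteria it was selected on."""
    text: str
    setting: str    # hypothetical field: social setting, e.g. "broadcast"
    function: str   # hypothetical field: communicative function, e.g. "news"


@dataclass
class ComputerCorpus:
    """Samples held in one uniform record format, so any query can run over all of them."""
    samples: List[Sample] = field(default_factory=list)

    def retrieve(self, predicate: Callable[[Sample], bool]) -> List[Sample]:
        # Open-ended retrieval: the caller supplies the question as a predicate.
        return [s for s in self.samples if predicate(s)]


# Invented example data, for illustration only.
corpus = ComputerCorpus([
    Sample("Good evening, here is the news.", "broadcast", "news"),
    Sample("Dear Sir, I write to complain.", "correspondence", "complaint"),
])

# A query on an external criterion (communicative function).
news = corpus.retrieve(lambda s: s.function == "news")

# A query on an internal criterion (occurrence of a word form in the text).
with_word = corpus.retrieve(lambda s: "write" in s.text.lower().split())
```

Because every sample shares the same record structure, a new question needs only a new predicate, not a new encoding of the texts.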