
Quality

  The default value for Quality is authentic. All the material is gathered from the genuine communications of people going about their normal business. Anything which involves the linguist beyond the minimum disruption required to acquire the data is reason for declaring a special corpus. Such a declaration protects the interest of those who wish to make statements about the way the language is used in ordinary communication, and who might be misled into including data which had arisen in experimental conditions, or in artificial circumstances of various kinds.

It is difficult to draw the line. For example, some television shows deliberately put participants into artificial and indeed bizarre conditions and induce extremely odd responses. Casual conversation is expected to be impromptu but it can be rehearsed by one or more parties.

However difficult the boundary cases may be, it is important that serious intervention by the linguist, or the creation of special scenarios, is recorded in the name of the corpus. "Experimental corpus" may be suggested as a general category.

One well-recognised type of experimental corpus is the speech corpus, which is assembled for the study of fine details of the spoken language. Such a corpus may be very small and be the product of asking subjects to read out strange messages in anechoic chambers. The classification of speech corpora is not the concern of this document. (For more on speech corpora, refer to the reports of the EAGLES Spoken Language Working Group.)

A special category is the literary corpus, of which there are many kinds. Biblical and literary scholarship began the discipline of corpus linguistics long ago, and there is a lot of expertise available in literary circles on such things as establishing a canon of an author's works. Classification criteria include, as well as the author, the genre (odes, short stories, etc.), the period, the group (Augustan poets, campus novelists, etc.), and the theme (revolutionary writings, etc.). Drama is usually kept separate from prose and poetry.

Corpora of the language of children, geriatrics, non-native speakers and users of extreme dialects, and of very specialised areas of communication (like the heraldic blazon, the knitting pattern, or the auctioneer's patter), should also be designated special corpora because of the unrepresentative nature of the language involved.

Note that the special corpus is different in principle from a corpus that features one or other variety of normal, authentic language. A corpus of conversations is not a special corpus, nor is a corpus of newspaper text, or even of one particular newspaper. There is a distinction made here between variety within the limits of reasonable expectation of the kind of language in daily use by substantial numbers of native speakers, and varieties which for one reason or another deviate from the central core. The special corpora are those which do not contribute to a description of the ordinary language, either because they contain a high proportion of unusual features, or because their origins are not reliable as records of people behaving normally.

Each component, then, illustrates a particular kind of language, and for each component there should be a descriptive label that identifies the homogeneity of the material inside. The particularity of the language may be retained at corpus or subcorpus level without transferring the corpus into the 'special' category.

Hence it is proposed that special corpora are designated as follows, e.g.

Special corpus: Poetry of Aphasics

A corpus which illustrates a particular variety of normal language is designated, e.g.

Corpus of The Times newspaper, 1991
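
The two designation patterns above can be sketched as a small labelling routine. This is only an illustration, assuming a label is built from a free-text description plus a yes/no flag for special status; the class name CorpusLabel, its fields and the designation method are hypothetical and are not part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusLabel:
    """Hypothetical record for a corpus name; not defined by the source document."""

    description: str       # e.g. "Poetry of Aphasics" or "The Times newspaper, 1991"
    special: bool = False  # True when the material is not ordinary, authentic language

    def designation(self) -> str:
        # Special corpora carry the "Special corpus:" prefix in their name;
        # varieties of normal language are simply "Corpus of ...".
        if self.special:
            return f"Special corpus: {self.description}"
        return f"Corpus of {self.description}"


print(CorpusLabel("Poetry of Aphasics", special=True).designation())
print(CorpusLabel("The Times newspaper, 1991").designation())
```

Keeping the special status as an explicit flag, rather than inferring it from the description, reflects the point that the decision rests on a judgement about how the data arose and cannot be read off the text type alone.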

