
Preliminary Recommendations

Text Encoding Initiative

The Text Encoding Initiative guidelines (Sperberg-McQueen & Burnard (eds), 1993) suggest a classification of texts based purely on external (i.e. non-linguistic) criteria. The categories and subcategories specified are both exhaustive and exhausting to contemplate. In practical terms, that is, in terms of the person hours needed to encode each text and the computer resources needed to store the headers, the standards do not seem feasible for the size of corpora advocated in EAGLES (see the companion document on Corpus Typology). At the minimum level of encoding according to TEI, we record the details of how the texts were made available in machine-readable form (method, persons responsible, date and so on), details of publication (including date, address, copyright status, etc.) and a bibliographic description (information about the author such as age, sex, mother tongue, nationality and so on, and details of how the text is published: series, journal and so on). It is also recommended that the size of the file and statements on edition and series be recorded. Within these categories all non-linguistic information about the text can be stored.

The list is endless, and there is no requirement of linguistic relevance to help us distinguish the external features that are likely to be of value to users from those that are only of archival interest. Almost any conceivable feature may become relevant on some future occasion, but that is not sufficient reason for bearing the cost of ascertaining and recording such a feature for each text in the corpus.

In any case, experience has already shown that such precategorisation is almost certainly going to be inadequate, for two reasons. The first is simply the record of experience: customers always want features that are not readily available. The second is that whatever brings into focus some feature of the context of situation of a text will involve a reconceptualisation of the relationship between the language and the situation, and that will marginalise any previous classification in the area.

Consider the matter of 'political correctness', which is typical of a parameter of classification of language that was not recognised at all a few years ago, but might now be the object of serious studies. Various kinds of offensive language have traditionally been recorded, and blasphemy, for example, has a long history, as has pornography. Racist language has more recently been recognised and used in classification, and sexist language also. But political correctness requires a reconceptualisation of the relationship between writer, text, reader and third party that has specifically not informed the earlier classifications; therein lies its interest. Until we have the concepts that give an adequate reason for the classification, the classification is not worth doing.

So it is not good policy to second-guess the future. Instead we should pay close attention to those external features that are known or strongly suspected to influence linguistic choices systematically.

In parallel to this report, an EAGLES working group on Text Representation makes practical recommendations on how TEI policies can be adapted to corpora.


