Introduction

This Report and its recommendations should be read in conjunction with the EAGLES interim recommendations on Corpus Typology (EAGLES, 1996a). The categories in the Appendices have been prepared from the published accounts of practice and the few speculative discussions on the subject. The starting point is the NERC Report (Calzolari et al., 1995). Preliminary versions of this Report were circulated to members of the EAGLES Working Group on Corpora, and the Report was presented to a wider circle of colleagues at an EAGLES workshop in Madrid in January 1996. Feedback from these activities and from correspondents has been incorporated in this version.

In policy discussions about the format of these recommendations, it was made clear to the various groups of contributors that the desired format for classification is a hierarchy of attributes, each with a set of values, because this is a convenient format for database organisation and machine processing. This format has been adopted wherever possible. However, there are a number of circumstances where it is not an appropriate model for the patterns of language in communication, and there recourse has been made to other formats. In some cases, the lack of a full and finite set of values simply reflects a lack of experience in applying such a typological classification; such sets are likely to be clarified in a short time. In other cases, the set of values depends on what the texts say about themselves (see the discussion of reflexive under External criteria) and is therefore not under the control of people who wish to classify. In still other cases the governing criteria are internal, based on linguistic choices, and sometimes of quite a different order from external classifications. While these internal patterns are often reflected dimly in external distinctions, it would be a grave error of judgement to settle for a set of values that has a large arbitrary element, which in turn is likely to obscure the internal patterns.
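As a rough illustration of the attribute/value format described above, the following Python sketch models a small hierarchy in which one attribute has a closed set of values and another is left open, as might be the case where the value depends on what a text says about itself. The attribute names (`text`, `mode`, `self_description`), the class `Attribute` and the function `invalid_entries` are hypothetical and are not taken from the Appendices; this is one possible representation under those assumptions, not the format the Report prescribes.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Attribute:
    """One node in a hierarchy of attributes (hypothetical structure)."""
    name: str
    # A closed set of permitted values; None marks an open set,
    # e.g. where the value is taken from the text's own self-description.
    values: Optional[FrozenSet[str]] = None
    children: List["Attribute"] = field(default_factory=list)

    def accepts(self, value: str) -> bool:
        """An open attribute accepts anything; a closed one checks membership."""
        return self.values is None or value in self.values


def invalid_entries(root: Attribute, record: Dict[str, str]) -> List[str]:
    """Return the names in `record` that are unknown or carry a disallowed value."""
    known: Dict[str, Attribute] = {}
    stack = [root]
    while stack:  # walk the whole hierarchy, collecting attributes by name
        node = stack.pop()
        known[node.name] = node
        stack.extend(node.children)
    return sorted(
        name for name, value in record.items()
        if name not in known or not known[name].accepts(value)
    )


# Illustrative hierarchy: attribute names and values are assumptions.
TYPOLOGY = Attribute("text", children=[
    Attribute("mode", frozenset({"written", "spoken"})),
    Attribute("self_description"),  # open set of values
])

ok = invalid_entries(TYPOLOGY, {"mode": "written", "self_description": "a short note"})
bad = invalid_entries(TYPOLOGY, {"mode": "sung", "register": "formal"})
```

Here `ok` is an empty list, while `bad` names both the attribute with a disallowed value and the attribute that the hierarchy does not define. Leaving `values` unset is one simple way to record that a set of values is not yet finite or not under the classifier's control.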

At times the attribute/value model is simply irrelevant; indeed, it would be most unlikely that such a simplistic model would always be adequate for the description of human language.

The focus of this typology is written language. However, it is envisaged that some quantities of transcribed spoken material will normally form part of reference corpora, and so basic provision is made here for the classification of spoken data. Towards the end of this Report there is a short note on spoken texts.
