Corpus encoding

The document on preliminary recommendations for corpus encoding (EAGLES, 1996a) is a first version developed in collaboration with the LRE MULTEXT project. The work is aimed at providing a set of encoding standards for corpus-based work in natural language processing applications, and is compliant with the Text Encoding Initiative. It involves the specification of

a minimal encoding level that corpora must achieve to be considered standardised in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database).

General principles and recommendations common to all documents constitute the first two parts of this report. Special attention is paid to the corpus encoding header in part 3, and part 4 offers a full definition of three levels of conformance in the encoding of primary data.

Provisions for encoding of linguistic annotation and a data architecture for linguistic corpora are also found in this document.

A full set of Appendices contains useful complementary information.