Corpus annotation is the practice of adding interpretative, especially linguistic, information to a text corpus, by means of coding added to the electronic representation of the text itself. A typical case of corpus annotation is that of morphosyntactic annotation (also called grammatical tagging), whereby a label or tag is associated with each word token in the text, to indicate its grammatical classification (see the tagset guidelines).

For a written text, it is generally easy to make a distinction between the electronic representation of the text itself and annotations which are added to the text. On the other hand, for a spoken text (i.e. a transcribed representation of a spoken discourse) the difference between the text and its annotations cannot be taken for granted, particularly in the areas of phonemic, phonetic and prosodic transcription. Here the representation of the text itself entails linguistic interpretation at the phonological level. For the purposes of EAGLES, however, features of phonemic / phonetic / prosodic transcription are not considered to be part of the annotation. Consideration of such features may be found in a companion document on Spoken Language.

In principle, annotation can represent any type of analytic information about the language of a text. In practice, so far, the two types of annotation most commonly applied to a text have been:

Morphosyntactic annotation:
Annotation of the grammatical class of each word-token in a text, also referred to as "grammatical tagging" or "part of speech (POS) tagging";
Syntactic annotation:
Annotation of the structure of sentences, e.g. by means of a phrase-structure parse or dependency parse.
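
By way of illustration, the two types of annotation just described might be represented as in the following minimal sketch. The `word_TAG` notation, the tag labels (`DET`, `NOUN`, `VERB`, `ADP`) and the helper `parse_tagged` are assumptions made for this example only; they are not the tagset or the encoding recommended by EAGLES. The dependency analysis is likewise one possible analysis among several.

```python
from typing import List, Tuple


def parse_tagged(text: str, sep: str = "_") -> List[Tuple[str, str]]:
    """Split whitespace-separated 'word_TAG' items into (word, tag) pairs.

    The 'word_TAG' notation is a hypothetical encoding chosen for this sketch.
    """
    pairs = []
    for item in text.split():
        word, found, tag = item.rpartition(sep)
        if not found or not word or not tag:
            raise ValueError(f"item is not of the form word{sep}TAG: {item!r}")
        pairs.append((word, tag))
    return pairs


# Morphosyntactic annotation: one tag per word-token (illustrative tags only).
tagged = "The_DET cat_NOUN sat_VERB on_ADP the_DET mat_NOUN"
tokens = parse_tagged(tagged)

# Syntactic annotation as a dependency parse: for each token, the 1-based
# position of its head, with 0 marking the root of the sentence.
heads = [2, 3, 0, 6, 6, 3]

for position, ((word, tag), head) in enumerate(zip(tokens, heads), start=1):
    head_word = "ROOT" if head == 0 else tokens[head - 1][0]
    print(position, word, tag, head_word)
```

Keeping the tags and the heads in lists parallel to the tokens leaves the text itself unchanged, which reflects the distinction drawn above between the text and the annotations added to it.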

Other types of annotation which have been applied to text are:

Semantic annotation:
For example, annotating word-tokens for their dictionary sense, or for their semantic category;
Discourse annotation:
For example, the marking of discoursal relations such as anaphora in a text;
Lemma annotation:
Indicating the lemma of each word-token in a text.
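
The further types of annotation listed above can be thought of as additional fields attached to each word-token. The following is a minimal sketch under that assumption; the `Token` record, its field names, the sense label `perceive` and the helper `forms_by_lemma` are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    form: str                          # the word-token as it appears in the text
    tag: str                           # morphosyntactic tag (illustrative labels)
    lemma: Optional[str] = None        # lemma annotation
    sense: Optional[str] = None        # semantic annotation (hypothetical label)
    antecedent: Optional[int] = None   # discourse annotation: index of antecedent


sentence = [
    Token("She", "PRON", lemma="she"),
    Token("saw", "VERB", lemma="see", sense="perceive"),
    Token("the", "DET", lemma="the"),
    Token("cats", "NOUN", lemma="cat"),
    Token("and", "CONJ", lemma="and"),
    Token("they", "PRON", lemma="they", antecedent=3),  # anaphoric link to "cats"
    Token("ran", "VERB", lemma="run"),
]


def forms_by_lemma(tokens: List[Token]) -> Dict[str, List[str]]:
    """Group the word forms found in the text under their lemma."""
    grouped: Dict[str, List[str]] = {}
    for token in tokens:
        if token.lemma is not None:
            grouped.setdefault(token.lemma, []).append(token.form)
    return grouped
```

For instance, `forms_by_lemma(sentence)["see"]` yields the inflected form `saw`, and `sentence[sentence[5].antecedent].form` recovers `cats` as the antecedent of `they`.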

Because of their relative feasibility and their obvious application to areas such as lexicon and grammar development, morphosyntactic and syntactic annotation are regarded as the most important kinds of annotation at the present stage of the development of text corpora. They are certainly the best-developed types and those for which there are well-established working practices. Hence, they will be the major topics of EAGLES recommendations. Morphosyntactic annotation, in particular, is the subject of recommendations presented here. Syntactic annotation is the subject of a separate document. Other types of annotation, such as semantic tagging, are necessarily given less attention at the present stage, as the work that has been done in these areas is less systematised and more experimental. Lemma annotation is closely related to morphosyntactic annotation, and may be treated as an adjunct to it.

At the current stage, detailed provisional conclusions have been reached on the recommendation of standards for morphosyntactic annotation (or grammatical tagging, as it is generally called).

