Documentation and user information

As has become evident in preceding sections, an essential part of any corpus annotation project is detailed documentation of the annotation scheme employed. (For syntactic annotation, an annotation scheme is alternatively called a parsing scheme.) Without documentation provided by the originators, an annotated corpus is extremely difficult for other users to apply to their own research tasks. Decisions taken in developing the annotation scheme, as well as in applying it, should be well documented, so that future users apply the scheme in the same way as its originators and the annotation remains consistent in any new application.

The documentation should therefore include at least the following classes of information:

What layers of annotation (in terms of layers (a)-(h) above) have been undertaken. Each of these layers of annotation represents a wide area of possible annotation, and therefore more detail should be included in the documentation as to what phenomena in particular are marked by the annotation scheme.
What is the set of annotation devices used (e.g. brackets, labels).
What are the meanings of these devices (e.g. Ns = singular noun phrase). In the documentation of many existing schemes, all that is presented is a list of the labels used, with a short explanation of each symbol (e.g. VP = Verb Phrase). As has been shown, this type of explanation is not sufficient to describe the use of a label in an annotated corpus. Each symbol should be described, and illustrated with one or more examples.
What are the conventions for the application of the annotation devices to texts. A parsing scheme (or 'grammatical representation'; see Voutilainen 1994) is more than the set of annotation devices and their meanings covered by the two preceding points. It includes the set of guidelines or conventions whereby the annotation symbols are to be applied to text sentences, such that (ideally) two different annotators, applying the scheme manually to the same sentence, would agree on the analysis. In this sense, a detailed annotation scheme is a guarantee of consistency. A parsing scheme may include reference to a lexicon, to a grammar or to a reference corpus of annotated sentences. In practice it is very difficult for a parsing scheme to achieve total coverage and total explicitness for a corpus, and few corpus annotation projects have come close. The nearest approach is the highly detailed parsing scheme provided by Sampson (1995) for the SUSANNE Corpus. Sampson's book discusses the various decisions taken in the development and application of the SUSANNE annotation scheme, and provides examples of the cases in which such decisions must be taken. In this respect, Sampson's book is, so far, a unique achievement.
What is the measurable quality of the annotation. This includes:
- to what extent the corpus has been checked
- accuracy rate
- consistency rate
These measures of annotation quality depend mainly on how the corpus was annotated. An automatic annotation requires accuracy figures, usually given in terms of recall and/or precision. A recall of less than 100% indicates that some appropriate readings have been discarded, while a precision of less than 100% indicates that superfluous readings remain in the output in the form of system ambiguities (see Voutilainen et al. 1992 for discussion).
A consistency rate should be given for a manual or manually post-edited annotation. Different annotators can be given a certain percentage of overlapping material, and their analyses of it can then be compared to produce consistency figures. The method of comparison should also be documented.
The extent to which the corpus has been checked partly overlaps with consistency checking, but it may also be relevant for a large automatically annotated corpus. In some cases, automatically annotated corpora are manually checked after annotation, and any modifications to the automatic annotation should be documented.
Specificity -- How detailed or shallow is the analysis. To a certain extent, the specificity of the analysis may be shown by the levels of annotation that have been applied. However, more detailed documentation may be necessary in order to make clear the level to which an annotation is undertaken -- for example, some aspects of a deep or logical grammar may be included in an annotation, while others are not marked (e.g. marking of 'logical subject/object', but no marking of 'traces').
Ambiguity -- To what extent and in what respects has disambiguation (of machine-generated ambiguities) been carried out. During the annotation of a corpus, all ambiguities may be resolved, or ambiguous structures may be left in the markup. Resolution of problematic ambiguities should be documented, as should any ambiguities that are left in the corpus.
Incompleteness -- To what extent and in what respects is annotation at any particular layer incomplete. At any particular level of annotation, certain markings may be ignored by the annotation scheme, for ease of automated annotation, or because of the intended purpose of the resource. This information should also be included in the documentation.
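
The quality figures discussed in the list above (recall, precision and a consistency rate) can be computed in a few lines. The following Python sketch is purely illustrative and is not taken from any of the schemes cited here. It assumes that a checked reference annotation supplies exactly one appropriate reading per token, that the system output is the set of readings left on each token, and that two annotators' analyses of the overlapping material are aligned item by item. The function names recall_precision and consistency_rate, and the labels in the example data, are hypothetical. Simple percentage agreement is only one possible method of comparison; whichever method is used should be stated in the documentation.

```python
from typing import Sequence, Set, Tuple


def recall_precision(gold: Sequence[str],
                     output: Sequence[Set[str]]) -> Tuple[float, float]:
    """Recall and precision for output that may leave several readings per token.

    gold[i]   : the single appropriate reading for token i (assumed format)
    output[i] : the set of readings the system left on token i
    """
    if len(gold) != len(output):
        raise ValueError("gold and output must cover the same tokens")
    # Appropriate readings that survived in the output.
    retained = sum(1 for g, readings in zip(gold, output) if g in readings)
    # All readings left in the output, including superfluous ones.
    emitted = sum(len(readings) for readings in output)
    recall = retained / len(gold) if gold else 0.0
    precision = retained / emitted if emitted else 0.0
    return recall, precision


def consistency_rate(first: Sequence[str], second: Sequence[str]) -> float:
    """Proportion of overlapping items on which two annotators agree."""
    if len(first) != len(second):
        raise ValueError("both annotators must cover the same items")
    if not first:
        return 0.0
    agreed = sum(1 for a, b in zip(first, second) if a == b)
    return agreed / len(first)


# Hypothetical example data: four tokens, illustrative labels only.
gold = ["NP", "VP", "PP", "NP"]
system_output = [{"NP"}, {"VP", "NP"}, {"NP"}, {"NP"}]
print(recall_precision(gold, system_output))   # (0.75, 0.6)

annotator_a = ["NP", "VP", "PP", "NP"]
annotator_b = ["NP", "VP", "NP", "NP"]
print(consistency_rate(annotator_a, annotator_b))  # 0.75
```

In the example, the third token has lost its appropriate reading, which lowers recall, and the second token keeps a superfluous reading, which lowers precision.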

In the future, a further component of documentation will presumably be necessary. Assuming that there will eventually be acceptance of EAGLES guidelines as a standard for syntactic annotation, it will be highly desirable to state to what extent and in what respects a given annotation scheme conforms to the EAGLES standard. We reiterate, however, that at the present stage, the guidelines put forward in this document are highly provisional.
