Recommended annotations

In this section, we propose informal criteria for applying a number of labels, corresponding to the category labels proposed by the Lexicon/Syntax Subgroup. The main motivation for following the recommendations formulated by the Lexicon/Syntax Subgroup is that the structure of sentences is determined to a large extent by the lexical properties of words. To ensure compatibility of lexicons and corpora, the syntactic labellings employed in lexicons should, where possible, also be employed in the annotation of corpora.

The recommended syntactic labels and constituents proposed in the report of the Lexicon/Syntax Subgroup are widely recognised and well-established in current corpus annotation practices that use phrase-structure-like representations. They can be seen as `familiar landmarks' which anyone wishing to make use of a syntactically annotated corpus would probably expect to find. Arguably, they represent the constituents and classifications needed to arrive at a minimally interesting annotated corpus; these constituents are outlined and defined in this section.

Apart from the recommended syntactic categories, we define and illustrate a number of optional ones. These optional categories are merely illustrative for the purposes of the present report, and represent a cross-section of the annotation practices currently employed.

The form of the codings we use is tentative and has been chosen for convenience (see text representation). We use rather verbose labels for readability's sake. The exact notation, e.g. whether to use a notation like Subj-Agent or Su/Ag to denote that a constituent is the Subject of the clause and has the semantic function Agent, will presumably be determined in the documents to be produced by the Text Representation Subgroup within EAGLES. Essentially, the mark-up of the codes is trivial, since (assuming that the labels have been applied consistently and unambiguously) they may be mapped onto any other mark-up.
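The claim that consistently applied labels can be mapped onto any other mark-up can be made concrete with a minimal sketch. Only the pair Subj-Agent and Su/Ag comes from the text above; the abbreviation tables, the remaining labels (Obj, Patient), and the function names are hypothetical and serve purely as illustration.

```python
# Hypothetical abbreviation tables. Only Subj/Su and Agent/Ag are taken
# from the report; the other entries are invented for illustration.
FUNCTION_ABBREV = {"Subj": "Su", "Obj": "Ob"}
ROLE_ABBREV = {"Agent": "Ag", "Patient": "Pa"}


def invert(table):
    """Invert a label table, refusing tables that are not one-to-one."""
    inverse = {short: verbose for verbose, short in table.items()}
    if len(inverse) != len(table):
        raise ValueError("abbreviations must be unique to allow mapping back")
    return inverse


def to_short(label):
    """Map a verbose 'Function-Role' label to the short 'Fu/Ro' notation."""
    function, role = label.split("-", 1)
    return f"{FUNCTION_ABBREV[function]}/{ROLE_ABBREV[role]}"


def to_verbose(label):
    """Map a short 'Fu/Ro' label back to the verbose 'Function-Role' notation."""
    function, role = label.split("/", 1)
    return f"{invert(FUNCTION_ABBREV)[function]}-{invert(ROLE_ABBREV)[role]}"


print(to_short("Subj-Agent"))
print(to_verbose("Su/Ag"))
```

The mapping is lossless only because each verbose label corresponds to exactly one short label; this is the sense in which the labels must be applied consistently and unambiguously.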

Each category will be informally defined and illustrated by one or more examples.

However, it should be emphasised that the `semantics' of these category labels (in the sense of the set of objects in corpus data they refer to) is variable and above all language-dependent. There can be no assumption, for example, that the Noun Phrase as used in one annotation scheme will map straightforwardly onto the Noun Phrase as used in another. (See further the discussion of `parsing schemes'.)

In addition to these examples, and by way of further illustration, in Appendix A we will present short passages (of about ten sentences) from some of the EU languages covered by this document. It is recommended that groups that produce an annotated corpus minimally document the corpus in a similar way (see section 6). That is, they should list the categories used and provide an example of each; they should also provide at least a small stretch of text illustrating the application of the annotation scheme.
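As a rough sketch of how such minimal documentation might be derived from an annotated sample, the following collects each category label together with the first constituent that carries it. The labelled bracketing used here (an opening bracket followed directly by the label) is an assumption made for illustration and is not a notation prescribed by this report; the function name and the sample sentence are likewise hypothetical, and the input is assumed to be well-formed.

```python
import re


def category_inventory(annotated_sentences):
    """Return {label: first example} for labelled bracketings.

    Assumes well-formed input in a hypothetical notation such as
    "[NP the mat]", where the label directly follows the opening bracket.
    """
    inventory = {}
    for sentence in annotated_sentences:
        tokens = re.findall(r"\[|\]|[^\s\[\]]+", sentence)
        stack = []  # open constituents as (label, words collected so far)
        i = 0
        while i < len(tokens):
            token = tokens[i]
            if token == "[":
                i += 1  # the token after the bracket is the label
                stack.append((tokens[i], []))
            elif token == "]":
                label, words = stack.pop()
                # Keep only the first example seen for each label.
                inventory.setdefault(label, " ".join(words))
                if stack:
                    # Words of a closed constituent also belong to its parent.
                    stack[-1][1].extend(words)
            else:
                stack[-1][1].append(token)
            i += 1
    return inventory


sample = ["[S [NP The cat] [VP sat [PP on [NP the mat]]]]"]
for label, example in category_inventory(sample).items():
    print(label, "->", example)
```

A listing of this kind covers only the first half of the recommendation; a short stretch of fully annotated text should still accompany it.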

The recommendations that we make are expressed in a form which applies to phrase structure models of annotation. Examples of dependency-based models are few, and it would be premature to formulate recommendations for dependency-based annotation. However, reference should be made to the Helsinki Constraint Grammar as a key exemplar.

If a phrase structure annotation is adopted, we recommend the following categories:

In the following subsections we will define and illustrate these labels and structures. We will also mention some tests that can be used to identify a number of constituents. (Such tests can be found in a number of introductory textbooks, e.g. Radford 1988.)

It should be noted that the examples illustrating these labels are meant to illustrate only the labels themselves and not to provide a complete annotation.

