Rationale for the present proposal

The guidelines for morphosyntactic annotation are very similar to those for the morphosyntactic level in the lexicon. Large lexicons are increasingly being used in the annotation of corpora, and corpora are increasingly being used as sources of information to be acquired by lexicons. These processes are increasingly being automated. There is therefore a great advantage in being able to transduce directly from word-class annotations in texts to morphosyntactic information in lexicons, and vice versa. On the other hand, there are reasons for assuming that these two types of word classification need not be identical.

One reason for differences is that morphosyntactic annotation, which has so far been carried out extensively on English but not on other languages, is at a relatively primitive stage of development. It is typically carried out largely automatically but without the benefit of a full parse, frequently using simple statistical models of grammar such as Hidden Markov Models (Rabiner 1990).
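To make concrete what tagging with a Hidden Markov Model but without a full parse involves, the following is a minimal sketch of Viterbi decoding over a bigram tag model. The three-tag tagset, the probabilities and the function name viterbi are all invented for illustration; they are not taken from any actual tagger or corpus.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    Unseen events receive the small probability `floor` (an assumption of
    this sketch, standing in for proper smoothing).
    """
    if not words:
        return []
    # best[i][t]: log-probability of the best tag path ending in t at word i
    best = [{t: math.log(start_p.get(t, floor))
                + math.log(emit_p[t].get(words[0], floor)) for t in tags}]
    back = []  # back[i][t]: best predecessor of tag t at word i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p]
                       + math.log(trans_p[p].get(t, floor)))
            scores[t] = (best[-1][prev]
                         + math.log(trans_p[prev].get(t, floor))
                         + math.log(emit_p[t].get(word, floor)))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy model: every number below is made up for the example.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 0.9},
    "NOUN": {"cans": 0.3, "rust": 0.2},
    "VERB": {"cans": 0.1, "rust": 0.3},
}

print(viterbi(["the", "cans", "rust"], TAGS, START_P, TRANS_P, EMIT_P))
```

The model sees only adjacent tag pairs and word-tag frequencies. Both "cans" and "rust" are ambiguous between noun and verb here, and the choice between them rests entirely on those local statistics rather than on any syntactic analysis of the sentence.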

Automatic tag disambiguation is a major problem, resulting in a substantial rate of error or of failure to disambiguate (typically of several percent). Although these less-than-ideal results can in principle be corrected by hand, in practice the correction of a large corpus (say, of 100 million words) is a Herculean task. Thus, while annotators might wish to provide as much lexically relevant information as possible in the tagged corpus, in practice they are limited by what current taggers are realistically able to achieve. Some attributes or values routinely entered in lexicons are virtually impossible to mark automatically in a corpus without a prohibitive amount of error. For example, the distinctions between the different functions of the base form of the English verb (indicative plural, imperative, subjunctive, etc.) are virtually impossible to make without a full parse, which itself would produce unreliable results in the present state of the art.

A second reason is the opposite of the first: just as there are kinds of information which are expected in a lexicon, but cannot be included in tagging, so there are kinds of information which may be useful for tagging, but may be extraneous to morphosyntax in the lexicon. It may be useful, for automatic tagging, to mark some syntactic or semantic distinctions, thereby going beyond the definition of morphosyntax. Examples include the marking of the purely syntactic distinction between attributive-only and predicative-only adjectives, or the marking of small semantic classes such as names of months or names of days, in order to facilitate the identification of dates (which have a distinctive syntactic structure) in certain kinds of texts. While these values are normally excluded from the morphosyntactic level in the lexicon, they can be easy to identify in texts, and may have a valuable syntactic role in disambiguating neighbouring words. Also, in text corpora, one constantly finds the necessity to deal with phenomena which have been regarded as peripheral to a lexicon, such as naming expressions (including proper nouns), acronyms, formulae and special symbols. In all these respects, it would artificially constrain tagging, and often make it less useful, if the tagset had to mirror the attributes and values typically found in lexicons. Grammatical tagging, to use the traditional term, is a less clearly definable process than is implied by the stricter term morphosyntactic annotation.

The relation between the lexicon guidelines and these morphosyntactic annotation guidelines will be explained in the section on harmonisation with lexicon proposals. At this point, it is important to note that the distinctions made in morphosyntactic tagging may usefully correspond to various linguistic levels (morphological, morphosyntactic, syntactic, semantic) in the lexicon. But the level with which they are centrally concerned is that of morphosyntax.

Considering 'levels' in a different sense, it is also essential to distinguish levels of abstraction at which the notion of tagset may be identified.

Character-coding level:

This is the least abstract level, where we identify a morphosyntactic tag with a particular sequence of characters in a marked-up text.

Descriptive level:

This is a more abstract level, where a tag is identified with a set of attribute-value pairs in a morphosyntactic description of a particular language. For a completely explicit description, it is desirable to formalise this description as an attribute-value hierarchy with monotonic inheritance. The tagset may then be termed a logical tagset.

Cross-linguistic level:

This is the most abstract level, where we are examining attributes (e.g. number) and values (e.g. singular, plural) as generically applied to a number of different languages. This is the level we are concerned with in the guidelines.

The Intermediate Tagset, suggested as a way of mapping different language-specific tagsets into a common set of attributes and values, is an example of tags considered at this level.
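The three levels can be illustrated with a small sketch. Everything in it is hypothetical: the tag strings, the two tagsets, the class names and the helper functions are invented for the example and do not reproduce the Intermediate Tagset or any published tagset. Tag strings stand for the character-coding level, the attribute-value pairs obtained through the hierarchy stand for the descriptive level, and the comparison of two tagsets through shared attributes and values stands for the cross-linguistic level.

```python
# Hypothetical attribute-value hierarchy: class -> (parent, pairs added here).
HIERARCHY = {
    "word": (None, {}),
    "noun": ("word", {"pos": "noun"}),
    "common_noun": ("noun", {"type": "common"}),
    "verb": ("word", {"pos": "verb"}),
}

def merge_monotonic(pairs, extra):
    """Add `extra` to `pairs`, refusing to override a value already present."""
    merged = dict(pairs)
    for attr, value in extra.items():
        if attr in merged and merged[attr] != value:
            raise ValueError(f"non-monotonic override of {attr!r}")
        merged[attr] = value
    return merged

def inherited(cls):
    """Collect attribute-value pairs from the root of the hierarchy to `cls`."""
    parent, own = HIERARCHY[cls]
    base = inherited(parent) if parent else {}
    return merge_monotonic(base, own)

# Character-coding level: the keys are tag strings as they might appear in
# marked-up text. Descriptive level: each maps to a class plus its own pairs.
TAGSET_A = {
    "NCS": ("common_noun", {"number": "singular"}),
    "NCP": ("common_noun", {"number": "plural"}),
}
TAGSET_B = {
    "Nc-p": ("common_noun", {"number": "plural"}),
}

def describe(tagset, tag):
    """Expand a tag string into its full set of attribute-value pairs."""
    cls, extra = tagset[tag]
    return merge_monotonic(inherited(cls), extra)

# Cross-linguistic level: differently coded tags from two tagsets are
# compared through their common attributes and values.
print(describe(TAGSET_A, "NCP"))
print(describe(TAGSET_A, "NCP") == describe(TAGSET_B, "Nc-p"))
```

Under this reading of monotonic inheritance, a more specific class may add attribute-value pairs but may not change a value it inherits, which is why merge_monotonic raises an error on a conflicting value.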
