Preliminary Recommendations

 

Lexical specifications vs. corpus tagsets

The interdependence between lexicon and corpus is an important consideration for any activity aiming at creating lexicons and/or tagsets to be shared by, and made available to, the community. The underlying motivation is the view of corpus tagging as just one of the possible applications of a computational lexicon, which should itself be seen, more neutrally, as an application-independent set of lexicon specifications. Corpus tagging is in fact the first obvious application of a computational lexicon and cannot be developed independently of it: both lexicon specialists and corpus specialists feel that it is important to reconcile their two views.

The difference in perspective between the lexicon specification area and that of corpus annotation can be seen at the level of terminology:

For the sake of reusability, lexical descriptions should be (as far as possible) independent of specific applications, and should aim at a general description of each language.

The actual corpus tags depend on at least the following:

  1. The lexicon features; and
  2. The capabilities of state-of-the-art taggers to disambiguate between different lexicon descriptions or different types of homography present in different languages.

Therefore, morphosyntax can be encoded in a lexicon with fine granularity, while a set of corpus tags usually reflects broader categories.
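The many-to-one relation between fine-grained lexical descriptions and broader corpus tags can be sketched as a simple mapping table. The descriptions and tag names below are invented for illustration only and follow no particular standard:

```python
# Sketch: collapsing fine-grained lexical descriptions into broader corpus
# tags. All descriptions and tag names here are hypothetical illustrations,
# not taken from any standard tagset.
LEXICON_TO_CORPUS = {
    "verb.indicative.present.3sg":    "VERB-PRES",  # mood distinction collapsed
    "verb.subjunctive.present.3sg":   "VERB-PRES",
    "noun.common.masculine.singular": "NOUN-SG",    # gender distinction collapsed
    "noun.common.feminine.singular":  "NOUN-SG",
}

def corpus_tag(lexical_description: str) -> str:
    """Map a fine-grained lexical description to its collapsed corpus tag."""
    return LEXICON_TO_CORPUS[lexical_description]
```

Several distinct lexical descriptions map to one corpus tag, which is the sense in which the corpus tag is an underspecified version of the lexical description.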

Corpus tags are, in fact, developed for each language with a particular application in mind: producing a corpus tagged for part of speech (and possibly other morphosyntactic information) by means of automatic disambiguation. Ideally, a corpus would be tagged with the lexical descriptions themselves. However, this is well known to be considerably beyond the capabilities of state-of-the-art tagging techniques. Corpus tags are therefore to be seen as a kind of underspecified lexical description. There are two reasons why we may want (or need) to underspecify corpus tags:

  1. Experience shows that some distinctions are difficult to get right with a high rate of accuracy. For example, in some languages, the disambiguation of indicative present and subjunctive present in a corpus is extremely difficult by automatic means.
  2. In order to train a tagger, we typically need statistical tables (based on co-occurrences of tags). If we have a large tagset, we need a very large corpus to train the disambiguator, in order to observe rare co-occurrences. For example, in the proposal for French presented in the MULTEXT document (Bel et al., 1995), there are 249 different lexical descriptions, but only 74 collapsed corpus tags. Experience (Church, UPenn Treebank, IBM France, etc.) shows that the tagset should be under 100 tags. In fact, the Penn Treebank project collapsed many tags compared to the original Brown tagset, and got better results.
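The data-sparsity argument in point 2 can be made concrete with a little arithmetic: a bigram tagger must estimate co-occurrence statistics for every ordered pair of tags, so the number of parameters grows quadratically with the size of the tagset. Using the MULTEXT figures for French quoted above:

```python
# The number of tag-pair (bigram) co-occurrence parameters a tagger must
# estimate grows quadratically with the tagset size, so a larger tagset
# needs a much larger training corpus to observe rare co-occurrences.
def bigram_parameters(tagset_size: int) -> int:
    """Number of ordered tag pairs for a tagset of the given size."""
    return tagset_size ** 2

lexical = bigram_parameters(249)  # full lexical descriptions for French (MULTEXT)
corpus = bigram_parameters(74)    # collapsed corpus tags
print(lexical)  # 62001
print(corpus)   # 5476
```

Collapsing from 249 lexical descriptions to 74 corpus tags thus cuts the number of bigram parameters by more than a factor of ten, which is one reason tagsets under 100 tags are easier to train reliably.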

Two other observations are of relevance as regards the relation between lexicon specifications and corpus tags.

  1. Tag classes sometimes differ genuinely from lexical descriptions. For example, classes for punctuation are needed, and certain types of semantic, pragmatic, or lexical information can be present in the tags (e.g. the days of the week).
  2. Furthermore, decisions on tag collapsing are often language-dependent, and therefore it may not be appropriate to have completely identical tagsets across languages. Certain language-specific peculiarities must be preserved (e.g. if certain distinctions can easily be maintained by an automatic tagger, it may be useful to preserve them).
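The language dependence of collapsing decisions described above can be sketched as per-language mapping tables: a distinction that a tagger can maintain reliably in one language may have to be collapsed in another. The languages, descriptions, and tags below are hypothetical placeholders:

```python
# Hypothetical per-language collapsing tables. Language A's tagger can
# reliably distinguish indicative from subjunctive, so the distinction is
# kept in its corpus tagset; language B's cannot, so both descriptions map
# to a single underspecified tag.
COLLAPSE = {
    "language_A": {
        "verb.indicative.present":  "V-IND-PRES",   # distinction preserved
        "verb.subjunctive.present": "V-SUBJ-PRES",
    },
    "language_B": {
        "verb.indicative.present":  "V-PRES",       # distinction collapsed
        "verb.subjunctive.present": "V-PRES",
    },
}

def corpus_tag(language: str, lexical_description: str) -> str:
    """Look up the corpus tag for a lexical description in a given language."""
    return COLLAPSE[language][lexical_description]
```

The shared lexical descriptions stay identical across languages; only the collapsing tables differ, which is one way to reconcile cross-language comparability with language-specific tagsets.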

