Intermediate Tagset

Recommendations

For any tagset designed for the annotation of texts in a given language, the guidelines do not impose any particular set of choices to be used in distinguishing and representing grammatical categories. But it is important that the tagset should be mappable (if possible automatically) on to a set of attribute-value pairs in conformity with the guidelines. This includes the possibility (indeed the probability) that the annotator will need to define optional values other than the special extensions.

This mapping will have the additional value that it will enable the annotator to transfer the information in a morphosyntactically-tagged corpus to the morphosyntactic component of a lexicon (e.g. in order to record frequencies of word-tag pairs). It will also enable a lexicon of the given language to be used as a major input to automatic tagging.

To aid this mapping, and to test its efficacy, we suggest using an Intermediate Tagset as a language-neutral representation of a set of attribute-value pairs, based on the word categorisation. This can act as an intermediate stage of mapping between the tags assigned to textwords in corpus annotation and the labels assigned to words in a lexicon. Another important function of this Intermediate Tagset is to act as a basis for interchange between different local tagsets for particular corpora and particular languages.

A convenient linear method of representation is arrived at as follows:

(i)

Represent the obligatory part-of-speech attribute value by using one or more letters, as indicated in obligatory major categories:

N = noun	AV = adverb	I = interjection
V = verb	AP = adposition	U = unique/unassigned
AJ = adjective	C = conjunction	R = residual
PD = pronoun/determiner	NU = numeral	PU = punctuation
AT = article

(ii)

Represent the whole tag as a linear string of characters, with the attributes numbered (i), (ii), (iii), (iv), ... occupying the first, second, third, fourth, ... places in a string of digits.

(iii)

Represent each value of each attribute by the Arabic digit assigned to it in the recommended attributes and values. Thus, the interpretation of the string of digits will vary according to the part-of-speech category. (The optional attributes and values may also be used, but have to be specially defined for each tagset.)

Examples:

A common noun, feminine, plural, countable, is represented: N122010
A 3rd person, singular, finite, indicative, past tense, active, main, non-phrasal, non-reflexive verb is represented: V3011141101200
A comparative, general adjective is represented: AJ2000000
A coordinating conjunction, simple, is represented: C110
An interjection is represented: I
A plural symbol (as in two Bs) is represented: R320
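
As a minimal sketch of steps (i) to (iii), the following Python splits an Intermediate Tagset string into its part-of-speech code and its attribute values. The function name `split_tag` is hypothetical, and the sketch assumes that every attribute value is a single digit, as in the examples above. Interpreting each position still requires the attribute tables for the relevant part of speech, which are not reproduced here.

```python
import re

# Part-of-speech codes from step (i).
POS_CODES = {
    "N": "noun", "V": "verb", "AJ": "adjective", "PD": "pronoun/determiner",
    "AT": "article", "AV": "adverb", "AP": "adposition", "C": "conjunction",
    "NU": "numeral", "I": "interjection", "U": "unique/unassigned",
    "R": "residual", "PU": "punctuation",
}

def split_tag(tag):
    """Split a tag into (POS code, list of attribute values).

    Assumption: one digit per attribute, so position n in the digit
    string holds the value of the n-th attribute.
    """
    match = re.fullmatch(r"([A-Z]+)([0-9]*)", tag)
    if match is None or match.group(1) not in POS_CODES:
        raise ValueError(f"not a recognised intermediate tag: {tag!r}")
    code, digits = match.groups()
    return code, [int(d) for d in digits]

code, values = split_tag("N122010")
print(POS_CODES[code], values)  # prints: noun [1, 2, 2, 0, 1, 0]
```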

Wherever an attribute is inapplicable to a given word in a given tagset, the value 0 fills that attribute's place in the string of digits. (For more on the use of 0, see the section on underspecification.) When 0s occur in final position, with no non-zero digits following, they could be omitted without loss of information. Thus a comparative general adjective could simply be represented: AJ2. However, for clarity, the 0s should be retained.
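
The equivalence between the short and the full form can be sketched as follows. Both function names are hypothetical, and the number of attribute places for a part of speech is passed in by the caller; the value 7 used for adjectives below is simply read off the example AJ2000000.

```python
def strip_trailing_zeros(tag):
    """Drop final 0s, which carry no information (AJ2000000 -> AJ2)."""
    # Part-of-speech codes contain only letters, so only digits are removed.
    return tag.rstrip("0")

def pad_tag(tag, n_attributes):
    """Restore final 0s so the digit string has n_attributes places."""
    prefix = tag.rstrip("0123456789")   # the part-of-speech letters
    digits = tag[len(prefix):]
    if len(digits) > n_attributes:
        raise ValueError(f"{tag!r} has more than {n_attributes} digits")
    return prefix + digits.ljust(n_attributes, "0")

print(strip_trailing_zeros("AJ2000000"))  # prints: AJ2
print(pad_tag("AJ2", 7))                  # prints: AJ2000000
```

Note that an internal 0 is never removed: C110 shortens to C11, but N122010 only to N12201.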

There may be cases where a category needed for tagging in a specific language (given current limitations of automatic tagging) cuts across two or more values in the optional categories of the guidelines, and may even cut across different attributes as well. In such cases it is necessary to define what the category means by using the OR operator ( | ), with brackets to identify the arguments of this operator. Another operator we can use is the negative operator, signalled by the minus (-), so that -4 means "all values of this attribute except the 4th".

A good example is the base form of the English verb. The finite base form in English can be specified by using a disjunction "[finite indicative present tense [plural or [first person or second person] singular] or imperative or subjunctive]". This is spelled out, using the Intermediate Tagset, as follows:

V[[-301|002]111|000121|000130]0100000

Even this leaves out the non-finite use of the base form, as an infinitive. This example, awkward as it is, has an explanatory value: the relation between tagsets and a language-neutral representation can be very indirect. Although such cases as this are unusual, they show that the mapping between a lexicon and a tagged corpus is not always an easy one to automate.
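
One way to make the bracket notation operational is to translate it into a regular expression and test fully specified tags against it. The sketch below is an illustration, not part of the guidelines: the function name `pattern_to_regex` is hypothetical, and it assumes that a negated value such as -3 stands for any single digit other than 3, including 0.

```python
import re

def pattern_to_regex(pattern):
    """Translate bracket/OR/minus notation into a compiled regex.

    [ ... ]  groups the arguments of the OR operator
    |        separates alternatives
    -d       any single digit except d (assumption: 0 is allowed)
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            parts.append("(?:")
        elif ch == "]":
            parts.append(")")
        elif ch == "|":
            parts.append("|")
        elif ch == "-":
            i += 1
            if i >= len(pattern) or pattern[i] not in "0123456789":
                raise ValueError("'-' must be followed by a digit")
            parts.append(f"(?!{pattern[i]})[0-9]")
        elif ch.isalnum():
            parts.append(re.escape(ch))
        else:
            raise ValueError(f"unexpected character {ch!r}")
        i += 1
    return re.compile("".join(parts))

base_form = pattern_to_regex("V[[-301|002]111|000121|000130]0100000")

print(bool(base_form.fullmatch("V1011110100000")))  # 1st sg. present: True
print(bool(base_form.fullmatch("V3011110100000")))  # 3rd sg. present: False
```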

To illustrate the method of converting a tagset into this type of language-neutral labelling, we present in Appendix A a rendering into an Intermediate Tagset of a tagset for English, and in Appendix B a rendering of a set of dictionary codes for Italian; the former is based on the English implementation of the lexicon guidelines and the latter on the codes of the DMI (Calzolari et al. 1980). (For English, with its simple morphology, we find the most complex interrelation between the morphosyntactic guidelines and the requirements of a particular language. With other languages, the mapping from the language-specific tagset to the Intermediate Tagset is likely to be more straightforward.)
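
The role of the Intermediate Tagset as a pivot between local tagsets can be pictured with a small lookup sketch. The local tag names on the left of each table are invented for illustration and do not come from the appendices; the intermediate strings are the examples given earlier in this section. The sketch assumes a one-to-one mapping, which, as the English base form shows, does not always hold.

```python
# Hypothetical local tagsets (invented names), each mapped to
# Intermediate Tagset strings taken from the examples in this section.
CORPUS_TO_INTERMEDIATE = {"NCFP": "N122010", "CC": "C110", "ITJ": "I"}
LEXICON_TO_INTERMEDIATE = {"n-f-pl": "N122010", "conj-coord": "C110", "interj": "I"}

def invert(mapping):
    """Invert a local-to-intermediate table, refusing ambiguous entries."""
    inverted = {}
    for local, intermediate in mapping.items():
        if intermediate in inverted:
            raise ValueError(f"{intermediate} maps back to more than one tag")
        inverted[intermediate] = local
    return inverted

def convert(tag, source, target):
    """Convert a tag from one local tagset to another via the pivot."""
    return invert(target)[source[tag]]

print(convert("NCFP", CORPUS_TO_INTERMEDIATE, LEXICON_TO_INTERMEDIATE))
# prints: n-f-pl
```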
