
Recommendations

Recommended attributes/values

These are specified below under part-of-speech headings. Each numbered heading refers to the number assigned under major category. The set of values for each attribute is definitely not a closed set and will need to be augmented to handle peculiar features of individual languages. Not all EU languages will instantiate all attributes or all values of an individual attribute. For each attribute, 0 designates a zero value, meaning "this attribute is not applicable" for the particular language, or for a particular textword in that language. The standard requirement for these recommended attributes/values is that, if they occur in a particular language, then it is advisable that the tagset of that language should encode them.

1. Nouns (N)

[Generic optional attributes Language specific optional attributes]

(i) Type: 1. Common 2. Proper      
(ii) Gender: 1. Masculine 2. Feminine 3. Neuter    
(iii) Number: 1. Singular 2. Plural      
(iv) Case: 1. Nominative 2. Genitive 3. Dative 4. Accusative 5. Vocative

Inflection type is omitted as an attribute, as it is purely morphological.

2. Verbs (V)

[Generic optional attributes Language specific optional attributes]

(i) Person: 1. First 2. Second 3. Third  
(ii) Gender: 1. Masculine 2. Feminine 3. Neuter  
(iii) Number: 1. Singular 2. Plural   
(iv) Finiteness: 1. Finite 2. Non-finite   
(v) Verb form / Mood: 1. Indicative 2. Subjunctive 3. Imperative 4. Conditional
    5. Infinitive 6. Participle 7. Gerund 8. Supine
(vi) Tense: 1. Present 2. Imperfect 3. Future 4. Past
(vii) Voice: 1. Active 2. Passive   
(viii) Status: 1. Main 2. Auxiliary   

Attribute (v) has two names because of different traditions, for different European languages, regarding the use of the term Mood. In fact, the first four values (v) 1-4 are applicable to Finite Verbs and the last four (v) 5-8 to Non-finite Verbs.
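
The dependency between attributes (iv) and (v) can be expressed as a consistency check. This is a minimal sketch, assuming numeric values as listed above and 0 for "not applicable"; the function name is_consistent is hypothetical, and the guidelines themselves do not mandate any such validation.

```python
# Values of attribute (v) Verb form / Mood, split as described in the text.
FINITE_FORMS = {1: "Indicative", 2: "Subjunctive", 3: "Imperative", 4: "Conditional"}
NON_FINITE_FORMS = {5: "Infinitive", 6: "Participle", 7: "Gerund", 8: "Supine"}


def is_consistent(finiteness, verb_form):
    """Check attribute (iv) Finiteness against attribute (v) Verb form / Mood.

    finiteness: 0 not applicable, 1 Finite, 2 Non-finite.
    verb_form:  0 not applicable, 1-8 as listed under attribute (v).
    """
    if finiteness not in (0, 1, 2):
        raise ValueError(f"invalid finiteness value: {finiteness}")
    if finiteness == 0 or verb_form == 0:
        return True  # nothing to compare when either attribute is unset
    if finiteness == 1:
        return verb_form in FINITE_FORMS
    return verb_form in NON_FINITE_FORMS


print(is_consistent(1, 2))  # True: Finite + Subjunctive
print(is_consistent(1, 6))  # False: Finite + Participle
```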

Attribute (vii) Voice refers to the morphologically-encoded passive, e.g. in Danish and in Greek. Where the passive is realised by more than one verb, this does not need to be represented in the tagset.

The same applies to compound tenses (attribute (vi)). In general, compound tenses are not dealt with at the morphosyntactic level, since they involve the combination of more than one verb in a larger construction.

3. Adjectives (AJ)

[Generic optional attributes Language specific optional attributes]

(i) Degree: 1. Positive 2. Comparative 3. Superlative  
(ii) Gender: 1. Masculine 2. Feminine 3. Neuter  
(iii) Number: 1. Singular 2. Plural   
(iv) Case: 1. Nominative 2. Genitive 3. Dative 4. Accusative

Attribute (i) Degree applies only to inflectional comparatives and superlatives. In some languages, e.g. Spanish, the number of such adjectives is very small.

4. Pronouns and Determiners (PD)

[Generic optional attributes Language specific optional attributes]

(i) Person: 1. First 2. Second 3. Third  
(ii) Gender: 1. Masculine 2. Feminine 3. Neuter  
(iii) Number: 1. Singular 2. Plural   
(iv) Possessive: 1. Singular 2. Plural   
(v) Case: 1. Nominative 2. Genitive 3. Dative 4. Accusative
    5. Non-genitive 6. Oblique   
(vi) Category: 1. Pronoun 2. Determiner 3. Both  
(vii) Pron.-Type: 1. Demonstrative 2. Indefinite 3. Possessive 4. Int./Rel.
    5. Pers./Refl.    
(viii) Det.-Type: 1. Demonstrative 2. Indefinite 3. Possessive 4. Int./Rel.
    5. Partitive

The parts of speech Pronoun, Determiner and Article heavily overlap in their formal and functional characteristics, and different analyses for different languages entail separating them out in different ways. For the present purpose, we have proposed placing Pronouns and Determiners in one 'super-category', recognising that for some descriptions it may be thought best to treat them as totally different parts of speech.

There is also an argument for subsuming Articles under Determiners. The present guidelines do not prevent such a realignment of categories, but do propose that articles (assuming they exist in a language) should always be recognised as a separate class, whether or not included within determiners. The requirement is that the descriptive scheme adopted should be automatically mappable into the present one via an Intermediate Tagset.

Attribute (iv) accounts for the fact that a possessive pronoun or possessive determiner may have two different numbers. This attribute handles the number which is inherent to the possessive form (e.g. Italian (la) mia, (la) nostra as first-person singular and first-person plural) as contrasted with the number it has by virtue of agreeing with a particular noun (e.g. Italian (la) mia, (le) mie).
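
The two kinds of number can be kept apart by recording them in separate fields. The sketch below uses the Italian forms cited above; the PossessiveTag structure and the differing_fields helper are hypothetical names introduced only for illustration, and only three of the eight attributes are modelled.

```python
from typing import NamedTuple


class PossessiveTag(NamedTuple):
    person: int      # attribute (i): 1 First, 2 Second, 3 Third
    number: int      # attribute (iii): agreement with the noun, 1 Singular, 2 Plural
    possessive: int  # attribute (iv): number inherent to the form, 1 Singular, 2 Plural


ITALIAN_EXAMPLES = {
    "mia": PossessiveTag(person=1, number=1, possessive=1),
    "nostra": PossessiveTag(person=1, number=1, possessive=2),
    "mie": PossessiveTag(person=1, number=2, possessive=1),
}


def differing_fields(a, b):
    """Return the names of the fields on which two tags differ."""
    return [f for f in PossessiveTag._fields if getattr(a, f) != getattr(b, f)]


print(differing_fields(ITALIAN_EXAMPLES["mia"], ITALIAN_EXAMPLES["nostra"]))  # ['possessive']
print(differing_fields(ITALIAN_EXAMPLES["mia"], ITALIAN_EXAMPLES["mie"]))     # ['number']
```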

Under attribute (v) Case, the value Oblique applies to pronouns such as them and me in English, and equivalent pronouns such as dem and mig in Danish. These occur in object function, and also after prepositions.

Under attributes (vii) and (viii), the subcategories Interrogative and Relative are merged into a single value Int./Rel. It is often difficult to distinguish these in automatic tagging, but they may be optionally distinguished at a more delicate level of granularity.

Similarly, under attribute (vii), Personal and Reflexive pronouns are brought together as a single value Pers./Refl. Again, they may be optionally separated at a more delicate level.

5. Articles (AT)

[Language specific optional attributes]

(i) Article-Type: 1. Definite 2. Indefinite   
(ii) Gender: 1. Masculine 2. Feminine 3. Neuter  
(iii) Number: 1. Singular 2. Plural   
(iv) Case: 1. Nominative 2. Genitive 3. Dative 4. Accusative

6. Adverbs (AV)

[Generic optional attributes Language specific optional attributes]

(i) Degree: 1. Positive 2. Comparative 3. Superlative

There are many possible subdivisions of adverbs on syntactic and semantic grounds, but these are regarded as optional rather than recommended.

7. Adpositions (AP)

[Generic optional attributes Language specific optional attributes]

(i) Type: 1. Preposition

In practice, the overwhelming majority of adpositions we have to consider in European languages are prepositions. Hence only this one value needs to be recognised at the recommended level. Other possibilities, such as Postpositions and Circumpositions, are dealt with at the optional level.

8. Conjunctions (C)

[Generic optional attributes Language specific optional attributes]

(i) Type: 1. Coordinating 2. Subordinating

9. Numerals (NU)

(i) Type: 1. Cardinal 2. Ordinal   
(ii) Gender: 1. Masculine 2. Feminine 3. Neuter  
(iii) Number: 1. Singular 2. Plural   
(iv) Case: 1. Nominative 2. Genitive 3. Dative 4. Accusative
(v) Function: 1. Pronoun 2. Determiner 3. Adjective  

In some languages (e.g. Portuguese) this category is not normally considered to be a separate part of speech, because it can be subsumed under others (e.g. cardinal numerals behave like pronouns/determiners; ordinal numerals behave more like adjectives). We recognise that in some tagsets Numeral may therefore occur as a subcategory within other parts of speech. (Compare the treatment of articles under 5 above.) At the same time, it is possible to indicate the part-of-speech function of a word within the numeral category by making use of attribute (v).

10. Interjections (I)

No subcategories are recommended.

11. Unique/Unassigned (U)

[Explanation Language specific optional attributes]

No subcategories are recommended, although it is expected that tagsets for individual languages will need to identify such one-member word-classes as Negative particle, Existential particle, Infinitive marker, etc.

12. Residual (R)

[Explanation]

(i) Type: 1. Foreign word 2. Formula 3. Symbol 4. Acronym 5. Abbreviation
    6. Unclassified     
(ii) Number: 1. Singular 2. Plural    
(iii) Gender: 1. Masculine 2. Feminine 3. Neuter   

The Unclassified category applies to word-like text segments which do not easily fit into any of the foregoing values. For example: incomplete words and pause fillers such as er and erm in transcriptions of speech, or written representations of singing such as dum-de-dum.

Although words in the Residual category are on the periphery of the lexicon, they may take some of the grammatical characteristics, e.g., of nouns. Acronyms such as IBM are similar to proper nouns; symbols such as alphabetic characters can vary for singular and plural (e.g. How many Ps are there in 'psychopath'?), and are in this respect like common nouns. In some languages (e.g. Portuguese) such symbols also have gender. It is quite reasonable that in some tagging schemes some of these classes of word will be classified under other parts of speech.

13. Punctuation marks (PU)

[Explanation]

Word-external punctuation marks, if treated as words for morphosyntactic tagging, are sometimes assigned a separate tag (in effect, an attribute value) for each main punctuation mark:

(i) 1. Period 2. Comma 3. Question mark ... etc.

An alternative is to group the punctuation marks into positional classes:

(i) 1. Sentence-final 2. Sentence-medial 3. Left-Parenthetical 4. Right-Parenthetical

Under 1 are grouped . ? ! and under 2 are grouped , ; : -- . Under 3 are placed punctuation marks which signal the initiation of a constituent, such as ( and [ , and ¿ in Spanish. Under 4 are grouped punctuation marks which conclude a constituent the opening of which is marked by one of the devices in 3, e.g. ) and ] , and ? in Spanish. We make no recommendation about choosing between these two sets of punctuation values.
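
The positional grouping can be sketched as a lookup table. This is only an illustration under stated assumptions: the names POSITIONAL_CLASS and positional_class are hypothetical, the value 0 for unlisted marks follows the general zero-value convention rather than anything stated for punctuation, and the spanish_closing flag is an invented device for the question mark, which the text places under 1 in general but under 4 when it closes a Spanish ¿ ... ? pair.

```python
# Positional classes: 1 Sentence-final, 2 Sentence-medial,
# 3 Left-Parenthetical, 4 Right-Parenthetical.
POSITIONAL_CLASS = {
    ".": 1, "?": 1, "!": 1,
    ",": 2, ";": 2, ":": 2, "--": 2,
    "(": 3, "[": 3, "¿": 3,
    ")": 4, "]": 4,
}


def positional_class(mark, spanish_closing=False):
    """Return the positional class of a punctuation mark (0 if unlisted).

    spanish_closing is a hypothetical flag: when True, '?' is treated as
    the closing partner of Spanish '¿' and assigned to class 4.
    """
    if mark == "?" and spanish_closing:
        return 4
    return POSITIONAL_CLASS.get(mark, 0)


print(positional_class(";"))                        # 2
print(positional_class("?"))                        # 1
print(positional_class("?", spanish_closing=True))  # 4
```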

