
Xerox HMM-tagger: Noun test

The most frequent error in the standard test concerned common nouns (NN) and proper names (NE).

The table below gives frequency statistics for NN and NE in the standard training and test corpora, broken down by the ambiguity of the word forms with respect to NN and NE.

                                training corpus    test corpus
NN in ambiguity class                14,606           3,249
   unambiguous NN                9,861 (67.5 %)   2,240 (69.0 %)
   tagged as NN                                   2,835 (87.3 %)
      incorrect                                      48 (1.7 %)
         instead of NE                               39 (81.3 %)
NE in ambiguity class                 3,621             706
   unambiguous NE                1,654 (45.7 %)     320 (45.3 %)
   tagged as NE                                     603 (85.4 %)
      incorrect                                      64 (10.6 %)
         instead of NN                               55 (85.9 %)
NN/NE in ambiguity class              1,912             372
   tagged as NN                                      94 (25.3 %)
      incorrect                                      29 (30.9 %)
         instead of NE                               27 (93.1 %)
   tagged as NE                                     273 (73.4 %)
      incorrect                                      55 (20.2 %)
         instead of NN                               51 (92.7 %)
   tagged as NN or NE                               367 (98.7 %)
      incorrect                                      84 (22.9 %)
         inverted NN/NE                              78 (92.9 %)
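
To make the bookkeeping behind this table concrete, the following sketch shows how such counts can be collected. The tuple format (word, gold tag, tagger tag, ambiguity class) and the function noun_statistics are illustrative assumptions on our part, not the actual evaluation setup.

    from collections import Counter

    def noun_statistics(corpus, tag="NN", rival="NE"):
        """Collect per-class counts of the kind shown in the table above.

        `corpus` is assumed to be a list of (word, gold, guess, ambclass)
        tuples, where `ambclass` is the set of tags the lexicon licenses
        for the word form (a hypothetical representation).
        """
        in_class    = [t for t in corpus if tag in t[3]]
        unambiguous = [t for t in in_class if len(t[3]) == 1]
        tagged      = [t for t in in_class if t[2] == tag]
        incorrect   = [t for t in tagged if t[1] != tag]
        instead     = Counter(t[1] for t in incorrect)  # what the tag should have been
        return {
            "in ambiguity class": len(in_class),
            "unambiguous":        len(unambiguous),
            "tagged as " + tag:   len(tagged),
            "incorrect":          len(incorrect),
            "instead of " + rival: instead[rival],
        }

Calling the function once with tag="NN" and once with tag="NE" reproduces the two halves of the table.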

These figures show that the overall error rate for common nouns (NN) is less than 2 %, whereas the automatically assigned proper name tag (NE) is wrong in more than 10 % of all cases.

For both common nouns and proper names, the most frequent error is a confusion of the tags NE and NN (due to disambiguation errors as well as lexical errors). We should therefore expect an improvement in tagger accuracy if we merge NE and NN and use a single tag NOUN instead, as sketched below.
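
A minimal sketch of this merge, assuming the corpus is held as (word, tag) pairs; the mapping table and the function name are ours:

    # Collapse the two noun tags into one class; all other tags pass through.
    MERGE = {"NN": "NOUN", "NE": "NOUN"}

    def merge_noun_tags(tagged_corpus, mapping=MERGE):
        """Rewrite tags so that NN and NE fall together as NOUN.

        Applying the same mapping to the lexicon shrinks every ambiguity
        class that contained both NN and NE by one tag, which is what
        lowers the corpus ambiguity rate reported below.
        """
        return [(word, mapping.get(tag, tag)) for word, tag in tagged_corpus]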

Test: merge NE and NN into a single class (NOUN)

Corpus statistics
                        training corpus    test corpus
Tokens                       62,860           13,416
Tags                             50               45
Lexical gaps                  1,756              283
Lexicon errors                  355               49
Ambiguity classes               242              181
Ambiguity rate                 1.66             1.65
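
The ambiguity rate is the average number of lexically licensed tags per running token. A sketch of this computation follows; how lexical gaps were actually weighted is not stated here, so the one-tag placeholder for unknown word forms is an assumption:

    def ambiguity_rate(tokens, lexicon):
        """Average size of the ambiguity class over all running tokens.

        `lexicon` maps a word form to the set of tags it licenses;
        word forms missing from the lexicon (lexical gaps) are counted
        with a one-tag placeholder class here, which is an assumption.
        """
        sizes = [len(lexicon.get(w, {"UNKNOWN"})) for w in tokens]
        return sum(sizes) / len(sizes)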

Error statistics
ambiguity    tokens    in %    correct    in %    LE    in %    DE    in %
1              8131    60.6       8110    99.7    21     0.3     -      -
2              2530    18.9       2385    94.3    12     0.5   133    5.3
3              2192    16.3       2119    96.7    13     0.6    60    2.7
4               493     3.7        449    91.1     2     0.4    42    8.5
5                62     0.5         59    95.2     1     1.6     2    3.2
6                 8     0.1          7    87.5     0      -      1   12.5
total         13416   100.0      13129    97.9    49     0.4   238    1.8
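
The LE/DE split can be read as follows: if the correct tag is missing from the word's ambiguity class, no tagger could have chosen it, so the error is charged to the lexicon (LE); if the correct tag was available but not chosen, it is a disambiguation error (DE). A sketch over the same assumed tuple format as above:

    from collections import Counter, defaultdict

    def error_breakdown(corpus):
        """One row of tokens / correct / LE / DE counts per ambiguity-class size."""
        rows = defaultdict(Counter)
        for word, gold, guess, amb in corpus:
            row = rows[len(amb)]
            row["tokens"] += 1
            if guess == gold:
                row["correct"] += 1
            elif gold not in amb:
                row["LE"] += 1  # gold tag not in the ambiguity class: lexicon error
            else:
                row["DE"] += 1  # gold tag available but not chosen: disambiguation error
        return dict(rows)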

Most frequent errors (by word form)
number    word       correct tag    tagger tag
     9    Osthold    NOUN           ADJD
     6    das        PDS            ART
     5    werden     VAFIN          VAINF
     4    dem        PRELS          ART
     4    Um         KOUI           APPR
     4    Reich      NOUN           ADJD

Most frequent errors (by tags)
number    correct tag    tagger tag
    30    VVFIN          VVINF
    16    VVFIN          VVPP
    15    NOUN           ADJD
    14    NOUN           ADV
    12    NOUN           ADJA
    12    KON            ADV
    11    ADJD           VVPP
    11    ADJD           ADV
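
Both error tables are plain frequency counts over the mismatching tokens; a sketch, again over the assumed tuple format:

    from collections import Counter

    def frequent_errors(corpus, n=8):
        """Most frequent errors by word form and by (correct tag, tagger tag) pair."""
        wrong   = [t for t in corpus if t[2] != t[1]]
        by_word = Counter((w, gold, guess) for w, gold, guess, _ in wrong)
        by_tags = Counter((gold, guess) for _, gold, guess, _ in wrong)
        return by_word.most_common(n), by_tags.most_common(n)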

A comparison of the statistical results displayed here should be based on the results reported in section 6.1.2, where we gave the figures for the standard setup with the Xerox HMM tagger. Here we give the figures for the same tagger in a setup where common nouns and proper names share a common ``noun'' class and are thus not tagged differently.

As far as the corpus and lexicon characteristics are concerned, the following changes are evident: the corpus ambiguity rate is reduced for both corpora (training corpus: from 1.69 in the standard test to 1.66 in the noun test; test corpus: from 1.67 to 1.65). Accordingly, the overall correctness rate increases from 97.3 % to 97.9 %. This result is expected: in section 6.1.2, disambiguation errors at the level of NN vs. NE clearly made up a considerable part of the tagging errors.

Other disambiguation errors now rank highest in frequency, in this case the ambiguities between finite and infinitive verb forms and between finite verbs and participles. However, the problems encountered in the tagging of noun candidates are still not solved completely. The table of error frequencies by tags shows confusion pairs of the type NOUN vs. ADJD, NOUN vs. ADV, and NOUN vs. ADJA. Remember that, for example, the form Osthold led to NE vs. ADJD errors in the test displayed in section 6.1.2.

According to our calculations, merging the NN and NE classes nevertheless reduces the number of errors caused by NN vs. NE confusion to 46 % of the original figure.

                                training corpus    test corpus
NOUN in ambiguity class              16,320           3,583
   unambiguous                  12,174 (74.6 %)   2,713 (75.7 %)
   tagged as NOUN                                 3,436 (95.9 %)
      incorrect                                      25 (0.7 %)


