Next: Interpretation Up: Tagger evaluation Previous: TreeTagger: Standard test

Xerox HMM-tagger: Standard test

 

To compare the effects of using different lexicons for training and testing, we chose two test setups for the Xerox HMM-tagger: one with the regular lexicon only, and a second with the extended lexicon, which is the same lexicon used in the test of the TreeTagger.

(1) Regular training lexicon

Corpus statistics
                    training corpus   test corpus
Tokens                        62860         13416
Tags                             51            46
Lexicon gaps                   1756           283
Lexical errors                  543            65
Ambiguity classes               263           196
Ambiguity rate                 1.69          1.67

Error statistics
ambiguity   tokens   in %   correct   in %   LE   in %    DE   in %
1             7978   59.5      7942   99.6   36    0.5     -      -
2             2663   19.9      2482   93.2   13    0.5   168    6.3
3             2078   15.5      2014   96.9    8    0.4    56    2.7
4              589    4.4       518   88.0    7    1.2    64   10.9
5               81    0.6        71   87.9    1    1.2     9   11.1
6               19    0.1        16   84.2    0      -     3   15.8
7                8    0.1         7   87.5    0      -     1   12.5
total        13416  100.0     13050   97.3   65    0.5   301    2.2
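The totals row and the test-corpus ambiguity rate can be cross-checked against the per-class counts. The following is a minimal sketch, assuming that LE and DE stand for lexical errors and disambiguation errors and that the ambiguity rate is the token-weighted mean size of the ambiguity class; the function names are ours and purely illustrative.

```python
# Per-class counts from the error statistics table (regular training lexicon).
# Key: ambiguity class size; value: (tokens, correct, LE, DE).
ROWS = {
    1: (7978, 7942, 36, 0),
    2: (2663, 2482, 13, 168),
    3: (2078, 2014, 8, 56),
    4: (589, 518, 7, 64),
    5: (81, 71, 1, 9),
    6: (19, 16, 0, 3),
    7: (8, 7, 0, 1),
}


def totals(rows):
    """Sum each count column over all ambiguity classes."""
    tokens = sum(r[0] for r in rows.values())
    correct = sum(r[1] for r in rows.values())
    le = sum(r[2] for r in rows.values())
    de = sum(r[3] for r in rows.values())
    return tokens, correct, le, de


def percent(part, whole):
    """Share of `part` in `whole`, in percent, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


def ambiguity_rate(rows):
    """Token-weighted mean ambiguity class size (assumed definition)."""
    tokens = sum(r[0] for r in rows.values())
    weighted = sum(size * r[0] for size, r in rows.items())
    return round(weighted / tokens, 2)


if __name__ == "__main__":
    tokens, correct, le, de = totals(ROWS)
    print(tokens, correct, le, de)
    print(percent(correct, tokens), percent(le, tokens), percent(de, tokens))
    print(ambiguity_rate(ROWS))
```

Under these assumptions the sums reproduce the totals row (13416 tokens, 13050 correct, 65 LE, 301 DE), and the weighted mean agrees with the ambiguity rate of 1.67 given for the test corpus in the corpus statistics.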

Most frequent errors (by word form)
number   word          correct tag   tagger tag
13       DM            NN            NE
9        Osthold       NE            ADJD
6        das           PDS           ART
5        werden        VAFIN         VAINF
5        Reich         NE            NN
4        haben         VAFIN         VAINF
4        dem           PRELS         ART
4        Um            KOUI          APPR
4        Deutschland   NE            NN

Most frequent errors (by tags)
number   correct tag   tagger tag
55       NN            NE
39       NE            NN
28       VVFIN         VVINF
16       VVFIN         VVPP
12       KON           ADV
11       ADJD          VVPP
11       ADJD          ADV
10       VVINF         VVFIN
10       NE            ADJD
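Error lists of this kind can be tallied from token-level tagger output. The sketch below is one possible way to do it, not the procedure used for this evaluation; the triple format (word, correct tag, tagger tag) and the sample data are hypothetical.

```python
from collections import Counter


def errors_by_tags(tokens):
    """Count (correct tag, tagger tag) pairs over mistagged tokens only."""
    return Counter((gold, pred) for _word, gold, pred in tokens if gold != pred)


def errors_by_word(tokens):
    """Count (word, correct tag, tagger tag) triples over mistagged tokens only."""
    return Counter((w, gold, pred) for w, gold, pred in tokens if gold != pred)


# Hypothetical mini-sample for illustration; not taken from the test corpus.
SAMPLE = [
    ("DM", "NN", "NE"),
    ("das", "PDS", "ART"),
    ("DM", "NN", "NE"),
    ("Reich", "NE", "NN"),
    ("das", "ART", "ART"),  # correctly tagged, so not counted
    ("Mark", "NN", "NE"),
]

if __name__ == "__main__":
    for (gold, pred), n in errors_by_tags(SAMPLE).most_common():
        print(n, gold, pred)
    for (word, gold, pred), n in errors_by_word(SAMPLE).most_common():
        print(n, word, gold, pred)
```

Counting by tag pair merges errors across word forms, which is why a pair such as NN/NE can have a much higher count than any single word form contributing to it.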

(2) Extended training lexicon

Although the Xerox HMM tagger is based on the same basic lexicon as the TreeTagger (see 6.1.1), the ambiguity class of a given word form is not always the same for both taggers, contrary to what one might expect.

This difference arises because the lexicon used internally by the TreeTagger omits marginal (i.e. very rare) readings of word forms and thus reduces the ambiguity classes of lexicalized word forms. For non-lexicalized word forms, however, the TreeTagger uses ambiguity classes containing up to 10 tags, whereas the largest ambiguity class of the Xerox HMM tagger as tested contains 7 tags.
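The effect of omitting rare readings on an ambiguity class can be illustrated as follows. This is a minimal sketch under our own assumptions: the relative-frequency threshold `min_share` and the reading counts are hypothetical, and the actual criterion applied inside the TreeTagger is not specified here.

```python
def ambiguity_class(readings, min_share=0.0):
    """Return the tags of a word form whose relative frequency is at least
    `min_share`, as a sorted tuple. With min_share=0.0 no reading is dropped."""
    total = sum(readings.values())
    if total == 0:
        return ()
    return tuple(sorted(tag for tag, n in readings.items() if n / total >= min_share))


# Hypothetical reading counts for a single word form (illustration only).
READINGS = {"ART": 900, "PDS": 80, "PRELS": 20}

if __name__ == "__main__":
    print(ambiguity_class(READINGS))                  # all readings kept
    print(ambiguity_class(READINGS, min_share=0.05))  # rare reading dropped
```

With the hypothetical counts above, a 5 % threshold removes the PRELS reading, so the same word form falls into a class of size 2 in the pruned lexicon and of size 3 in the full one.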

Corpus statistics
                    training corpus   test corpus
Tokens                        62860         13416
Tags                             51            46
Lexicon gaps                      0           241
Lexical errors                    0            49
Ambiguity classes               275           205
Ambiguity rate                 1.64          1.69

Error statistics
ambiguity   tokens   in %   correct   in %   LE   in %    DE   in %
1             7962   59.4      7932   99.6   30    0.4     -      -
2             2660   19.8      2481   93.3   11    0.4   168    6.3
3             2082   15.5      2019   97.0    5    0.2    58    2.8
4              480    3.6       399   83.1    3    0.6    78   16.3
5              178    1.3       161   90.5    0      -    17    9.5
6               47    0.4        38   80.9    0      -     9   19.1
7                7    0.1         5   71.4    0      -     2   28.6
total        13416  100.0     13035   97.2   49    0.4   332    2.5
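The totals rows of the two setups can be compared directly. The sketch below only does arithmetic on the figures reported in the two error statistics tables; the setup labels and the helper name `delta` are ours.

```python
# Totals rows of the two error statistics tables (test corpus, 13416 tokens).
SETUPS = {
    "regular": {"tokens": 13416, "correct": 13035 + 15, "LE": 65, "DE": 301},
    "extended": {"tokens": 13416, "correct": 13035, "LE": 49, "DE": 332},
}


def delta(metric):
    """Change in a count when moving from the regular to the extended lexicon."""
    return SETUPS["extended"][metric] - SETUPS["regular"][metric]


if __name__ == "__main__":
    for metric in ("correct", "LE", "DE"):
        print(f"{metric:8s} {delta(metric):+d}")
```

On these figures the extended lexicon removes 16 LE but adds 31 DE, so the number of correctly tagged tokens drops by 15 (13050 versus 13035).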

Most frequent errors (by word form)
number   word          correct tag   tagger tag
13       DM            NN            NE
9        Osthold       NE            ADJD
6        das           PDS           ART
5        werden        VAFIN         VAINF
5        Reich         NE            NN
4        haben         VAFIN         VAINF
4        dem           PRELS         ART
4        Um            KOUI          APPR
4        Deutschland   NE            NN
4        Asher         NE            ADJA

Most frequent errors (by tags)
number   correct tag   tagger tag
42       NE            NN
41       NN            NE
28       VVFIN         VVINF
20       NE            ADV
18       NE            ADJD
15       VVFIN         VVPP
12       KON           ADV
12       ADJD          ADV
11       ADJD          VVPP


