
Error statistics

The following table shows, for each tagger model, the absolute number of errors (including lexical errors) and the overall accuracy rate on each test corpus.

Tests are run on every text file, including the training files. For each HMM model, the figures of interest are therefore those for texts that were not part of its training corpus.

                      HMM-1 (mix)        HMM-2 (news)       HMM-3 (tale)
Corpus      Type      errors  accuracy   errors  accuracy   errors  accuracy
mix1        mix          136  98.12 %       215  97.02 %       264  96.34 %
mix2        mix          318  98.30 %       408  97.82 %       555  97.04 %
europa      mix          108  98.52 %       155  97.87 %       202  97.23 %
spiegel     news         305  97.40 %       217  98.15 %       346  97.05 %
taz         news         181  97.61 %       142  98.12 %       246  96.75 %
welt        news         119  98.04 %        90  98.52 %       156  97.43 %
spektrum    news         152  97.33 %       166  97.09 %       188  96.70 %
andersen    tale         251  97.87 %       304  97.42 %       163  98.62 %
bechstein   tale         161  97.85 %       163  97.82 %        94  98.74 %
grimm       tale         150  97.94 %       154  97.89 %       134  98.16 %
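Since accuracy is defined as 1 - errors/tokens, the approximate size of each test file can be recovered from the figures in the table. A minimal sketch (the error counts and accuracy rates are taken from the table above; the recovered token counts are estimates, since the published accuracy rates are rounded):

```python
# Recover approximate token counts from error counts and accuracy rates.
# accuracy = 1 - errors / tokens  =>  tokens = errors / (1 - accuracy)

def approx_tokens(errors, accuracy_percent):
    """Estimate the number of tokens in a test file from its error
    count and its (rounded) accuracy rate in percent."""
    error_rate = 1.0 - accuracy_percent / 100.0
    return round(errors / error_rate)

# HMM-1 results for the three mix corpora (errors, accuracy in %):
for name, errors, acc in [("mix1", 136, 98.12),
                          ("mix2", 318, 98.30),
                          ("europa", 108, 98.52)]:
    print(name, approx_tokens(errors, acc))
```

Because the accuracy rates are rounded to two decimals, the recovered sizes are only accurate to within a few percent; they are nevertheless useful for judging how many tokens each error count refers to.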

With the chosen training and test corpora, the following tendencies can be observed; they would have to be confirmed or refined by further testing on other text genres (technical manuals, instructions, colloquial language, ...).

On the whole, the three configurations (training on sets TRAIN1, TRAIN2, TRAIN3) do not produce large differences in tagging performance on the test material.

This may be because the differences we assumed between the chosen text types are in fact not very pronounced, so that the statistical models trained on the three sets do not differ much. Tests with other text types (e.g. technical-manual style as opposed to newspaper style) might show clearer effects.

The impact of phenomena specific to a given text type could be tested separately, by means of training texts particularly geared to this question (and monitored accordingly).

In general, however, the test setup seems well suited to further experiments on the impact of text types. Future tests should be based on a qualitative and quantitative description of the perceived differences between the test and training texts. This could, at least in part, be obtained with corpus query tools: quantitative profiles of certain constructions in the test and training material could be established before the experiment is run, and the tagger's results on these phenomena checked afterwards.
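Such a profile comparison can be sketched as follows. This is an illustrative example, not the tool used in this study: the "constructions" are approximated by hypothetical marker tokens (here, German subordinating conjunctions), whereas a real profile would be built with a corpus query tool over annotated text.

```python
# Sketch: compare quantitative profiles of a training and a test text
# before running the tagger. Constructions are approximated here by
# simple marker tokens; a real study would query annotated corpora.
from collections import Counter

def profile(tokens, markers):
    """Relative frequency of each marker token in a token list."""
    counts = Counter(t for t in tokens if t in markers)
    total = len(tokens)
    return {m: counts[m] / total for m in markers}

# Hypothetical construction markers: subordinating conjunctions.
markers = {"dass", "weil", "ob"}

train = "er sagt dass es regnet weil es kalt ist".split()
test = "sie fragt ob es regnet und ob es kalt ist".split()

print(profile(train, markers))
print(profile(test, markers))
```

Comparing the two profiles before the experiment shows which constructions are over- or under-represented in the test material relative to the training material; the tagger's error rate on exactly those constructions can then be inspected separately.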


