
Corpus Statistics

 

The corpus statistics depend only on the test corpus and on the lexicon, and are therefore the same for all taggers in our tests.

The lexicon, however, may be considered part of the tagger method, e.g. if lexicon modifications (such as omitting "marginal" readings of word forms) are part of the disambiguation method. In that case, the lexicon-dependent corpus statistics may differ even if the tested taggers rely on identical training (and testing) lexicons. Also, statistics concerning unknown word forms depend mainly on the tagger method and are thus not identical for the tested tagger methods.

In general, none of the figures given below depends on the disambiguation result, i.e. they can all be computed before automatic tagging is applied, by associating the manually tagged test corpus with the lexical information and with the tags for non-lexicalised word forms.
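As an illustration of why no tagger output is needed, the following minimal Python sketch computes a few such figures from the manually tagged corpus and the lexicon alone. The function name, the data layout (a list of word/tag pairs, a dictionary from word forms to tag sets) and the particular counts chosen are assumptions made for this example, not the definitions used in the tests; in particular, a single fixed tag set for non-lexicalised word forms is a simplification.

```python
from collections import namedtuple

# Hypothetical container for the illustrative counts.
CorpusStats = namedtuple(
    "CorpusStats", "tokens unknown ambiguous gold_covered mean_tags"
)


def corpus_statistics(tagged_corpus, lexicon, unknown_tags):
    """Compute tagger-independent counts (illustrative sketch only).

    tagged_corpus: list of (word_form, manual_tag) pairs
    lexicon:       dict mapping word_form -> set of possible tags
    unknown_tags:  set of tags assumed for non-lexicalised word forms
    """
    tokens = unknown = ambiguous = gold_covered = tag_total = 0
    for word, gold in tagged_corpus:
        tokens += 1
        if word in lexicon:
            candidates = lexicon[word]
        else:
            # Non-lexicalised word form: fall back to the assumed tag set.
            unknown += 1
            candidates = unknown_tags
        if len(candidates) > 1:
            ambiguous += 1
        if gold in candidates:
            # The manually assigned tag is among the candidate readings.
            gold_covered += 1
        tag_total += len(candidates)
    mean_tags = tag_total / tokens if tokens else 0.0
    return CorpusStats(tokens, unknown, ambiguous, gold_covered, mean_tags)


# Toy data, invented for the example.
corpus = [("the", "DET"), ("can", "NOUN"), ("rusts", "VERB"), ("blorp", "NOUN")]
lexicon = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"VERB", "NOUN"},
}
stats = corpus_statistics(corpus, lexicon, {"NOUN", "VERB", "ADJ"})
```

If the lexicon is modified as part of a tagger method, or if the tags for non-lexicalised word forms are supplied by the tagger, the `lexicon` and `unknown_tags` arguments change accordingly, which is how the resulting figures can differ between tagger methods.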