Corpus Statistics

The corpus statistics depend only on the test corpus and on the lexicon, and are therefore the same for all taggers in our tests.

The lexicon, however, may be considered part of the tagger method, e.g. if lexicon modifications (such as omitting "marginal" readings of word forms) are part of the disambiguation method. In that case, the lexicon-dependent corpus statistics may differ even if the tested taggers rely on identical training (and testing) lexicons. Also, statistics concerning unknown word forms depend mainly on the tagger method and are thus not identical across the tested taggers.

In general, none of the figures given below depends on the disambiguation result, i.e. they can be computed before automatic tagging is applied, by associating the manually tagged test corpus with the lexical information and the tags for non-lexicalised word forms.

Number of tokens in test corpus;
Number of tags in (manually tagged) test corpus;
Number of lexicon gaps (= non-lexicalised word forms);
Number of lexical errors or defective word forms, i.e. word forms for which the lexicon (or the guesser method) does not include the tag which is given in the manually tagged text;
Number of ambiguity classes in the corpus;
Ambiguity rate of the corpus word forms.
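
As an illustration, the figures listed above could be computed along the following lines. This is only a minimal sketch under several assumptions that the text does not fix: the function name `corpus_statistics`, the data layout (a list of word form/tag pairs, a lexicon as a dictionary of tag sets) and the `guess` function for non-lexicalised word forms are hypothetical. All counts are taken per token, the number of tags is read as the number of distinct tags in the manual annotation, and the ambiguity rate is taken to be the mean number of readings per token.

```python
def corpus_statistics(corpus, lexicon, guess):
    """Compute tagger-independent corpus statistics (illustrative sketch).

    corpus  -- list of (word_form, manual_tag) pairs (assumed layout)
    lexicon -- dict mapping a word form to its set of possible tags
    guess   -- hypothetical guesser: word form -> set of tags, used
               only for word forms missing from the lexicon
    """
    tags = set()             # distinct tags in the manual annotation
    classes = set()          # distinct ambiguity classes (tag sets)
    gaps = 0                 # tokens whose word form is not in the lexicon
    lexical_errors = 0       # tokens whose manual tag is not among the readings
    total_readings = 0       # sum of readings over all tokens

    for word, manual_tag in corpus:
        tags.add(manual_tag)
        if word in lexicon:
            readings = frozenset(lexicon[word])
        else:
            gaps += 1
            readings = frozenset(guess(word))
        classes.add(readings)
        total_readings += len(readings)
        if manual_tag not in readings:
            lexical_errors += 1

    n = len(corpus)
    return {
        "tokens": n,
        "tags": len(tags),
        "lexicon_gaps": gaps,
        "lexical_errors": lexical_errors,
        "ambiguity_classes": len(classes),
        # Assumed definition: mean number of readings per token.
        "ambiguity_rate": total_readings / n if n else 0.0,
    }


# Toy data, invented purely for illustration.
corpus = [("the", "DET"), ("can", "NOUN"), ("can", "VERB"),
          ("rust", "VERB"), ("blorp", "NOUN")]
lexicon = {"the": {"DET"}, "can": {"NOUN", "VERB"}, "rust": {"NOUN"}}
stats = corpus_statistics(corpus, lexicon, lambda word: {"NOUN", "VERB"})
```

On this toy input, "blorp" is the only lexicon gap and "rust" tagged as a verb is the only lexical error, since the lexicon lists it as a noun only. Counting per word-form type rather than per token would be an equally plausible reading of the list and would need only a change to the loop.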