Methodology: Statistics underlying the assessment

Tagger accuracy is influenced by, among other factors, lexicon accuracy and the lexical ambiguity of words. We need a uniform test environment for the different tagging methods; therefore, to control for lexicon effects, we use identical (or at least comparable) lexicons for all taggers. Comparable test conditions can be set up by using the same lexicon (apart from differences in internal format), the same tagset and the same test corpora for all taggers.

The taggers to be evaluated are all trained on the same manually tagged corpus. The list of annotated word forms used as the "tagger lexicon" for each tagger is built from a list of word/tag pairs derived from the word forms contained in the training and test corpora. This full form list is computed by means of a morphological analyser ([Schiller 1995]) and of mapping rules that transform the morphological categories into the appropriate tags. These mapping rules may remove or add readings for specific word forms, according to the guidelines for manual tagging ([Schiller et al 1995]); however, in the process of word list creation, no corpus-dependent modifications of the lexicon are introduced:

The full form list is not restricted to readings that appear in the corpus, i.e. for each word the whole range of possible lexical tags is provided, not only those that occur in the actual training corpus.
Forms that cannot be morphologically analyzed, such as proper names, foreign-language material and other "gaps", are preserved as gaps, i.e. the lexicons for all taggers contain the same information, but we do not ensure full coverage of the texts.
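
The construction described above can be illustrated with a minimal Python sketch. It is not the actual tool chain: the analyser interface (`analyse`), the category names, the mapping table and the toy tags are all hypothetical and chosen only for illustration. The sketch assumes that the analyser returns a list of morphological categories per word form and that unanalysable forms are simply recorded as gaps.

```python
def build_full_form_lexicon(word_forms, analyse, category_to_tags):
    """Return (lexicon, gaps) for the given word forms.

    word_forms       : iterable of word forms from the training and test corpora
    analyse          : hypothetical analyser, maps a form to a list of categories
    category_to_tags : hypothetical mapping rules, category -> iterable of tags
    """
    lexicon = {}
    gaps = set()
    for form in set(word_forms):
        tags = set()
        # Every reading proposed by the analyser is kept, whether or not
        # it occurs in the corpus: corpus tags are never consulted here.
        for category in analyse(form):
            tags.update(category_to_tags.get(category, ()))
        if tags:
            lexicon[form] = frozenset(tags)
        else:
            # Unanalysable forms stay gaps; they are not filled from the corpus.
            gaps.add(form)
    return lexicon, gaps


# Toy data (invented for this example).
toy_analyses = {
    "the": ["det"],
    "walks": ["noun.pl", "verb.3sg"],
    "dog": ["noun.sg"],
}
toy_mapping = {
    "det": ["DET"],
    "noun.sg": ["NOUN"],
    "noun.pl": ["NOUN"],
    "verb.3sg": ["VERB"],
}


def toy_analyse(form):
    return toy_analyses.get(form, [])


training_forms = ["the", "dog", "walks"]
test_forms = ["the", "Xyzzy", "walks"]

lexicon, gaps = build_full_form_lexicon(
    training_forms + test_forms, toy_analyse, toy_mapping
)
print(sorted(lexicon["walks"]))  # ['NOUN', 'VERB']
print(sorted(gaps))              # ['Xyzzy']
```

Because the function receives only word forms and never the corpus tags, every tagger that is given the resulting lexicon sees the same readings and the same gaps.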

The following definitions will be used for statistics concerning ambiguity of word forms:

Ambiguity class
= set of tags associated with a word form.
Ambiguity type
= number of tags associated with a word form.
Ambiguity rate
= average number of tags per word form in a given corpus.
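
As a hedged illustration of these three definitions, the following Python sketch computes them for a toy lexicon. The lexicon, the tags and the function names are invented for the example. Two points are assumptions rather than part of the definitions: the average is taken over running word forms (tokens) rather than over distinct forms, and forms missing from the lexicon are skipped.

```python
def ambiguity_class(form, lexicon):
    """Set of tags associated with a word form (empty if the form is a gap)."""
    return lexicon.get(form, frozenset())


def ambiguity_type(form, lexicon):
    """Number of tags associated with a word form."""
    return len(ambiguity_class(form, lexicon))


def ambiguity_rate(tokens, lexicon):
    """Average number of tags per word form over the tokens of a corpus.

    Assumption: tokens not covered by the lexicon are left out of the average.
    """
    covered = [t for t in tokens if t in lexicon]
    if not covered:
        return 0.0
    return sum(ambiguity_type(t, lexicon) for t in covered) / len(covered)


# Toy lexicon and corpus (invented for this example).
toy_lexicon = {
    "the": frozenset({"DET"}),
    "dog": frozenset({"NOUN"}),
    "walks": frozenset({"NOUN", "VERB"}),
}
toy_corpus = ["the", "dog", "walks", "the", "walks"]

walks_class = ambiguity_class("walks", toy_lexicon)
walks_type = ambiguity_type("walks", toy_lexicon)
rate = ambiguity_rate(toy_corpus, toy_lexicon)

print(sorted(walks_class))  # ['NOUN', 'VERB']
print(walks_type)           # 2
print(rate)                 # 1.4
```

Averaging over distinct word forms instead would only require passing `set(toy_corpus)` to `ambiguity_rate`; which reading is intended should be fixed before figures from different corpora are compared.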

We will divide the evaluation statistics for a given test corpus into two parts: corpus statistics and error statistics.
