
Recommendations

The size of the corpus and the method of annotation

Another issue is the relation between the size of the corpus and the method of annotation. To be useful, a corpus needs to be very large, since small corpora do not reveal very much. Although the requisite size depends to a certain extent on the intended use, in general the bigger the corpus, the better. Increasing corpus size, however, implies automating the annotation. Manual annotation can be slow, expensive and difficult: it requires a number of expert linguists who must be trained specifically for the annotation task. In theory, the best way to annotate big corpora is to use a parser. This, however, is problematic, since to date there is no reliable parser that can deal with unrestricted text (i.e., a parser that will satisfactorily annotate any text presented to it, be it a newspaper, a scientific journal, a transcribed conversation or a novel).

There are two intermediate solutions. The first is to use an interactive parsing system. Such a system parses sentences automatically as far as it is able to, and asks for intervention from a trained operator when it cannot decide on a parse. The TOSCA system (Oostdijk 1991; van Halteren and Oostdijk 1993) is such a system. The second possibility is to let the parser annotate the whole corpus and to correct its output manually afterwards, if this is deemed necessary and is feasible. Marcus et al. (1993:317) have shown that correcting the Penn Treebank parser's output is significantly faster than manually annotating a corpus: on average, trained operators with a linguistic background needed about 20 minutes to correct 1,000 words, whereas for manual annotation they needed about 44 minutes per 1,000 words. (These figures are useful only for comparison. In absolute terms they do not mean much, as the speed of annotation depends heavily on the complexity of the annotation scheme.)
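To give a feel for what the two rates imply at scale, the following is a minimal sketch that extrapolates them linearly to a whole corpus. The corpus size of one million words, the function names and the assumption that the rates stay constant across text types are all illustrative and not taken from Marcus et al.; as noted above, the absolute figures depend on the annotation scheme, so only the ratio is meaningful.

```python
# Rates in minutes per 1,000 words, as cited from Marcus et al. (1993:317).
CORRECTION_MIN_PER_1000 = 20   # correcting automatic output
MANUAL_MIN_PER_1000 = 44       # annotating from scratch


def annotation_hours(corpus_words: int, minutes_per_1000: float) -> float:
    """Estimate operator hours, assuming the rate scales linearly (an assumption)."""
    return corpus_words / 1000 * minutes_per_1000 / 60


def speedup(manual_rate: float, correction_rate: float) -> float:
    """Ratio of manual annotation time to correction time."""
    return manual_rate / correction_rate


if __name__ == "__main__":
    corpus_words = 1_000_000  # hypothetical corpus size, for illustration only

    manual = annotation_hours(corpus_words, MANUAL_MIN_PER_1000)
    corrected = annotation_hours(corpus_words, CORRECTION_MIN_PER_1000)
    ratio = speedup(MANUAL_MIN_PER_1000, CORRECTION_MIN_PER_1000)

    print(f"Manual annotation:  {manual:.0f} operator hours")
    print(f"Correction only:    {corrected:.0f} operator hours")
    print(f"Speed-up factor:    {ratio:.1f}")
```

Under these assumptions, correction is roughly 2.2 times faster than manual annotation, whatever corpus size is chosen.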


