Task dependency/reusability of resources



Corpora, both raw and annotated, may be produced for a wide variety of uses. A corpus may be intended purely for linguistic research (studying variation in language use across genres or over time, studying the frequency of particular vocabulary or structures, etc.), or for natural language processing (as a test-bed for an automatic parser, as example text for example-based translation, etc.). However, since producing corpora is so expensive in time and effort, the most desirable corpus would be one suited to both ends of the research spectrum. This is not easy in practice: the aims of the two approaches are very different, and this is reflected in the corpus produced and in the annotation applied to it. If a corpus is to be processed automatically, phenomena that are problematic for automatic annotation may be left out, even though these same phenomena may be the more interesting ones from a linguistic point of view (see further on underspecification). With reference more specifically to syntactic annotation, from an NLP perspective the most important (and difficult) task may be the simple grouping of parts of sentences into constituents, while from the linguist's perspective this is the simplest (and mainly intuitive) task.

The production of syntactically annotated corpora is a relatively new phenomenon. As more corpora of this type are created, more potential uses will undoubtedly emerge. When designing an annotated corpus, it is essential to be aware of this variety of possible uses, since the reusability and shareability of such an expensive resource are of great importance. Reusability has recently become a central concern in NLP, and the potential for multiple uses of an annotated corpus should therefore be a primary consideration in the planning of a corpus annotation project.

Collaboration with existing projects

Because of the above-mentioned issues of reusability and shareability of NLP resources, including annotated corpora, we have endeavoured to make the guidelines presented in this document alignable with the annotation schemes used in relevant EC projects currently under way. To achieve compatibility of lexica and corpora, we have attempted to ensure that these guidelines comply with those of the EAGLES Lexicon/Syntax SubGroup (for more details see recommended annotations). The annotation scheme used in the TSNLP Project (Test Suites for Natural Language Processing) has also been considered in the formation of our guidelines.
