Problems and issues

Syntactic annotation is more complex and more variable than morphosyntactic annotation. In morphosyntactic annotation, the units to be labelled are to a large extent defined in advance, since text words are indicated orthographically. In syntactic annotation, we have to determine not only which labels to apply to segments of the text, but also which segments to label, choosing from among many possibilities, and how these segments relate to one another.

Fortunately, there is considerable consensus about some of the syntactic segments which have to be recognised in syntactic annotation, e.g. noun phrases and prepositional phrases; there is less consensus about others, e.g. verb phrases and subordinate clauses. There is also less consensus about how syntactic segments should be defined, as illustrated by the following anecdote (in Sampson 1995: 4). During the annual conference of the Association for Computational Linguistics in 1991, NLP researchers from nine institutions were asked to specify the bracketing of a number of example sentences. One of the examples was (1); the brackets shown in it are the only ones on which the nine researchers could agree.

(1)	He said this constituted a [very serious] misuse [of the [Criminal Court] processes].

Thus, the adjective phrase "very serious", the prepositional phrase "of the Criminal Court processes" and the nominal constituent "Criminal Court" were the only constituents on which consensus could be reached.
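
To make the bracketing in (1) concrete, the following minimal Python sketch reads a sentence marked up with square brackets and lists the word sequences enclosed by each matching pair. The function name `bracketed_constituents` and the whitespace-based tokenisation are assumptions made for illustration only; they are not part of any annotation scheme discussed here, and the sketch handles unlabelled brackets only.

```python
def bracketed_constituents(text):
    """Return (constituents, tokens) for a string with square-bracket markup.

    Illustrative helper (hypothetical name): brackets are unlabelled, and
    tokens are simply whitespace-separated strings.
    """
    # Pad brackets with spaces so that each one splits off as its own piece.
    pieces = text.replace("[", " [ ").replace("]", " ] ").split()

    tokens = []       # the words of the sentence, brackets removed
    open_starts = []  # stack of token positions where a bracket was opened
    spans = []        # (start, end) token positions, end exclusive

    for piece in pieces:
        if piece == "[":
            open_starts.append(len(tokens))
        elif piece == "]":
            if not open_starts:
                raise ValueError("unmatched closing bracket")
            spans.append((open_starts.pop(), len(tokens)))
        else:
            tokens.append(piece)

    if open_starts:
        raise ValueError("unmatched opening bracket")

    # Constituents are listed in the order in which their brackets close.
    return [" ".join(tokens[start:end]) for start, end in spans], tokens


example = ("He said this constituted a [very serious] misuse "
           "[of the [Criminal Court] processes].")
constituents, tokens = bracketed_constituents(example)
for constituent in constituents:
    print(constituent)
```

Applied to (1), the sketch lists the three agreed constituents, with the nested "Criminal Court" reported before the prepositional phrase that contains it, because inner brackets close first.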

Sentences are normally considered to be the maximal units of syntactic analysis or parsing. For our present purpose, the minimal unit is the word. There are therefore two primary points at which syntactic annotation correlates with text: the word and the sentence. In theory, the syntactic definition of these units is independent of the orthographic definition (see Sampson 1995: 153-156). However, in practice the default assumption is that the orthographic word/sentence and the syntactic word/sentence are coextensive.
