Next: Underspecificationambiguity and ambivalence Up: Issues in practical application Previous: Issues in practical application



Phrase structure vs dependency

There are two major varieties of syntactic annotation: phrase structure representation and dependency representation. In general, a phrase structure representation may be found more suitable for languages with rather fixed word order patterns and clear constituency structures. Dependency representations, in contrast, may be found more adequate for languages which allow greater freedom of word order and in which linearisation is controlled more by pragmatic than by syntactic factors. This is the case in Finnish and in the Slavonic languages (some of which, such as Czech and Polish, may be expected to have increasing association with the EU in the future). Less obviously, some Romance languages (Italian and Spanish) may also benefit from a dependency representation. However, this does not mean that languages such as English must be annotated using a phrase structure representation or, conversely, that dependency must be used for languages with greater freedom of word order. Indeed, dependency structures have been successfully applied to English using the English Constraint Grammar (Karlsson et al. 1995) and the Slot Grammar Parser (McCord 1990).

Since the approach to syntactic annotation is to a large extent influenced by the language to be annotated, our guidelines do not give any preference either to a phrase structure annotation or to a dependency annotation. Phrase structure annotation is in certain ways the more demanding of the two, which is why this report covers it in more detail. This should not be construed as expressing a preference for phrase structure annotation.

The two possibilities mentioned here, dependency and simple phrase structure grammar models, are certainly not the only options available for annotating a corpus. Other approaches, such as LFG and complex phrase structure grammar models such as GPSG and HPSG, may be equally successful. However, only phrase structure and dependency grammars are covered here because these two models by now have a certain tradition in corpus annotation and have been used to annotate corpora both manually and automatically. Though it is true that HPSG parsers exist, there are no corpora, as far as we know, annotated using an HPSG formalism, nor are there any existing HPSG parsers robust enough and of sufficiently wide coverage to serve as a basis for corpus annotation.

We will propose notations for both approaches. A typical Phrase Structure tree is shown in 69:



The constituent structure and the labels can also be represented in a labelled bracketed structure, as in 70:

(70)  [NP The big dog NP] [VP chased [NP the cat NP] VP]
We propose that labels be put both at the opening and at the closing brackets. Though this may lead to many labels per constituent, it is nevertheless preferable for purposes of readability in order to distinguish multitudes of brackets. Compare 71, 72 and 73.

(71)  [PP in [NP the heat [PP of [NP the night]]]]
(72)  [in [the heat [of [the night NP] PP] NP] PP]
(73)  [PP in [NP the heat [PP of [NP the night NP] PP] NP] PP]
The labelling as in 71 is used in the Penn Treebank and partially in TOSCA: the constituents are labelled only at the opening brackets. The scheme in 72 is not used, as far as we know. The one in 73 is used in SUSANNE and in the IBM Paris scheme. We propose the use of the double-labelling convention of 73 for the sake of clarity.
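The three labelling conventions differ only in where the category label is written, so all of them can be produced from one tree. The following is a minimal sketch, assuming a tree is stored as nested (label, children) pairs with plain strings for words; the function name `bracket` and the style names are illustrative and not part of any of the schemes mentioned.

```python
# A node is either a word (str) or a pair (label, list of child nodes).
def bracket(node, style="both"):
    """Render a tree as a labelled bracketed string.

    style="open":  label at the opening bracket only (as in 71)
    style="close": label at the closing bracket only (as in 72)
    style="both":  label at both brackets (as in 73)
    """
    if isinstance(node, str):
        return node
    label, children = node
    inner = " ".join(bracket(child, style) for child in children)
    if style == "open":
        return f"[{label} {inner}]"
    if style == "close":
        return f"[{inner} {label}]"
    return f"[{label} {inner} {label}]"


# The phrase "in the heat of the night" used in 71 to 73.
tree = ("PP", ["in",
               ("NP", ["the", "heat",
                       ("PP", ["of",
                               ("NP", ["the", "night"])])])])

print(bracket(tree, "both"))
```

With the double-labelling style, each closing bracket can be matched to its constituent without counting brackets, which is the readability argument made above.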

Dependency trees can be represented with arrows pointing from the head to the dependents or from the dependents to the heads. Of these two conventions, we recommend the use of the latter, as in 74:



Arrows `depart' from dependents and point to their head. Thus, in 74 The and big are dependents of dog. We propose a scheme in which the words of the sentence are in a column, preceded by a column with word numbers. The dependencies can be indicated using these numbers as a reference. Further columns may be added for syntactic labels. Using this scheme, 74 is represented as in 75:

1  The     3
2  big     3
3  dog     4
4  chased
5  the     6
6  cat     4
In this scheme, there is actually no need for arrows, since the dependencies and their direction can be inferred from the numbers in the last column: if the number in the last column is bigger than the word number in the first column, the word's head follows; otherwise it precedes. For example, The is word number 1 with dependency index 3, so its head (dog, the third word) follows. Similarly, cat is word 6 with dependency index 4, so its head precedes. Note that the head of the sentence (chased) has no entry in the last column.
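The rule for reading off the direction of a dependency can be sketched in a few lines. This is an illustration only: the tuple layout (word number, word, head number) and the function names are assumptions, and `None` stands for the empty head entry of the sentence head.

```python
# Each row: (word number, word, number of the head word or None for the sentence head).
rows = [
    (1, "The", 3),
    (2, "big", 3),
    (3, "dog", 4),
    (4, "chased", None),
    (5, "the", 6),
    (6, "cat", 4),
]


def head_direction(number, head):
    """Compare a word's own number with its head number."""
    if head is None:
        return "root"
    return "follows" if head > number else "precedes"


def describe(rows):
    """Return one readable line per word, naming its head and the direction."""
    words = {number: word for number, word, _ in rows}
    lines = []
    for number, word, head in rows:
        direction = head_direction(number, head)
        if direction == "root":
            lines.append(f"{number} {word}: head of the sentence")
        else:
            lines.append(f"{number} {word}: head is {words[head]}, which {direction}")
    return lines


for line in describe(rows):
    print(line)
```

Because the direction is fully determined by the two numbers, an arrow column would add no information that is not already in the table.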

We note further that an alternative dependency representation has been suggested. Since dependency trees are directed acyclic graphs, they can be represented by bracketed expressions, just like constituency trees. The governing term is placed first inside a pair of brackets, followed by all of its dependent terms, each of which is enclosed in brackets of its own. Example 74 can then be represented as in 76 (with part of speech categories added):

(76)  [V chased [N dog [Det the] [Adj big]] [N cat [Det the]]]
This bracketed string can be represented more perspicuously as in 77 (with functional labels added; note that the indented format implies some form of constituency):

[PRED chased
    [SUBJ dog
         [DET the]
         [ATTR big] ]
    [DIROBJ cat
         [DET the] ] ]
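The head-first bracketing can be derived mechanically from the column scheme described earlier. The sketch below assumes rows of the form (word number, word, label, head number), where the label may be a part of speech category or a functional label; the row layout and the name `head_first` are illustrative assumptions, and the result is printed on one line rather than indented.

```python
def head_first(rows):
    """Render a dependency table as a head-first bracketed string.

    rows: (word number, word, label, head number or None for the sentence head).
    Dependents are listed after their head, in order of word number.
    """
    info = {}
    children = {}
    root = None
    for number, word, label, head in rows:
        info[number] = (word, label)
        if head is None:
            root = number
        else:
            children.setdefault(head, []).append(number)

    def render(number):
        word, label = info[number]
        parts = [f"{label} {word}"]
        parts += [render(child) for child in sorted(children.get(number, []))]
        return "[" + " ".join(parts) + "]"

    return render(root)


# "The big dog chased the cat" with functional labels.
rows = [
    (1, "the", "DET", 3),
    (2, "big", "ATTR", 3),
    (3, "dog", "SUBJ", 4),
    (4, "chased", "PRED", None),
    (5, "the", "DET", 6),
    (6, "cat", "DIROBJ", 4),
]

print(head_first(rows))
```

Since the head is always written first, this form records the dependency relations but not the original word order of a head relative to its dependents.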
This method can also be used for annotating the running text, as in 78, with the word tokens in their order of occurrence:

(78)  [PRED [SUBJ [DET the] [ATTR big] dog] chased [DIROBJ [DET the] cat]]
However, this latter method has the disadvantage that it does not work for discontinuous structures. Since representing discontinuous structures is the strength of the vertical representation, the vertical representation is preferred for languages in which discontinuity is a recurrent phenomenon.
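The limitation can be made concrete: running-text bracketing is only possible when every word together with all its direct and indirect dependents occupies an unbroken stretch of the sentence. The following sketch, under the same assumed row layout of (word number, word, label, head number), produces the running-text form and raises an error for a discontinuous structure. The function names and the three-word discontinuous test case are hypothetical.

```python
def running_text(rows):
    """Bracket the words in their order of occurrence, or fail on discontinuity."""
    info = {}
    children = {}
    root = None
    for number, word, label, head in rows:
        info[number] = (word, label)
        if head is None:
            root = number
        else:
            children.setdefault(head, []).append(number)

    def positions(number):
        # All word numbers in the subtree headed by this word.
        found = [number]
        for child in children.get(number, []):
            found.extend(positions(child))
        return sorted(found)

    def render(number):
        span = positions(number)
        if span[-1] - span[0] + 1 != len(span):
            raise ValueError(f"discontinuous structure headed by word {number}")
        word, label = info[number]
        # Interleave the head word with its dependents by word number.
        order = sorted(children.get(number, []) + [number])
        parts = [word if n == number else render(n) for n in order]
        return f"[{label} " + " ".join(parts) + "]"

    return render(root)


rows = [
    (1, "the", "DET", 3),
    (2, "big", "ATTR", 3),
    (3, "dog", "SUBJ", 4),
    (4, "chased", "PRED", None),
    (5, "the", "DET", 6),
    (6, "cat", "DIROBJ", 4),
]

print(running_text(rows))
```

The column scheme has no such restriction, because each word simply names its head by number regardless of what stands between them.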

As with phrase structure representations, less fine-grained analyses are possible too, as in 79 (see the discussion of bracketing single-word constituents):

(79)  [PRED [SUBJ the big dog] chased [DIROBJ the cat]]
