Lancaster/IBM Treebank and the IBM Paris Treebank

This section discusses the IBM Paris Treebank and the various Lancaster schemes. Although many changes have been made across the successive incarnations of the Lancaster syntactic annotation schemes, for our purposes they may be considered together. The Paris scheme (for French) was intended as a parallel to the Lancaster English annotation scheme, and can therefore be discussed alongside the Lancaster schemes.

Introduction

Both schemes use a constituent structure analysis called skeleton parsing, a limited form of parsing that indicates the structure of a sentence in terms of its major constituents, i.e. sentence types, predicates, clauses and major phrase types. The constituents are marked with square brackets, which correspond to the phrase structure markers of tree diagrams. The schemes were developed with two main considerations in mind:

The scheme should be as uncontroversial as possible from the linguistic point of view, and unlikely to be affected by theoretical differences, thus producing a consistent annotation.
The scheme should be simple and easy to learn.

Method of annotation

The corpora were annotated manually by a team of grammarians, using a screen editor program, which validates the parse only to the extent of checking for unbalanced brackets.
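The kind of check described above can be sketched in a few lines of Python. This is a minimal illustration only, not the editor's actual code, which is not documented here. The function name `brackets_balanced` is hypothetical, and the sketch assumes that square brackets occur only as constituent markers and never inside a word token.

```python
def brackets_balanced(parse: str) -> bool:
    """Return True if every '[' in the parse has a matching ']'."""
    depth = 0
    for ch in parse:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                # A closing bracket appeared before its opening bracket.
                return False
    # Any remaining depth means at least one constituent was never closed.
    return depth == 0


if __name__ == "__main__":
    print(brackets_balanced("[N Files_NN2 N][V come_VV0 V]"))
    print(brackets_balanced("[N Files_NN2 N][V come_VV0"))
```

A check of this kind accepts any balanced bracketing, including one whose opening and closing labels disagree (for example `[N ... P]`), which is the sense in which the validation is limited.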

Syntactic tags

The labelled constituents of the UCREL Skeleton Parsing System are shown in table 3.6:

**Table 3.6:** UCREL Skeletal Parsing System tags
Fa	Adverbial clause
Fc	Comparative clause
Fn	Noun clause
Fr	Relative clause
G	Genitive
J	Adjective phrase
N	Noun phrase
Nn	Metalinguistic constituent
Nr	Temporal adverbial noun phrase
Nv	Non-temporal adverbial noun phrase
P	Prepositional phrase
S	Direct speech
Si	Interpolated or appended sentence
Tg	-ing clause
Ti	to + infinitive clause
Tn	Past participle clause
V	Verb phrase
&	First conjunct
+	Second (etc.) conjunct
@	Discontinuity marker

Unlabelled brackets may also be used in cases where a constituent is not one of the types for which a label is authorised, and the grammarian is convinced that the sequence of words is a constituent.
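To make the bracket notation concrete, the following Python sketch reads a skeleton-parsed string into a nested tree and checks that each closing label matches its opening label. It is an illustration under stated assumptions, not part of either treebank's tooling: the names `Node`, `TOKEN` and `parse_skeleton` are hypothetical, labels are assumed to consist of letters plus the `&`, `+` and `@` markers, a labelled constituent is assumed to open with `[X` and close with `X]` as in the examples in this section, and words are assumed to be separated from labels by whitespace.

```python
import re
from dataclasses import dataclass, field

# Assumed label alphabet: letters plus the conjunct/discontinuity markers.
LABEL = r"[A-Za-z&+@]+"
TOKEN = re.compile(
    rf"\[(?P<open>{LABEL}(?=[\s\[]|$))?"   # "[X" or a bare "["
    rf"|(?P<close>{LABEL})?\]"             # "X]" or a bare "]"
    rf"|(?P<word>[^\s\[\]]+)"              # any other token, e.g. word_TAG
)


@dataclass
class Node:
    label: str                      # "" for an unlabelled constituent
    children: list = field(default_factory=list)


def parse_skeleton(text: str) -> Node:
    """Build a tree from a skeleton parse; raise ValueError if ill-formed."""
    root = Node("ROOT")
    stack = [root]
    for m in TOKEN.finditer(text):
        tok = m.group(0)
        if tok.startswith("["):
            node = Node(m.group("open") or "")
            stack[-1].children.append(node)
            stack.append(node)
        elif tok.endswith("]"):
            if len(stack) == 1:
                raise ValueError(f"unexpected closing bracket {tok!r}")
            node = stack.pop()
            closing = m.group("close") or ""
            if node.label != closing:
                raise ValueError(
                    f"opened as {node.label!r} but closed as {closing!r}"
                )
        else:
            stack[-1].children.append(tok)
    if len(stack) != 1:
        raise ValueError(f"unclosed constituent {stack[-1].label!r}")
    return root


if __name__ == "__main__":
    print(parse_skeleton("[N the_AT [ less_RGR generic_JJ ] alias_NN1 N]"))
```

Unlabelled brackets come out as nodes with an empty label, so they remain distinguishable from labelled constituents without any special casing.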

Most of the labels above represent categories that are easily identifiable and non-controversial. The tag `Nn' for a metalinguistic constituent was introduced to deal with the problem of computerese verbal-name constructions, often found in the IBM manuals being processed, e.g.

       the [Nn enter Nn] key

Similar expressions will occur in other genres, and this constituent was therefore included in the general scheme, for example:

       a [Nn rent a bike Nn] scheme

The French scheme developed at IBM Paris is almost identical to the scheme shown above, but the following labels were added:

A	Attributive adjective phrase
Pv	Adverbial prepositional phrase
Vi	Infinitive verb phrase

Examples

A partially parsed sentence from the American Printing House for the Blind corpus is shown in table 3.7:

**Table 3.7:** APHB output
B01000 1 v
[N The_AT world_NN1 N][V owes_VVZ [N a_AT1
considerable_JJ debt_NN1 [P of_IO [N gratitude_NN1
N]P]N][P[P&g to_II [N Mr._NNSB1 Byles_NP1 [N
the_AT Butcher_NN1 N]N]P&g] ,_, or_CC [P+ to_II [N
the_AT [ less_RGR generic_JJ ] alias_NN1 [Fr[P
under_II [N which_DDQ N]P][N this_DD1 indispensable_JJ
tradesman_NN1 N][V supplied_VVD [N the_AT needs_NN2
[P of_IO [N the_AT Stevenson_NP1 family_NN1 N]P]N]
[P in_II [N Bournemouth_NP1 N]P][P in_II [N 1885_MC
N]P]V]Fr]N]P+]P] ,_, [Fa since_ICS [N Stevenson_NP1 N]
[V explained_VVD [Fn that_CST [[N it_PPH1 N][V was_VBDZ
[N the_AT bills_NN2 [P of_IO [N Mr._NNSB1 Byles_NP1 N]
P]N]V]][[N which_DDQ N][V drove_VVD [N him_PPHO1 N]
[P to_II [N the_AT writing_NN1 [P of_IO [N The_AT
Strange_JJ Case_NN1 [P of_IO [N[N&g Dr._NNSB1 Jekyll_NP1
N&g] and_CC [N+ Mr._NNSB1 Hyde_NP1 ._.
N+]N]P]N]P]N]P]V]]Fn]V]Fa]V]

In this example, there is a cleft sentence within an adverbial clause, and this is marked by unlabelled brackets. The scheme makes no allowance for marking any more detail than that which is shown in the surface structure. This is only to be expected given the scheme designers' theory-neutral approach and their intention to be uncontroversial.

The following example shows a parsed sentence from the IBM Computer Manual Corpus:

    

    M1154602 v
    [N Files_NN2 N][V[V& come_VV0 [P into_II
    [N the_AT print_NN1 queue_NN1 N]P]V&]
    and_CC [V+ either_LE [V&[V& match_VV0
    [N[G a_AT1 printer_NN1 's_$ G] setup_NN1
    N]V&] (_( [V+ get_VV0 [Tn printed_VVN
    Tn]V+] )_) V&] or_CC [V+[V& do_VD0
    not_XX match_VVI V&] (_( [V+ wait_VV0
    V+] )_) V+] V+]V] ._.

In this short sentence, complex verbal coordination is shown, exhibiting the use of the `&' and `+' signs for coordination.
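As a rough illustration of how the coordination markers can be picked out mechanically, the Python sketch below counts opening brackets whose label carries `&` (first conjunct) or `+` (subsequent conjunct), grouped by category. The names `CONJUNCT_OPEN`, `conjunct_counts` and `SAMPLE` are hypothetical, and the pattern assumes that a conjunct label is a run of letters followed directly by the marker.

```python
import re
from collections import Counter

# "[" + category letters + "&" or "+", e.g. "[V&" or "[V+".
CONJUNCT_OPEN = re.compile(r"\[([A-Za-z]+)([&+])")

# The IBM Computer Manual Corpus sentence shown above, joined onto one line.
SAMPLE = (
    "[N Files_NN2 N][V[V& come_VV0 [P into_II "
    "[N the_AT print_NN1 queue_NN1 N]P]V&] "
    "and_CC [V+ either_LE [V&[V& match_VV0 "
    "[N[G a_AT1 printer_NN1 's_$ G] setup_NN1 "
    "N]V&] (_( [V+ get_VV0 [Tn printed_VVN "
    "Tn]V+] )_) V&] or_CC [V+[V& do_VD0 "
    "not_XX match_VVI V&] (_( [V+ wait_VV0 "
    "V+] )_) V+] V+]V] ._."
)


def conjunct_counts(parse: str) -> Counter:
    """Count conjunct openings as (category, marker) pairs."""
    return Counter(CONJUNCT_OPEN.findall(parse))


if __name__ == "__main__":
    for (category, marker), n in sorted(conjunct_counts(SAMPLE).items()):
        print(category, marker, n)
```

Only opening brackets are counted, so each conjunct is counted once however deeply the coordination is nested.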

Below are two example parsed sentences from the French corpus, the second of which illustrates the labelling of coordinated sentences, which follows the Lancaster approach for English:

[N Vous_PPSA5MS N] [V accedez_VINIP5 [P a_PREPA [N
cette_DDEMFS session_NCOFS N] P] [Pv a_PREP31 partir_PREP32
de_PREP33 [N la_DARDFS fenetre_NCOFS [A Gestionnaire_AJQFS [P
de_PREPD [N taches_NCOFP N] P] A] N] Pv] V] ._.

[Z [Z& [N L' industrie N] [V n'EG a pas reussi [P [Vi etablir [N des
prix [A satisfaisants A] N] [Pv pour [N ces grains N] Pv] Vi] P] V] Z&]
 , mais [Z+ [N le ministre N] [N lui N] , [V a reussi V] Z+] Z] .
