
Preliminary Recommendations

Constraint Grammar

Introduction

The Constraint Grammar framework (Karlsson et al., 1995) has been implemented together with two-level morphology to produce a system for the syntactic analysis of unrestricted text. The most comprehensive implementation is ENGCG (for the analysis of written English), and systems are currently being developed for Finnish, Swedish, Danish, German, Basque and French.

The Constraint Grammar framework differs from the syntactically annotated corpora studied in this section in three respects:

  1. The aim of this work was the production of a system that may be used to annotate corpora, rather than the annotation of a particular corpus.
  2. The system produces a shallow surface syntactic analysis, based on dependency-oriented syntax, while other syntactically annotated corpora tend to use a constituent structure analysis.
  3. Ambiguities that the system does not resolve (affecting 3-7% of all words) are left in the output.

The analysis is carried out at word level, and all text words receive one or more morphosyntactic analyses, consisting of:

Method of annotation

Both the morphological and syntactic analysers use rule-based linguistic descriptions. The system works in the following way:

  1. Tokenisation;
  2. Lookup of morphological tags;
    1. Lexical component;
    2. Guesser;
  3. Resolution of morphological ambiguities;
  4. Lookup of syntactic tags;
  5. Resolution of syntactic ambiguities.
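
The five steps above can be sketched as a chain of functions over a list of words, each carrying its competing readings. This is only an illustrative skeleton: the `Cohort` class, the function names, the toy lexicon and the single toy constraint are all assumptions made for this sketch, not part of ENGCG.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Cohort:
    """A text word together with its competing readings (hypothetical structure)."""
    form: str
    readings: List[str] = field(default_factory=list)


Stage = Callable[[List[Cohort]], List[Cohort]]


def tokenise(text: str) -> List[Cohort]:
    # Step 1: whitespace splitting stands in for the real tokeniser.
    return [Cohort(form) for form in text.lower().split()]


def morphological_lookup(cohorts: List[Cohort]) -> List[Cohort]:
    # Step 2: toy lexicon; unknown words fall back to a nominal reading.
    toy_lexicon = {"the": ["DET"], "work": ["N", "V"]}
    for cohort in cohorts:
        cohort.readings = list(toy_lexicon.get(cohort.form, ["N"]))
    return cohorts


def resolve_morphology(cohorts: List[Cohort]) -> List[Cohort]:
    # Step 3: one toy constraint: no verb reading directly after a determiner.
    for previous, current in zip(cohorts, cohorts[1:]):
        if previous.readings == ["DET"] and len(current.readings) > 1:
            current.readings = [r for r in current.readings if r != "V"]
    return cohorts


def syntactic_lookup(cohorts: List[Cohort]) -> List[Cohort]:
    # Step 4: attach every candidate syntactic tag to each surviving reading.
    toy_map = {"DET": ["@DN>"], "N": ["@SUBJ", "@OBJ"], "V": ["@+FMAINV"]}
    for cohort in cohorts:
        cohort.readings = [
            f"{reading} {tag}"
            for reading in cohort.readings
            for tag in toy_map[reading]
        ]
    return cohorts


def resolve_syntax(cohorts: List[Cohort]) -> List[Cohort]:
    # Step 5: this sketch has no syntactic constraints, so ambiguity remains.
    return cohorts


def analyse(text: str, stages: List[Stage]) -> List[Cohort]:
    cohorts = tokenise(text)
    for stage in stages:
        cohorts = stage(cohorts)
    return cohorts


STAGES = [morphological_lookup, resolve_morphology, syntactic_lookup, resolve_syntax]
result = analyse("The work", STAGES)
for cohort in result:
    print(cohort.form, cohort.readings)
```

In this toy run the verb reading of "work" is discarded at step 3, while the choice between subject and object survives to the output, which mirrors the way unresolved ambiguities are left in place.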

Tokenisation

The tokeniser identifies punctuation and multiword units, and splits enclitic forms into grammatical words.
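
A minimal sketch of these three tasks follows. The multiword list, the enclitic list and the underscore used to join multiword units are assumptions made for illustration; they do not reproduce the ENGCG tokeniser or its output conventions.

```python
import re
from typing import List

# Hypothetical toy resources, not ENGCG's actual lists.
MULTIWORD_UNITS = {("in", "spite", "of"), ("as", "well", "as")}
ENCLITICS = ("n't", "'ll", "'re", "'ve", "'s", "'d", "'m")


def split_enclitic(word: str) -> List[str]:
    """Split a form such as "doesn't" into the grammatical words "does" + "n't"."""
    for clitic in ENCLITICS:
        if word.lower().endswith(clitic) and len(word) > len(clitic):
            return [word[: -len(clitic)], word[-len(clitic):]]
    return [word]


def tokenise(text: str) -> List[str]:
    # Separate punctuation marks from words, keeping word-internal apostrophes.
    raw = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)
    tokens: List[str] = []
    i = 0
    while i < len(raw):
        for unit in MULTIWORD_UNITS:
            window = raw[i : i + len(unit)]
            if tuple(w.lower() for w in window) == unit:
                # Join a recognised multiword unit into a single token.
                tokens.append("_".join(window))
                i += len(unit)
                break
        else:
            tokens.extend(split_enclitic(raw[i]))
            i += 1
    return tokens


print(tokenise("She doesn't stop, in spite of rain."))
```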

Morphological lookup

This process begins with a lexical analysis based on a large lexicon that includes all inflected and central derived word forms. The lexical analyser assigns all possible morphological analyses to each word found in the lexicon; the remaining words are analysed by the guesser, a heuristic rule-based module. The guesser's rules are mainly governed by word shape, and if none of them applies, a nominal analysis is given.
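
The lookup order described here (lexicon first, then the guesser, then a nominal default) might be sketched as below. The lexicon entries and the two word-shape rules are invented for illustration, and the tag strings are only modelled on the example output later in this section.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Toy lexicon: each known form maps to all of its possible analyses.
LEXICON: Dict[str, List[str]] = {
    "started": ["V PAST VFIN", "PCP2"],
    "work": ["N NOM SG", "V INF"],
}

# Hypothetical word-shape rules for the guesser.
GUESSER_RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r".+ly$"), "ADV"),
    (re.compile(r".+ed$"), "PCP2"),
]

NOMINAL_DEFAULT = "N NOM SG"


def analyse_word(form: str) -> List[str]:
    key = form.lower()
    if key in LEXICON:
        # Known word: return every analysis, leaving ambiguity for later stages.
        return list(LEXICON[key])
    # Unknown word: collect the analyses of all shape rules that match.
    guesses = [tag for pattern, tag in GUESSER_RULES if pattern.match(key)]
    # If no rule applies, fall back to a nominal analysis.
    return guesses or [NOMINAL_DEFAULT]


for word in ["work", "quickly", "blorf"]:
    print(word, analyse_word(word))
```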

Resolution of morphological ambiguities

The rule-based Constraint Grammar parser is used to resolve some of the ambiguities at this stage. The constraints are partial paraphrases of form definitions of syntactic constructs such as the noun phrase. The English grammar, for example, contains about 1,200 grammar-based constraints, plus 200 heuristic constraints.
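
The general idea of a constraint, discarding a reading when its context rules it out, can be illustrated with a small sketch. The single constraint below is invented for this example and is far cruder than the real ENGCG rules; the sketch also assumes that a constraint may never remove the last remaining reading of a word.

```python
from typing import Callable, List

Sentence = List[List[str]]  # for each word, the readings still alive
Constraint = Callable[[Sentence, int], List[str]]


def no_verb_after_determiner(sentence: Sentence, i: int) -> List[str]:
    """Toy constraint: drop verb readings directly after an unambiguous determiner."""
    if i > 0 and sentence[i - 1] == ["DET"]:
        return [r for r in sentence[i] if not r.startswith("V")]
    return sentence[i]


def apply_constraints(sentence: Sentence, constraints: List[Constraint]) -> Sentence:
    changed = True
    while changed:  # repeat, since one removal may enable another constraint
        changed = False
        for i in range(len(sentence)):
            for constraint in constraints:
                kept = constraint(sentence, i)
                # Never discard the last reading; only accept real reductions.
                if kept and len(kept) < len(sentence[i]):
                    sentence[i] = kept
                    changed = True
    return sentence


readings = [["DET"], ["N", "V"], ["V", "N"]]
print(apply_constraints(readings, [no_verb_after_determiner]))
```

The third word keeps both readings because no constraint in this toy grammar applies to it, which is how partially disambiguated output arises.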

Syntactic lookup

All possible syntactic tags are introduced for each word. This could, in some cases, mean that more than ten alternatives are given for one morphological reading.
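
To make the size of this expansion concrete, the sketch below attaches candidate syntactic tags to a morphological reading. Which tags a given part of speech may actually receive in ENGCG is not stated here, so the mapping is an assumption built from the tag names in table 3.4 and from the example output.

```python
from typing import Dict, List

# Hypothetical mapping from part of speech to candidate syntactic tags.
CANDIDATES: Dict[str, List[str]] = {
    "N": ["@SUBJ", "@OBJ", "@I-OBJ", "@PCOMPL-S", "@PCOMPL-O", "@APP",
          "@NPHR", "@O-ADVL", "@ADVL", "@NN>", "@<P"],
    "DET": ["@DN>"],
    "PREP": ["@<NOM", "@ADVL"],
}


def syntactic_lookup(reading: str) -> List[str]:
    """Return one expanded reading per candidate syntactic tag."""
    # The part of speech is taken to be the first token that is not an
    # angle-bracketed feature such as <Indef>.
    pos = next((t for t in reading.split() if not t.startswith("<")), "")
    return [f"{reading} {tag}" for tag in CANDIDATES.get(pos, [])]


expanded = syntactic_lookup("N NOM SG")
print(len(expanded))
print(syntactic_lookup("<Indef> DET CENTRAL ART SG"))
```

Under this assumed mapping a single noun reading expands to eleven alternatives, while a determiner stays unambiguous.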

Resolution of syntactic ambiguities

The parser finally consults a syntactic disambiguation grammar. The English version of the Constraint Grammar contains 800 syntactic constraints, of a similar form to the rules at the morphological resolution stage.

Syntactic tags

The English version of the Constraint Grammar marks the syntactic functions shown in table 3.4:

 

Tag            Function
@+FAUXV        finite auxiliary verb
@-FAUXV        nonfinite auxiliary verb
@+FMAINV       finite main verb
@-FMAINV       nonfinite main verb
@SUBJ          subject
@F-SUBJ        formal subject
@OBJ           object
@I-OBJ         indirect object
@PCOMPL-S      subject complement
@PCOMPL-O      object complement
@APP           apposition
@NPHR          stray nominal
@N             title
@O-ADVL        object adverbial
@ADVL          adverbial
@DN>           determiner
@NN>           premodifying noun
@AN>           premodifying adjective
@QN>           premodifying quantifier
@GN>           premodifying genitive
@AD-A>         premodifying ad-adjective
@<AD-A         postmodifying ad-adjective
@<NOM-FMAINV   postmodifying nonfinite verb
@<NOM          other postmodifier
@<P-FMAINV     nonfinite verb as complement of preposition
@<P            other complement of preposition
@CC            coordinator
@CS            subordinator
@INFMARK       infinitive marker
Table 3.4: ENGCG tags

Example

As mentioned above, the syntactic tags are distinguished by the use of the `@' sign. The analysis is dependency-based, but only partially. As can be seen in table 3.5, dependency relations are indicated by left and right angle brackets, which show that a word depends on another word to either its left or its right. In the example below, Karlsson is marked as `@<P', meaning that it is the complement of a preposition found to its left (in this case the preposition by).

 



"<*i>" 
 "i" <*> <NonMod> PRON PERS NOM SG1 SUBJ @SUBJ
"<started>" 
 "start" <SV> <SVO> <P/on> V PAST VFIN @+FMAINV
"<work>" 
 "work" N NOM SG @OBJ
"<on>" 
 "on" PREP @ADVL
"<an>" 
 "an" <Indef> DET CENTRAL ART SG @DN>
"<*english>" 
 "english" <*> <Nominal> A ABS @AN>
"<description>" 
 "description" N NOM SG @<P
"<within>" 
 "within" PREP @<NOM @ADVL
"<the>" 
 "the" <Def> DET CENTRAL ART SG/PL @DN>
"<*constraint>" 
 "constraint" <*> N NOM SG @NN>
"<*grammar>" 
 "grammar" <*> N NOM SG @NN>
"<framework>" 
 "framework" N NOM SG @<P
"<proposed>" 
 "propose" <Vcog> <SVO> <SV> PCP2 @<NOM-FMAINV
"<by>" 
 "by" PREP @ADVL
"<*karlsson>" 
 "karlsson" <*> <Proper> N NOM SG @<P
"<$[>" 
"<1990>" 
 "1990" <1900> NUM CARD @ADVL
"<$;>" 
"<1994a>" 
 "1994a" <1994a> NUM CARD @ADVL
Table 3.5: ENGCG output
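
Output in this layout can be read back with a few lines of code. The sketch below assumes that word forms appear on lines of the shape "<form>" and that each following line holds one reading, with the base form in quotes, `@`-prefixed syntactic tags and all other items treated as morphological features; the function names and the dictionary layout are illustrative choices.

```python
import re
from typing import Dict, List, Optional

SAMPLE = '''"<on>"
 "on" PREP @ADVL
"<an>"
 "an" <Indef> DET CENTRAL ART SG @DN>
"<description>"
 "description" N NOM SG @<P
"<within>"
 "within" PREP @<NOM @ADVL
'''


def parse(text: str) -> List[Dict]:
    cohorts: List[Dict] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        form = re.match(r'^"<(.*)>"\s*$', line)
        if form:
            # A new word form starts a new entry.
            cohorts.append({"form": form.group(1), "readings": []})
        elif cohorts:
            parts = line.split()
            cohorts[-1]["readings"].append({
                "lemma": parts[0].strip('"'),
                "features": [p for p in parts[1:] if not p.startswith("@")],
                "syntax": [p for p in parts[1:] if p.startswith("@")],
            })
    return cohorts


def head_direction(tag: str) -> Optional[str]:
    """Return where the head lies, judging by the angle bracket in the tag."""
    if tag.startswith("@<"):
        return "left"
    if tag.endswith(">"):
        return "right"
    return None  # the tag names a function without pointing at a head


cohorts = parse(SAMPLE)
for cohort in cohorts:
    for reading in cohort["readings"]:
        for tag in reading["syntax"]:
            print(cohort["form"], tag, head_direction(tag))
```

Note that "within" carries two syntactic tags on one reading, which is how an unresolved syntactic ambiguity appears in the output.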




