Constraint Grammar


Preliminary Recommendations

The Constraint Grammar framework (Karlsson et al., 1995) has been implemented together with Two-level Morphology to produce a system for the syntactic analysis of unrestricted text. The most comprehensive system is ENGCG (for the analysis of written English), but systems are currently being developed for Finnish, Swedish, Danish, German, Basque and French.

The Constraint Grammar framework differs from the syntactically annotated corpora under study in this section in three respects:

The aim of this work was the production of a system that may be used to annotate corpora, rather than the annotation of a particular corpus.
The system produces a shallow surface syntactic analysis, based on dependency-oriented syntax, while other syntactically annotated corpora tend to use a constituent structure analysis.
Ambiguities (occurring in 3-7% of all words) are left unresolved.

The analysis is carried out at word level, and all text words receive one or more morphosyntactic analyses, consisting of:

A base form;
The morphological analysis (including a part of speech tag, inflection, derivation and subcategorisation); and
Dependency-oriented syntactic word tags (marked by the `@' sign).

Method of annotation

Both the morphological and syntactic analysers use rule-based linguistic descriptions. The system works in the following way:

Tokenisation;
Lookup of morphological tags:
    1. Lexical component;
    2. Guesser;
Resolution of morphological ambiguities;
Lookup of syntactic tags;
Resolution of syntactic ambiguities.

Tokenisation

The tokeniser identifies punctuation and multiword units, and splits enclitic forms into grammatical words.
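
The three tasks of the tokeniser can be illustrated with a minimal sketch. The multiword list, the enclitic list and the function names below are hypothetical toy stand-ins, not ENGCG's actual resources, and the underscore used to join multiword units is an arbitrary choice made for this example.

```python
import re

# Hypothetical toy resources; a real system would use far larger lists.
MULTIWORD_UNITS = {("in", "spite", "of"), ("as", "well", "as")}
ENCLITICS = ("n't", "'ll", "'re", "'ve", "'s", "'d", "'m")


def split_enclitic(word):
    """Split a trailing enclitic off a word, e.g. "doesn't" -> ["does", "n't"]."""
    lower = word.lower()
    for clitic in ENCLITICS:
        if lower.endswith(clitic) and len(word) > len(clitic):
            return [word[: -len(clitic)], word[-len(clitic):]]
    return [word]


def tokenise(text):
    """Identify punctuation, split enclitic forms, and join multiword units."""
    # Words (optionally with an internal apostrophe) and single punctuation marks.
    raw = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)

    # Split enclitic forms into grammatical words.
    words = []
    for item in raw:
        words.extend(split_enclitic(item))

    # Join known multiword units, trying the longest candidate first.
    tokens, i = [], 0
    max_len = max(len(unit) for unit in MULTIWORD_UNITS)
    while i < len(words):
        for n in range(max_len, 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if candidate in MULTIWORD_UNITS:
                tokens.append("_".join(words[i:i + n]))
                i += n
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens


print(tokenise("She doesn't go, in spite of it."))
```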

Morphological lookup

This process begins with a lexical analysis based on a large lexicon including all inflected and central derived word forms. The lexical analyser assigns all possible morphological analyses to each word that is in the lexicon, and the remaining words are assigned an analysis by means of the guesser (a heuristic rule-based module). These rules are mainly governed by word shape, and if none of them apply, then a nominal analysis is given.
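
The fallback order described above (lexicon first, then the guesser, then a default nominal analysis) might be sketched as follows. The lexicon entries and guesser rules are invented for illustration, and the tag strings merely imitate the style of the ENGCG output shown later in this section; none of this is taken from the real system.

```python
# Hypothetical toy lexicon: word form -> list of (base form, morphological tags).
LEXICON = {
    "started": [("start", "V PAST VFIN"), ("start", "PCP2")],
    "work": [("work", "N NOM SG")],
    "the": [("the", "DET CENTRAL ART SG/PL")],
}

# Hypothetical guesser rules driven by word shape:
# (suffix, function deriving a base form, tags to assign).
GUESSER_RULES = [
    ("ed", lambda form: form[:-2], "PCP2"),
    ("ly", lambda form: form, "ADV"),
]


def analyse(word):
    """Return every morphological analysis for a word."""
    form = word.lower()

    # 1. Lexical component: all analyses listed in the lexicon.
    if form in LEXICON:
        return list(LEXICON[form])

    # 2. Guesser: heuristic rules based on word shape.
    guesses = [
        (to_base(form), tags)
        for suffix, to_base, tags in GUESSER_RULES
        if form.endswith(suffix)
    ]
    if guesses:
        return guesses

    # 3. No rule applied: fall back to a nominal analysis.
    return [(form, "N NOM SG")]


for word in ("started", "blorked", "zog"):
    print(word, analyse(word))
```

Note that "started" keeps both of its readings at this point; choosing between them is left to the disambiguation stage.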

Resolution of morphological ambiguities

The rule-based Constraint Grammar parser is used to resolve some of the ambiguities at this stage. The constraints are partial paraphrases of form definitions of syntactic constructs such as the noun phrase. The English grammar, for example, contains about 1,200 grammar-based constraints, plus 200 heuristic constraints.
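
To give a feel for how a constraint discards readings, here is a minimal sketch of a single textbook-style constraint: a word directly preceded by an unambiguous determiner cannot be a verb. This is an illustration only and is not one of the actual ENGCG constraints; the data layout and the function name are assumptions. The sketch also assumes that the last remaining reading of a word is never discarded, so ambiguity can only shrink, never disappear into an empty analysis.

```python
def remove_verb_after_determiner(cohorts):
    """Discard verb readings of a word preceded by an unambiguous determiner.

    `cohorts` is a list of (word, readings) pairs, where each reading is a
    list of tags. A new list is returned; the input is not modified.
    """
    result = []
    for i, (word, readings) in enumerate(cohorts):
        kept = readings
        if i > 0:
            previous = cohorts[i - 1][1]
            # The context condition holds only if every reading is a determiner.
            if previous and all("DET" in reading for reading in previous):
                non_verb = [r for r in readings if "V" not in r]
                # Never remove the last remaining reading.
                if non_verb:
                    kept = non_verb
        result.append((word, kept))
    return result


sentence = [
    ("the", [["DET", "CENTRAL", "ART", "SG/PL"]]),
    ("work", [["N", "NOM", "SG"], ["V", "INF"]]),
]
print(remove_verb_after_determiner(sentence))
```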

Syntactic lookup

All possible syntactic tags are introduced for each word. This could, in some cases, mean that more than ten alternatives are given for one morphological reading.
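
As a rough sketch of this step, each morphological reading can be paired with every syntactic tag that its part of speech could in principle carry. The mapping below is a hypothetical simplification that reuses tag names from table 3.4; the real mapping rules are not reproduced here.

```python
# Hypothetical mapping from part of speech to candidate syntactic tags.
CANDIDATE_TAGS = {
    "N": ["@SUBJ", "@OBJ", "@I-OBJ", "@PCOMPL-S", "@PCOMPL-O",
          "@APP", "@NPHR", "@NN>", "@<P"],
    "V": ["@+FAUXV", "@-FAUXV", "@+FMAINV", "@-FMAINV"],
    "PREP": ["@<NOM", "@ADVL"],
    "DET": ["@DN>"],
    "A": ["@AN>", "@PCOMPL-S", "@PCOMPL-O"],
}


def add_syntactic_tags(reading):
    """Pair a morphological reading with all of its candidate syntactic tags."""
    tags = reading.split()
    for part_of_speech, candidates in CANDIDATE_TAGS.items():
        if part_of_speech in tags:
            return reading, list(candidates)
    # Unknown part of speech: no candidates in this toy mapping.
    return reading, []


print(add_syntactic_tags("N NOM SG"))
```

Even in this toy mapping a single noun reading receives nine alternatives, which shows why a separate syntactic disambiguation pass is needed.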

Resolution of syntactic ambiguities

The parser finally consults a syntactic disambiguation grammar. The English version of the Constraint Grammar contains 800 syntactic constraints, of a similar form to the rules at the morphological resolution stage.

Syntactic tags

The English version of the Constraint Grammar marks the syntactic functions shown in table 3.4:

**Table 3.4:** ENGCG tags
@+FAUXV	finite auxiliary verb
@-FAUXV	nonfinite auxiliary verb
@+FMAINV	finite main verb
@-FMAINV	nonfinite main verb
@SUBJ	subject
@F-SUBJ	formal subject
@OBJ	object
@I-OBJ	indirect object
@PCOMPL-S	subject complement
@PCOMPL-O	object complement
@APP	apposition
@NPHR	stray nominal
@N	title
@O-ADVL	object adverbial
@ADVL	adverbial
@DN>	determiner
@NN>	premodifying noun
@AN>	premodifying adjective
@QN>	premodifying quantifier
@GN>	premodifying genitive
@AD-A>	premodifying ad-adjective
@<AD-A	postmodifying ad-adjective
@<NOM-FMAINV	postmodifying nonfinite verb
@<NOM	other postmodifier
@<P-FMAINV	nonfinite verb as complement of preposition
@<P	other complement of preposition
@CC	coordinator
@CS	subordinator
@INFMARK	infinitive marker

Example

As mentioned above, the syntactic tags are distinguished by the `@' sign. The analysis is only partially dependency-based. As can be seen in table 3.5, dependency relations are shown by left and right angle brackets, which indicate that a word depends on another word to either its left or its right. In the example below, Karlsson is marked as `@<P', meaning that it is the complement of a preposition found to its left (in this case the preposition by).

**Table 3.5:** ENGCG output
"<*i>"
	"i" <*> <NonMod> PRON PERS NOM SG1 SUBJ @SUBJ
"<started>"
	"start" <SV> <SVO> <P/on> V PAST VFIN @+FMAINV
"<work>"
	"work" N NOM SG @OBJ
"<on>"
	"on" PREP @ADVL
"<an>"
	"an" <Indef> DET CENTRAL ART SG @DN>
"<*english>"
	"english" <*> <Nominal> A ABS @AN>
"<description>"
	"description" N NOM SG @<P
"<within>"
	"within" PREP @<NOM @ADVL
"<the>"
	"the" <Def> DET CENTRAL ART SG/PL @DN>
"<*constraint>"
	"constraint" <*> N NOM SG @NN>
"<*grammar>"
	"grammar" <*> N NOM SG @NN>
"<framework>"
	"framework" N NOM SG @<P
"<proposed>"
	"propose" <Vcog> <SVO> <SV> PCP2 @<NOM-FMAINV
"<by>"
	"by" PREP @ADVL
"<*karlsson>"
	"karlsson" <*> <Proper> N NOM SG @<P
"<$`[`>"
"<1990>"
	"1990" <1900> NUM CARD @ADVL
"<$;>"
"<1994a>"
	"1994a" <1994a> NUM CARD @ADVL

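The output format in table 3.5 is regular enough to be read mechanically: a line beginning with `"<` opens a new word, and each indented line below it is one reading. The following sketch parses that layout and derives the direction of the head from the angle bracket in a syntactic tag. The function names and the dictionary layout are assumptions made for this example and are not part of any ENGCG tool.

```python
def parse_cohorts(text):
    """Parse ENGCG-style output into (word form, [readings]) pairs."""
    cohorts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith('"<'):
            # Word-form line such as "<by>": drop the surrounding "< and >".
            cohorts.append((line.strip()[2:-2], []))
        elif cohorts:
            # Reading line: base form, then tags; syntactic tags start with @.
            fields = line.split()
            cohorts[-1][1].append({
                "base": fields[0].strip('"'),
                "tags": [f for f in fields[1:] if not f.startswith("@")],
                "syntax": [f for f in fields[1:] if f.startswith("@")],
            })
    return cohorts


def head_direction(tag):
    """Return where the head lies relative to the word, or None if unmarked."""
    if tag.startswith("@<"):
        return "left"
    if tag.endswith(">"):
        return "right"
    return None


SAMPLE = "\n".join([
    '"<by>"',
    '\t"by" PREP @ADVL',
    '"<*karlsson>"',
    '\t"karlsson" <*> <Proper> N NOM SG @<P',
])

for word, readings in parse_cohorts(SAMPLE):
    for reading in readings:
        for tag in reading["syntax"]:
            print(word, tag, head_direction(tag))
```

A word that still carries more than one syntactic tag after parsing, such as "within" in table 3.5, is one of the ambiguities that the system leaves unresolved.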
"<*i>" "i" <*> <NonMod> PRON PERS NOM SG1 SUBJ @SUBJ "<started>" "start" <SV> <SVO> <P/on> V PAST VFIN @+FMAINV "<work>" "work" N NOM SG @OBJ "<on>" "on" PREP @ADVL "<an>" "an" <Indef> DET CENTRAL ART SG @DN> "<*english>" "english" <*> <Nominal> A ABS @AN> "<description>" "description" N NOM SG @<P "<within>" "within" PREP @<NOM @ADVL "<the>" "the" <Def> DET CENTRAL ART SG/PL @DN> "<*constraint>" "constraint" <*> N NOM SG @NN> "<*grammar>" "grammar" <*> N NOM SG @NN> "<framework>" "framework" N NOM SG @<P "<proposed>" "propose" <Vcog> <SVO> <SV> PCP2 @<NOM-FMAINV "<by>" "by" PREP @ADVL "<*karlsson>" "karlsson" <*> <Proper> N NOM SG @<P "<$[>" "<1990>" "1990" <1900> NUM CARD @ADVL "<$;>" "<1994a>" "1994a" <1994a> NUM CARD @ADVL Table 3.5: ENCG output
