A comparative overview


Preliminary Recommendations


In what follows, the different lexicons are characterised in more detail with respect to those aspects which are taken to be relevant to the present overview. These aspects define a data-driven, observational basis of comparison which will be beneficial to an overall assessment of the quantity and quality of linguistic information that is shared as a common core by all lexicons.

Number of arguments

There appears to be substantial agreement on the need for information on argument number, which is coded either directly, or indirectly through argument listing within the relevant subcategorisation pattern or through the use of shorthand grammar codes. This issue is closely related to the way droppable arguments are represented (argument optionality): a verb with two arguments, one of which is droppable, can be represented either as two separate entries (one with two arguments and the other with the obligatory argument only), or as one complex entry, where argument optionality is expressed as an extra property of a single frame. For more detailed information on these aspects, see the section on frame alternation and argument optionality. Here, we concern ourselves only with whether and how argument number information is taken into account.
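The two representations of a droppable argument can be related mechanically. The following is a minimal sketch, not taken from any of the lexicons surveyed: the names Argument, Frame, expand_optional and argument_counts are hypothetical, and the verb eat with an optional NP object is used purely as an illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Argument:
    category: str           # e.g. "NP"
    optional: bool = False  # True if the argument is droppable


@dataclass(frozen=True)
class Frame:
    verb: str
    arguments: Tuple[Argument, ...]


def expand_optional(frame: Frame) -> List[Frame]:
    """Turn one complex entry into the equivalent set of separate entries,
    one for each way of keeping or dropping the optional arguments."""
    results: List[Tuple[Argument, ...]] = [()]
    for arg in frame.arguments:
        kept = Argument(arg.category, optional=False)
        with_arg = [r + (kept,) for r in results]
        # An optional argument doubles the alternatives: kept or dropped.
        results = with_arg + results if arg.optional else with_arg
    return [Frame(frame.verb, r) for r in results]


def argument_counts(frame: Frame) -> List[int]:
    """Argument numbers that can be inferred from a single complex entry."""
    return sorted({len(f.arguments) for f in expand_optional(frame)})


# Hypothetical complex entry: subject NP plus a droppable object NP.
eat = Frame("eat", (Argument("NP"), Argument("NP", optional=True)))
```

Under these assumptions, expand_optional(eat) yields one two-argument frame and one frame with the obligatory argument only, which is the "two separate entries" representation described above.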

The only lexicon which explicitly encodes this information, through a dedicated feature which takes a numeric value, is ILCLEX. In all the other lexicons, it has to be inferred.

In Acquilex a primary classification of verb types is obtained in terms of predicate arity, which is logically specified as a conjunction of formulae whose main predicates are (thematic) relations between eventualities and individuals. The number of arguments can be inferred from the number of such individuals.

In Eurotra, the number of obligatory complements/arguments of the verb is treated at the levels of both relational and interface structure. The maximum number of complements in the relational structure is determined by the maximum number of arguments in the interface structure and is equal to 4, a number judged appropriate for general language.

The number of complements in Genelex corresponds to the number of complement positions defined within the subcategorisation pattern of each lexical entry. There is no limit to the number of complements; the maximum number is determined by the lexicographer.

In Comlex the number of arguments can be inferred from the number of complements specified at the level of grammatical structure. Note that Comlex, at the level of constituent structure, does not specify the subject argument in the complement list, unless the subject is subject to some morphosyntactic constraint (as in verbs which take plural subjects only: see also the section on morphosyntactic constraints).

In both LDOCE and PLNLP, subcategorisation information is expressed through use of conventional grammatical codes rather than argument listing (e.g. TRANS for transitive verbs in PLNLP, or I for intransitive verbs with no object in LDOCE). Thus, the number of arguments is to be recomputed on demand on the basis of the code (although this is not possible in all cases).
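Recomputing the number of arguments from a grammar code amounts to a table lookup that may fail. The sketch below is illustrative only: the table CODE_ARITY and the function infer_argument_count are hypothetical names, and the table covers just the codes mentioned in this overview, with arities assumed from their glosses (subject counted as an argument).

```python
from typing import Dict, Optional, Tuple

# Hypothetical lookup table: (lexicon, grammar code) -> number of arguments.
# Only codes mentioned in this overview are listed; the arities are assumed
# from their glosses, not copied from the lexicons themselves.
CODE_ARITY: Dict[Tuple[str, str], int] = {
    ("LDOCE", "I"): 1,      # intransitive: subject only
    ("LDOCE", "T1"): 2,     # transitive with an NP object
    ("LDOCE", "T3"): 2,     # transitive with an infinitival clause as object
    ("PLNLP", "TRANS"): 2,  # transitive
}


def infer_argument_count(lexicon: str, code: str) -> Optional[int]:
    """Return the argument number implied by a grammar code,
    or None when it cannot be recovered from the code alone."""
    return CODE_ARITY.get((lexicon, code))
```

Returning None rather than guessing reflects the caveat above that the recomputation is not possible in all cases.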

Argument syntactic category

The syntactic category of arguments is indicated in all lexicons, albeit indirectly in some cases. The set of categories used may vary considerably. Furthermore, differences in surface syntactic realisation of complements may have different consequences on the definition of what constitutes a unique lexical entry. In what follows, these two aspects are considered in more detail for each lexicon.

In Eurotra, the syntactic realisation of arguments is dealt with at the level of relational structure. The following phrasal categories may function as verbal complements: ADVP (adverbial phrase), AP (adjective phrase), NP (noun phrase), PP (prepositional phrase) and S (completive sentence -- which may be finite or non finite). Different syntactic realisations of the same complement can be collapsed within the same lexical entry as long as their functional role does not vary.

In Genelex a complement may be realised either by a terminal or a non-terminal category, and this information is to be encoded obligatorily:

The set of non terminal categories includes NP (noun phrase), PP (prepositional phrase), AP (adjective phrase), ADVP (adverbial phrase), DETP (determiner phrase), VP (verbal phrase) and S (sentence).
The set of terminal categories includes noun, adjective, adverb, verb, preposition, conjunction, interjection, determiner, pronoun, particle.

Alternative syntactic realisations of the same complement are expressed through a distribution paradigm. To give a concrete example, the distribution paradigm of the direct object complement of the verb expect will specify the syntactic categories NP, that-clause and infinitival clause to account for the following examples:

He expected some benefits.
He expected to receive some benefits.
He expected that they would receive some benefits.

Features of different types can be combined with these category specifications to allow for more fine grained restrictions, including lexical, morphological, morphosyntactic, syntactic, syntacticosemantic and semantic ones. For example, the syntactic category S can be further specified as to its type, e.g. completive, indirect-interrogative, or infinitive.
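A distribution paradigm can be pictured as a set of alternative realisations attached to one complement position, each optionally refined by features. The sketch below is an assumption-laden illustration, not the Genelex format: the class names, the feature name clause_type, and the mapping of the that-clause to S with type completive are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Realisation:
    category: str                                # e.g. "NP", "S"
    features: Tuple[Tuple[str, str], ...] = ()   # extra restrictions


@dataclass(frozen=True)
class Position:
    function: str
    paradigm: Tuple[Realisation, ...]

    def licenses(self, category: str, **features: str) -> bool:
        """True if some realisation in the paradigm has this category and
        all of its feature restrictions are met by the candidate."""
        return any(
            r.category == category
            and all(features.get(name) == value for name, value in r.features)
            for r in self.paradigm
        )


# Hypothetical encoding of the direct object position of "expect".
expect_object = Position(
    function="direct object",
    paradigm=(
        Realisation("NP"),
        Realisation("S", (("clause_type", "completive"),)),  # that-clause (assumed mapping)
        Realisation("S", (("clause_type", "infinitive"),)),
    ),
)
```

With this encoding, expect_object.licenses("S", clause_type="infinitive") holds, while an indirect-interrogative S or a PP is rejected.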

ILCLEX specifies the syntactic realisation of verb arguments at the level of subcategorisation frame. Each configuration of different syntactic realisations of arguments gives rise to a different subcategorisation frame. The list of syntactic categories follows: NP, PP, AP, ADVP, IND-THAT-CLAUSE (indicative that-clause), SUBJUNCT-THAT-CLAUSE (subjunctive that-clause), IND-WH-CLAUSE (indicative wh-clause), SUBJUNCT-WH-CLAUSE (subjunctive wh-clause), A-INF (infinitive clause introduced by the preposition a), DI-INF (infinitive clause introduced by the preposition di), BARE-INF (infinitive clause introduced by no preposition) and WH-INF (infinitive clause introduced by a wh-element). The reader is reminded that different subcategorisation frames are subsumed under the same pattern, i.e. verb entry.

A different approach is presented in Comlex, where syntactic category information, which is encoded at the level of constituent structure, is criterial for the definition of independent lexical entries. For example, two different constructions of the verb help (he helps to hang the laundry and he helps hang the laundry, respectively with and without to before the verbal complement) are grouped in two different verb classes (see the definition of frame group in Comlex) and, consequently, encoded as two separate entries.

The list of syntactic categories used in Comlex follows: NP, PP, POSS (for possessive NP), PREP (for preposition), PART (for adverbial particle), ADJP, ADVP, VP and S.

In Acquilex, which follows the Unification Categorial Grammar model, each sign (subcategorised signs included) is specified for its syntactic category, which can be either basic or complex. Basic categories are binary feature structures consisting of a category type, and a series of attribute value pairs encoding morphosyntactic information:

        [CAT-TYPE: cat-type
          M-FEATS: m-feats]

Three basic category types are used: `n' (noun), `np' (noun phrase) and `sent' (sentence). Morphosyntactic features are included only where needed. Complex categories are recursively defined by letting the type `cat' instantiate a feature structure with attributes RESult, DIRection and ACTive. RESult can take as value either a basic or complex category, ACTive is of type `sign', and the direction attribute encodes order of combination relative to the active part of the sign (e.g. forward or backward):

        [RES: cat
         DIR: dir
         ACT: sign]
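The recursive definition of categories can be mirrored by a small set of types. This is a simplified sketch under stated assumptions: the class names BasicCat, ComplexCat and Sign and the helper active_signs are hypothetical, a sign is reduced to its category alone, and the directions chosen for the transitive verb example are assumed rather than taken from the Acquilex lexicon.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

BASIC_TYPES = ("n", "np", "sent")


@dataclass(frozen=True)
class BasicCat:
    cat_type: str                               # one of BASIC_TYPES
    m_feats: Tuple[Tuple[str, str], ...] = ()   # only where needed

    def __post_init__(self) -> None:
        if self.cat_type not in BASIC_TYPES:
            raise ValueError(f"unknown basic category type: {self.cat_type}")


@dataclass(frozen=True)
class Sign:
    cat: "Category"   # a real sign carries more than its category


@dataclass(frozen=True)
class ComplexCat:
    res: "Category"   # RESult: basic or complex category
    dir: str          # DIRection: "forward" or "backward"
    act: Sign         # ACTive: the sign to combine with


Category = Union[BasicCat, ComplexCat]


def active_signs(cat: Category) -> List[Sign]:
    """Collect the active signs of a category, outermost first."""
    signs: List[Sign] = []
    while isinstance(cat, ComplexCat):
        signs.append(cat.act)
        cat = cat.res
    return signs


# Assumed category for a transitive verb: combines with an np to its right,
# then with an np to its left, yielding a sentence.
np_sign = Sign(BasicCat("np"))
transitive = ComplexCat(
    res=ComplexCat(res=BasicCat("sent"), dir="backward", act=np_sign),
    dir="forward",
    act=np_sign,
)
```

Walking the RESult chain in this way gives one active sign per subcategorised element, so len(active_signs(transitive)) is 2 for the example category.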

In LDOCE, the grammar codes expressing the verb subcategorisation information implicitly include specifications on the syntactic realisation of its complements which can easily be derived by automatic procedures. See the contrast, for instance, between the code T1, used to mark transitive verbs with an NP object, and the code T3 used for transitive verbs with an infinitival clause as object.

PLNLP conveys information on the syntactic category of complements either through use of grammar codes which are assumed to be interpreted by default (e.g. transitive verbs are normally expected to take two NP arguments) or by explicit encoding of category information when the syntactic realisation of complements deviates from what is assumed by default (e.g. with transitive verbs taking an infinitival clause as object, the relevant subcategorisation features are TRANS and INFPCOMP).
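The default-plus-override scheme described for PLNLP can be sketched as follows. Only the feature names TRANS and INFPCOMP come from the text above; the tables, the function complement_categories, the category label INF-CLAUSE and the assumption that INFPCOMP affects the last (object) position are hypothetical.

```python
from typing import Dict, Iterable, List

# Default category list implied by a grammar code (subject first).
DEFAULT_FRAMES: Dict[str, List[str]] = {"TRANS": ["NP", "NP"]}

# Features that override the default category of the object position.
# "INF-CLAUSE" is an illustrative label, not a PLNLP category name.
OBJECT_OVERRIDES: Dict[str, str] = {"INFPCOMP": "INF-CLAUSE"}


def complement_categories(features: Iterable[str]) -> List[str]:
    """Derive argument categories from grammar codes: start from the
    default reading of the code, then apply any explicit override."""
    features = list(features)
    frame = None
    for feature in features:
        if feature in DEFAULT_FRAMES:
            frame = list(DEFAULT_FRAMES[feature])  # copy, do not mutate table
            break
    if frame is None:
        raise ValueError("no grammar code with a default frame was given")
    for feature in features:
        if feature in OBJECT_OVERRIDES:
            frame[-1] = OBJECT_OVERRIDES[feature]  # assumed: object is last
    return frame
```

Here complement_categories(["TRANS"]) gives the default two NP arguments, while adding INFPCOMP replaces the object category only.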

Argument functional role

Argument functional roles (or grammatical functions) are always indicated either explicitly (Genelex, Comlex, Eurotra, ILCLEX) or implicitly (PLNLP, LDOCE, Acquilex).

In Genelex, functional roles are assigned with reference to subcategorised positions (complements). Genelex suggests a list of basic functions, which may nonetheless be modified or integrated by the lexicographer.

In Comlex, functional roles are specified at the level of grammatical structure. The linking to the corresponding constituent structure is expressed by means of coindexation. Available grammatical relations are: Subject, Obj, Obj2, Comp, Prep (for bare prepositions) and Part (for bare particles). Note that the subject role is always indicated at the level of grammatical structure; this is in contrast with the fact (already mentioned above) that the subject constituent is given no explicit mention in the constituent structure unless it is morphosyntactically constrained. It is worth noting here that, unlike Genelex, Comlex grammatical functions may vary depending on the syntactic realisation of the complements (e.g. NP objects and clausal objects are not assigned the same functional label, as sentential objects are assigned a distinct role, i.e. Comp).

Eurotra treats information about the grammatical function of complements at the level of the relational structure. Nine grammatical functions are defined at this level: Subject, Direct Object, Indirect Object, By-agent (in passive constructions), Attribute to subject (with attributive verbs), Attribute to object (with transitive-attributive verbs), Strongly bound PP, Weakly bound PP and Sentential complement.

In ILCLEX, functional roles are encoded at the pattern level. The following set of grammatical functions is provided: subject, object, predicative-subject, predicative-object, dative, concern (covering different instances of the genitive case), locative (locative complements), mod (adverbs or adverbial locutions) and lastly instrument (instrumental complement).

As for LDOCE and PLNLP, functional role information, like information about syntactic category, is to be inferred from verb grammar codes.

In Acquilex, functional roles are not encoded as primitive notions, but are nonetheless indirectly expressed in terms of position in category structure according to the `obliqueness hierarchy'.

Control and raising phenomena

This information is expressed in all lexicons, although some of them make no explicit distinction between control and raising. Genelex, Acquilex and PLNLP use the term control with reference to both equi and raising verbs, e.g. John wants to leave and John seems to be happy. No special category types expressing the equi/raising distinction are deemed necessary, since the contrast is located only at the semantic level, where Acquilex makes provision for a fine-grained set of verb semantic classes: the equi verb class is the class of verbs where the controller NP is assigned a thematic role, whereas the raising verb class is the class of verbs in which the controller NP is non-thematic. By contrast, Comlex, Eurotra, ILCLEX and LDOCE encode control and raising as two separate phenomena.

Regardless of whether the equi/raising distinction is made or not, control information is encoded in two different ways:

Through coindexation of the controller NP with (the subject of) the infinitival clause (Genelex and Acquilex);
By means of feature specification (PLNLP, ILCLEX, LDOCE and Comlex).

In Eurotra, information about the control properties of verbs is encoded in the relational-level dictionary by means of a feature specifying the syntactic function of the controller element: subject, direct object or indirect object. The coindexation between this element and the empty subject of the infinitival clause is then performed by grammar rules.
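The division of labour between a lexical control feature and a coindexing rule can be sketched as below. This is a hypothetical illustration in the spirit of the feature-based encoding, not the Eurotra formalism: the names ControlVerb and coindex_empty_subject are invented, and the entry for persuade (object control) is an added example that does not appear in the text.

```python
from dataclasses import dataclass
from typing import Dict

CONTROLLER_FUNCTIONS = ("subject", "direct object", "indirect object")


@dataclass(frozen=True)
class ControlVerb:
    lemma: str
    controller: str   # syntactic function of the controller element


def coindex_empty_subject(verb: ControlVerb, indices: Dict[str, int]) -> int:
    """Grammar-rule step: return the index that the empty subject of the
    infinitival clause shares with the controller named in the lexicon.

    `indices` maps the syntactic functions realised in the sentence to
    their referential indices.
    """
    if verb.controller not in CONTROLLER_FUNCTIONS:
        raise ValueError(f"not a possible controller: {verb.controller}")
    return indices[verb.controller]


# Hypothetical lexical entries.
want = ControlVerb("want", controller="subject")
persuade = ControlVerb("persuade", controller="direct object")
```

For a sentence whose subject carries index 1 and whose direct object carries index 2, the empty subject receives index 1 with want and index 2 with persuade. A coindexation-based lexicon would instead store the shared index directly in the entry.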

Lexical selection

By lexical selection we mean, for present purposes, the specific lexical requirements that a verb imposes on its subcategorised context, e.g. the selection of a particular preposition introducing a given complement. All the lexicons considered here express this type of information in one way or another, with varying degrees of granularity. Typical instances of lexical selection are: bound prepositions, particles, complementisers, impersonal subjects and clitics.

Bound prepositions and particles are specified in all lexicons through listing and/or, when possible, by indication of a relevant class of prepositions or particles (as in the case of locative prepositions). For those lexicons whose information is articulated over different levels of linguistic description, lexical selection is specified at the syntactic level: the level of constituent structure for Comlex and Eurotra, the CAT attribute of the verb sign in Acquilex, and the subcategorisation level in ILCLEX.

Complementisers are explicitly listed in Comlex and Genelex, while in the other lexicons they are referred to indirectly through argument classes (e.g. WH-complements and THAT-complements in PLNLP and ILCLEX).

Impersonal subjects are registered through direct reference in Genelex and in Comlex (at the level of constituent structure, where, as already pointed out above, lexically constrained subjects, such as it subjects, are obligatorily specified). The other lexicons instead make use of verb class codes (e.g. the label IMPERSONAL used in PLNLP refers to verbs which can only occur in an impersonal construction).

Genelex deals with clitics as a particular case of lexical selection. A language-specific case of lexical selection is auxiliary selection in those languages where this information is subject to lexically governed variation (e.g. Italian and French).

Morphosyntactic constraints

Under the heading of morphosyntactic constraints a wide variety of phenomena, both general and language-specific, is gathered. In what follows, we provide a (partial) list of the types of constraints encoded, with an indication of the lexicons which explicitly deal with them. Note that, given their rather heterogeneous nature, it is difficult to state with certainty whether they are taken into account in each lexicon and by what formal means. It may well be that this information is presupposed by a specific verb class (e.g. the feature passivisability is in some cases implied by the class of transitive verbs), in which case it is not mentioned in this section.

Passivisability: This is treated as a morphosyntactic constraint on the verb entry within Comlex and Eurotra (where it is dealt with at the constituent structure level). In other lexicons, passivisability is instead classified as a case of frame alternation (see the relevant section below).
Morphosyntactic restrictions on the arguments: These restrictions refer to morphosyntactic properties (e.g. number or gender) that arguments have to exhibit when co-occurring with a given verb. For instance, plural-only or singular-only subjects are specified in most of the lexicons taken into account. In this respect, the reader is reminded that in Comlex, where the subject constituent is usually omitted at the level of constituent structure, the subject is nonetheless overtly expressed there when and only when it is morphosyntactically constrained. In this case, constraints are specified immediately to its right in the constituent list.
Other kinds of morphosyntactic restriction: Constraints are introduced for encoding verbs which occur with a given complement structure provided that certain morphosyntactic conditions are met. The argument structure of these verbs is qualified through specification of some additional parameters: e.g. the fact that one can say she didn't realise whether... but cannot say *she realised whether... is captured in Comlex through the condition `neg t' (i.e. `negative = true').
Language-specific morphosyntactic constraints on the verb only: In Italian and Spanish there are verbs which in certain readings (most frequently, though not exclusively, intransitive ones) take an obligatory clitic pronoun marking the pronominal form. This is a lexically governed property which must be specified in any Spanish or Italian lexicon; ILCLEX (at the pattern level), PLNLP and Eurotra specify it.

Frame alternation and argument optionality

By frame alternation we mean, in this context, the possibility for a given verb to appear in at least two different subcategorisation frames which can be related to each other through a limited set of mapping relations. Well-known instances of frame alternation are the causative/inchoative alternation (as between Kim broke the glass and The glass broke), the indefinite object alternation (as in John ate a sandwich and John ate) and the locative alternation (as between John loaded hay on the cart and John loaded the cart with hay).

The lexicons taken into account for our purposes vary considerably in the way this information is encoded and in the range of phenomena treated under this rubric. To give an example, argument optionality is not always dealt with as a case of frame alternation.

In Eurotra, different subcategorisation frames of the same verb always give rise to distinct lexical entries. By contrast, many other lexicons express the relation between two (or more) frames explicitly. In Acquilex, only one member of the alternation is stored in the lexicon, while the others are derived from the stored entry through application of lexical rules. Constraints on the applicability of lexical rules are enforced through feature specification within the basic lexical entry. A similar approach is available in Genelex, which nonetheless also provides a second alternative: all alternating frames are listed on a par and the links relating one frame to the other are marked. This is also done in ILCLEX through pattern rules.

In Comlex, the notion of frame group is introduced to deal with frame alternation. A frame group is a family of alternating simple frames. For example, ditransitive verbs are assigned the same frame group, which encompasses both the `v np to np' frame (as in she gave a kiss to her mother) and the `v np np' frame (as in she gave her mother a kiss). This leads to a multiplication of verb classes which could in principle be avoided. On the other hand, the Comlex approach does not force the lexicon-writer to choose either frame as basic. LDOCE expresses frame alternation only implicitly through grammar code disjunction. The PLNLP approach is similar, with the main difference that verb senses are not distinguished.
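The contrast between storing one frame plus lexical rules and listing a whole frame group can be made concrete with the ditransitive frames quoted above. The sketch is deliberately schematic: frames are plain tuples of labels, and dative_shift, derive_frames and the licensing set are hypothetical names. In particular, treating the ditransitive alternation as a lexical rule is an assumption made for illustration, and real lexical rules operate on much richer structures.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

Frame = Tuple[str, ...]

# Frame-group style: alternating frames listed on a par, none is basic.
DITRANSITIVE_GROUP: FrozenSet[Frame] = frozenset(
    {("v", "np", "to", "np"), ("v", "np", "np")}
)


# Lexical-rule style: one stored frame, the other derived from it.
def dative_shift(frame: Frame) -> Optional[Frame]:
    """Map `v np to np' to `v np np'; return None if the rule does not apply."""
    if frame == ("v", "np", "to", "np"):
        return ("v", "np", "np")
    return None


def derive_frames(
    base: Frame,
    rules: Dict[str, Callable[[Frame], Optional[Frame]]],
    licensed: Set[str],
) -> List[Frame]:
    """Apply to the stored frame only those rules that the entry licenses
    (standing in for feature specification within the basic entry)."""
    frames = [base]
    for name, rule in rules.items():
        if name in licensed:
            derived = rule(base)
            if derived is not None:
                frames.append(derived)
    return frames


RULES = {"dative_shift": dative_shift}
stored = ("v", "np", "to", "np")
```

When the entry licenses dative_shift, the derived set of frames coincides with DITRANSITIVE_GROUP; the difference lies in which frame, if any, has to be designated as basic.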

Argument optionality is not always interpreted as a case of frame alternation. Genelex, for instance, conveys information about optional arguments in one single frame, by enclosing them in brackets. It also makes it possible to express constraints on the co-occurrence of optional and/or mandatory complements. Comlex takes a different approach: an object-deleting verb such as eat is assigned two separate subcategorisation frames (i.e. `v np' and `v'), which are in turn encoded in two distinct and unrelated lexical entries. In other lexicons (e.g. ILCLEX) strict obligatoriness of complements, rather than optionality, is marked.

Passivisability is treated as an instance of frame alternation in Acquilex and Genelex. Yet, this is not always the case: as already pointed out above, passivisability can be seen as a kind of morphosyntactic constraint or as an inherent property of transitive verbs.

Theta structure and deep structure

Only two of the lexicons considered in the present survey encode information at this level of linguistic description, which lies at the boundary between syntax and semantics: Acquilex and Eurotra. In Acquilex, following Dowty, thematic relations are expressed as cluster concepts determined for each choice of predicate through attribution of selected entailments (e.g. volitional involvement, sentience, movement, etc.) which qualify the relative agentive strength and affectedness of event participants. Only three thematic role-like concepts are recognised: proto-agent, proto-patient and prep (the last one being added by Sanfilippo within the Acquilex framework). In Eurotra, this information is encoded at the level of interface structure. The argument roles used to describe the deep structure are: external argument, internal argument, second participant, second entity, attribute of the subject, attribute of the object, dative perceiver (with raising verbs) and adjuncts (measure, associated, place, origin, goal).

Other properties

So far, we have presented aspects of lexical representation that are shared by more than one lexicon. Some other pieces of lexical information are provided by one lexicon only. To give but one example, Eurotra encodes aspectual information as well as selectional restrictions on verb arguments. Such properties are intentionally left out here, as they are not amenable to an inter-lexicon comparison.
