
Lexicons for Machine-Translation

In this section we discuss several computerized lexicons that have been developed for Machine Translation applications: Eurotra, CAT-2, Metal, Logos and Systran (see §4.1). They are highly formalized compared with traditional dictionaries, but their information is structured specifically to solve translation problems.

Eurotra Lexical Resources

Eurotra is a transfer-based, syntax-driven MT system which deals with 9 languages (Danish, Dutch, German, Greek, English, French, Italian, Spanish and Portuguese). Monolingual and bilingual lexical resources were developed for all the languages involved, with similar size and coverage for each.

As an illustration, Table 3.10 gives figures for Spanish only.

Table 3.10: EUROTRA - Spanish

	All PoS	Nouns	Verbs	Adjectives	Adverbs	Other
Number of Entries	2193	941	444	465	269	yes
Number of Senses	2881	1322	740	396	270	yes
Morpho-Syntax	yes
Synsets	no
Sense Indicators	yes
- Indicator Types	1
Semantic Network	no
Semantic Features	yes
- Features	41
Multilingual Relations	yes
Argument Structure	yes
Selection Restrictions	yes
Domain Labels	no
Register Labels	no

Eurotra dictionaries are organized according to a number of levels of representation: Eurotra Morphological Structure (EMS), Eurotra Constituent Structure (ECS), Eurotra Relational Structure (ERS) and Interface Structure (IS). The IS is the basis for transfer, and although it reflects deep syntactic information, it is also the level where semantic information is present.

The Eurotra IS level is an elaboration of dependency systems in that every phrase is made up of a governor optionally followed by dependants of two types: arguments and modifiers. Arguments of a given governor are encoded in the lexicon. The relations between governors and their arguments are not explicitly stated. The set of arguments is:


arg1:       subject (experiencer/causer/agent)

arg2:       object (patient/theme/experiencer)

arg_2P:	    2nd participant (goal/receiver (non-theme))

arg_2E:	    2nd entity (goal/origin/place (non-theme))

arg_AS:	    secondary stative predication on subject		

arg_AO:	    secondary stative predication on object	

arg_Pe:	    dative perceiver with raising predicates

arg_ORIGIN: oblique		

arg_GOAL:   oblique

arg_MANNER: oblique

Not all labels have the same theoretical status, nor do they correspond to the same depth of analysis. Thus:

Sometimes ERS and IS functions express identical grammatical relations. This is the case for the subject and object attributes (arg_AS and arg_AO respectively).
Sometimes IS functions neutralize surface variation, so that different ERS syntactic functions map to the same IS argument. Thus arg2 includes NPs, VPs, SCOMPs and PPs (as in 'John wants bananas' (NP), 'John wants to go' (VP), 'John said that ...' (SCOMP) and 'John believes in God' (PP)). Likewise, the 'dative shift alternation' has only one IS representation:
John told Mary (arg_2P) a story (arg2)
John told a story (arg2) to Mary (arg_2P)
Sometimes IS roles establish more fine-grained distinctions than ERS functions. Thus, unergative and unaccusative verbs share the same ERS frame but differ at IS, where the subject is projected into arg1 for the former and into arg2 for the latter (e.g., John (arg2) arrived). Adjuncts are also further specified semantically (origin, goal, manner, etc.).
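The neutralization of the dative shift alternation can be illustrated with a minimal Python sketch. It is not Eurotra code: the surface function names (subject, direct_object, indirect_object, to_pp) are assumptions made for the illustration, and only the IS labels arg1, arg2 and arg_2P come from the argument list above.

```python
# Hypothetical surface (ERS-like) function names; only the IS labels
# arg1, arg2 and arg_2P are taken from the argument inventory above.
def to_is(surface):
    """Map a list of (surface_function, phrase) pairs to IS arguments."""
    is_args = {}
    for function, phrase in surface:
        if function == "subject":
            is_args["arg1"] = phrase
        elif function == "direct_object":
            is_args["arg2"] = phrase
        elif function in ("indirect_object", "to_pp"):
            # Both realizations of the recipient collapse into arg_2P.
            is_args["arg_2P"] = phrase
    return is_args


# "John told Mary a story" and "John told a story to Mary"
shifted = [("subject", "John"), ("indirect_object", "Mary"),
           ("direct_object", "a story")]
unshifted = [("subject", "John"), ("direct_object", "a story"),
             ("to_pp", "Mary")]

same_is = to_is(shifted) == to_is(unshifted)
print(same_is)
```

Both surface orders yield the same dictionary of IS arguments, which is the sense in which IS abstracts away from the alternation.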

Essentially, the semantic information encoded in the Eurotra dictionaries is used for disambiguation. This covers (i) structural ambiguity (i.e. the argument/modifier distinction, especially in the case of PP-attachment) and (ii) lexical ambiguity in lexical transfer, that is, collocations (restricted to support verb constructions), homonymy and polysemy (this is further explained in §4.1.4).

All information is encoded as Feature-Value pairs, in ASCII files. Here is an example:


absoluto_1 =

{cat=adj,e_lu=absoluto,e_isrno='1',e_isframe=arg1,e_pformarg2=nil,term='0'}.

The information encoded depends on the category. For all categories, the category (cat=), lemma (e_lu=) and reading number (e_isrno=) are encoded. For nouns and verbs, further information is given: the deep syntactic argument structure (e_isframe=) as explained above, the strongly bound prepositions required by the lexical item for its arguments (e_pformargX=), the selectional restrictions for all the arguments (semargX=) and the semantic type of the lexical item (sem=).
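To make the feature-value format concrete, here is a minimal Python sketch that reads one entry into a dictionary. The parsing rules (an entry name, an equals sign, a brace-delimited list of comma-separated attribute=value pairs, a final full stop, optional single quotes around values) are inferred from the example above and are an assumption, not a documented Eurotra specification; parse_entry is a hypothetical helper.

```python
import re


def parse_entry(text):
    """Parse 'name = {attr=value,...}.' into (name, dict of attributes).

    The syntax is inferred from the printed examples only.
    """
    match = re.match(r"\s*(\w+)\s*=\s*\{(.*)\}\s*\.\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("not a feature-value entry")
    name, body = match.groups()
    features = {}
    for pair in body.split(","):
        attribute, _, value = pair.strip().partition("=")
        # Values such as '1' are quoted in the source; drop the quotes.
        features[attribute] = value.strip("'")
    return name, features


entry = ("absoluto_1 =\n"
         "{cat=adj,e_lu=absoluto,e_isrno='1',e_isframe=arg1,"
         "e_pformarg2=nil,term='0'}.")
name, features = parse_entry(entry)
print(name, features["cat"], features["e_isframe"])
```

The sketch splits on commas, so it would need refinement if attribute values could themselves contain commas; none of the entries shown here do.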

The reading number refers to a meaning distinction that is usually also reflected in differences in the encoding of the other attributes. In the case of centro ("center"), the meaning distinction is recorded in the "sem" attribute: coordinate (coord) vs. place (lug). In addition, the "place" reading has no argument structure, while the "coord" reading can take one argument ("e_isframe=arg1"), which has to be "concrete" as opposed to an "abstract entity".


centro_1 =

{cat=n,e_lu=centro,e_isrno='1',e_gender=masc,person=third,nform=norm,

nclass=common,class=no,e_isframe=arg1,e_pformarg1=de,e_pformarg2=nil,

e_pformarg3=nil,sem=coord,semarg1=conc,semarg2=nil,semarg3=nil,

exig_mood=nil,e_predic=no,wh=no,whmor=none,e_morphsrce=simple,

term='2000000538'}.


centro_2 =

{cat=n,e_lu=centro,e_isrno='2',e_gender=masc,person=third,nform=norm,

nclass=common,class=no,e_isframe=arg0,e_pformarg1=nil,e_pformarg2=nil,

e_pformarg3=nil,sem=lug,semarg1=nil,semarg2=nil,semarg3=nil,

exig_mood=nil,e_predic=no,wh=no,whmor=none,e_morphsrce=simple,term='0'}.
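The way a reading distinction shows up in the other attributes can be checked mechanically. The following Python sketch compares abridged copies of the two centro entries above; the dictionaries contain only a subset of the printed attributes, and differing_attributes is a hypothetical helper written for this illustration.

```python
# Abridged copies of the two entries above (a subset of their attributes).
centro_1 = {"cat": "n", "e_lu": "centro", "e_isrno": "1",
            "e_gender": "masc", "e_isframe": "arg1", "e_pformarg1": "de",
            "sem": "coord", "semarg1": "conc", "term": "2000000538"}
centro_2 = {"cat": "n", "e_lu": "centro", "e_isrno": "2",
            "e_gender": "masc", "e_isframe": "arg0", "e_pformarg1": "nil",
            "sem": "lug", "semarg1": "nil", "term": "0"}


def differing_attributes(a, b):
    """Return {attribute: (value_in_a, value_in_b)} where the readings differ."""
    return {key: (a.get(key), b.get(key))
            for key in sorted(a.keys() | b.keys())
            if a.get(key) != b.get(key)}


diff = differing_attributes(centro_1, centro_2)
for attribute, (first, second) in diff.items():
    print(attribute, first, second)
```

Apart from the reading number and the terminological identifier, the differences are confined to the semantic type, the argument frame, the bound preposition and the selectional restriction on the first argument.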

Other strictly monolingual information encoded for nouns is: gender, person, type of noun ("nform" and "nclass"), whether the noun requires a specific verbal mood (exig_mood) when it introduces a subordinate clause, whether the noun is predicative ("e_predic"), information about relatives (wh and whmor), morphological derivation information (e_morphsrce, which refers to the morphological source, i.e. derived, ...) and a terminological identifier ("term").


basar_1 =

{cat=v,e_lu=basar,e_isrno='1',e_isframe=arg1_2_PLACE,e_pformarg1=nil,

e_pformarg2=nil,e_pformarg3=en,e_pformarg4=nil,p1type=nil,p2type=nil,

semarg1=anim,semarg2=ent,semarg3=ent,semarg4=nil,e_vtype=main,

vfeat=nstat,term='0',erg=yes}.

As noted above, information about strongly bound prepositions is encoded for all the arguments. If the verb's preposition is weakly bound, two features corresponding to two possible complements may refer to a class of prepositions such as "origin", "goal", etc. As with nouns, selectional restrictions are encoded, but there is no semantic typing of the verb itself. Specific monolingual information is encoded in the following attributes: "e_vtype" refers to the traditional main vs. auxiliary distinction, and "erg" marks ergative verbs. The aspectual characterization of the verb is encoded in "vfeat", with the possible values stative and non-stative.
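A rough Python sketch of how the selectional restrictions of basar_1 could be checked against candidate argument fillers is given below. The type hierarchy is a loud assumption: only the values anim, ent and conc appear in the entries above, while hum, abs and the parent links between the types are hypothetical and chosen purely for illustration.

```python
# Hypothetical hierarchy: each semantic type points to its parent type.
# Only anim, ent and conc occur in the entries above; the rest is assumed.
PARENT = {"hum": "anim", "anim": "conc", "conc": "ent",
          "abs": "ent", "ent": None}

# Selectional restrictions copied from the basar_1 entry.
BASAR_1 = {"semarg1": "anim", "semarg2": "ent",
           "semarg3": "ent", "semarg4": "nil"}


def is_a(sem_type, required):
    """True if sem_type equals required or lies below it in PARENT."""
    while sem_type is not None:
        if sem_type == required:
            return True
        sem_type = PARENT[sem_type]
    return False


def satisfies(entry, fillers):
    """Check every filled slot against the restriction in the entry."""
    for slot, filler_type in fillers.items():
        required = entry[slot]
        # 'nil' is read here as 'no such argument', so any filler fails.
        if required == "nil" or not is_a(filler_type, required):
            return False
    return True


ok = satisfies(BASAR_1, {"semarg1": "hum", "semarg2": "abs", "semarg3": "abs"})
bad = satisfies(BASAR_1, {"semarg1": "abs", "semarg2": "abs", "semarg3": "abs"})
print(ok, bad)
```

Under these assumptions a human first argument is accepted because it falls under anim, whereas an abstract first argument is rejected.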

CAT-2 Lexical Resources

The CAT2 system, developed at IAI (Saarbruecken), is a direct descendant of Eurotra and was designed specifically for MT [Ste88], [Zel88], [Mes91]. The CAT2 system exploits linguistic information of different kinds: phrase structure, syntactic functional information and semantic information. The figures supplied in Table 3.11 provide an indication of size and coverage.

Table 3.11: CAT-2

	All PoS
Number of Entries	German 20000
Number of Entries	English 30000
Number of Entries	French 30000
Number of Senses	German 40000
Number of Senses	English 50000
Number of Senses	French 50000
Senses/Entry	German 2
Senses/Entry	English 1.6
Senses/Entry	French 1.25
Morpho-Syntax	yes
Synsets	no
Sense Indicators	no
Semantic Network	no
Semantic Features	yes
- Feature Types	60
Multilingual Relations	yes
Argument Structure	yes
- Semantic Roles	7
Semantic Frames	no
Selection Restrictions	yes
Domain Labels	yes
- Domain Tokens	5000
Register Labels	yes
- Register Tokens	4

Semantic information is essentially used for reducing syntactic ambiguity, disambiguation of lexical entries, semantic interpretation of prepositional phrases, support verb constructions, lexical transfer and calculation of tense and aspect.

An example of a verbal entry at the IS level is:


apply1  =

%% He applied the formula to the problem.

{lex=apply,part=nil,VOW}&

({slex=apply,head={VERB}}

;{slex=applying,head={VN_ING}}

;{slex=application,head={TION_R}}

;{slex=applicant,head={ANT_N}}

;{slex=applier,head={VN_AGENT}}

;{slex=application,head={TION_A}}

;{slex=unapplicable,head={UNABLE}}

;{slex=applicable,head={ABLE}}

;{slex=applicable,head={ELL_ABLE}}

;{slex=appliability,head={ABILITY}})&

{sc={a={AGENT},b={THEME},c={GOAL,head={ehead={pf=to}}}},

 trans={de=({lex=applizieren};{lex=wenden,head={prf=an}}),fr={lex=appliquer}}}.
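The subcategorization and transfer parts of this entry can be mirrored in an ordinary data structure. The Python sketch below is an illustration only: it reads ";" in the entry as marking alternatives (an assumption about the notation), the key "role" and the helper names transfer_candidates and frame_roles are hypothetical, and the German and French lexemes are copied from the trans attribute above.

```python
# Hand-built mirror of the sc and trans attributes of the apply1 entry.
APPLY1 = {
    "lex": "apply",
    "sc": {
        "a": {"role": "AGENT"},
        "b": {"role": "THEME"},
        "c": {"role": "GOAL", "pf": "to"},  # goal introduced by 'to'
    },
    "trans": {
        "de": [{"lex": "applizieren"}, {"lex": "wenden", "prf": "an"}],
        "fr": [{"lex": "appliquer"}],
    },
}


def transfer_candidates(entry, target):
    """List the target lexemes recorded for one target language."""
    return [alternative["lex"] for alternative in entry["trans"].get(target, [])]


def frame_roles(entry):
    """List the semantic roles of the subcategorization frame in slot order."""
    return [slot["role"] for slot in entry["sc"].values()]


print(transfer_candidates(APPLY1, "de"))
print(transfer_candidates(APPLY1, "fr"))
print(frame_roles(APPLY1))
```

Choosing between the two German candidates would require the disambiguating information discussed above; the sketch only enumerates them.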

METAL Lexical Resources

Metal is a commercial MT system offered for the following language pairs: English-German, English-Spanish, German-English, German-Spanish, Dutch-French, French-Dutch, French-English and German-French. It provides monolingual and transfer lexicons of up to 200,000 entries per language pair, as indicated in Table 3.12. Terms are coded for morphological, syntactic and semantic patterns, including selectional restrictions. Metal also offers a sophisticated hierarchy of subject-area codes.

Table 3.12: METAL

	All PoS
Number of Entries	200,000
Number of Senses
Senses/Entry
Morpho-Syntax	yes
Synsets	no
Sense Indicators	no
Semantic Network	no
Semantic Features	yes
- Feature Tokens	15/14
Multilingual Relations	yes
Argument Structure	yes
Selection Restrictions	yes
Domain Labels	yes
Register Labels	yes

Argument Structure

A verbal frame consists of a list of roles. A role consists of a role identifier and a description of its possible syntactic fillers. These are somewhat surface-oriented, and role mapping in transfer is therefore performed explicitly.

Possible role values are:


$SUBJ	deep subject

$DOBJ	deep object

$IOBJ	the affected 

$POBJ	prepositional object

$SOBJ	sentential object

$SCOMP	attribute of subject

$OCOMP	attribute of object

$LOC	locative

$TMP	temporal

$MEA	measure

$MAN	manner
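The description of a verbal frame as a list of roles, each pairing a role identifier with its possible syntactic fillers, can be sketched as follows in Python. This is not METAL's internal format: the Role class, the filler categories ("NP", "PP") and the example frame are hypothetical, and only the role identifiers are taken from the list above.

```python
from dataclasses import dataclass

# Role identifiers as listed above.
ROLE_IDS = {"$SUBJ", "$DOBJ", "$IOBJ", "$POBJ", "$SOBJ", "$SCOMP",
            "$OCOMP", "$LOC", "$TMP", "$MEA", "$MAN"}


@dataclass(frozen=True)
class Role:
    identifier: str
    fillers: tuple  # hypothetical filler descriptions, e.g. ("NP", "PP")

    def __post_init__(self):
        if self.identifier not in ROLE_IDS:
            raise ValueError(f"unknown role: {self.identifier}")


# Hypothetical frame for a ditransitive verb.
frame = [Role("$SUBJ", ("NP",)),
         Role("$DOBJ", ("NP",)),
         Role("$IOBJ", ("NP", "PP"))]


def accepts(frame, identifier, category):
    """True if the frame has this role and allows the given filler category."""
    return any(role.identifier == identifier and category in role.fillers
               for role in frame)


print(accepts(frame, "$IOBJ", "PP"))
print(accepts(frame, "$DOBJ", "PP"))
```

Because the fillers are stated in surface terms, a transfer component built on such frames would need an explicit table relating source roles to target roles, as the text notes.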

Lexical semantic features

METAL has a restricted set of lexical semantic features which essentially deal with the (in)definiteness of NPs, tense and aspect. Semantic relations are handled through the assignment of syntactic roles.

Adjectives, nouns and adverbs are semantically classified. Semantic features (attribute/value pairs) include:

TA (Type of adjective): Age, Colour, Counting, Degree, Directional, Indefinite, Locative, Manner, Measurement, Origin, Sequential, Shape, Size and Temporal.
TYN (Semantic type for nouns): Abstract, Animal, Concrete, Body part, Human, Location, Material, Measure, Plant, Potent, Process, Semiotic system, Social institution, Temporal, Unit of measure.

Logos Lexical Resources

Logos is a commercial high-end MT system offered for the following language pairs: English-German, English-French, English-Spanish, English-Italian, English-Portuguese, German-English, German-French and German-Italian. The lexical resources contain approximately 50,000 entries for English as source language and 100,000 for German, plus an additional semantic rule database with approximately 15,000 rules for English source and 18,000 for German source, as indicated in Table 3.13.

Table 3.13: LOGOS

	All PoS
Number of Entries	English 50,000
Number of Entries	German 100,000
Morpho-Syntax	yes
Synsets	no
Sense Indicators	yes
Semantic Network	yes
Semantic Features	yes
Multilingual Relations	yes
Argument Structure	yes
- Semantic Roles	yes
Semantic Frames	yes
Selection Restrictions	yes
Domain Labels	yes
Register Labels	no

Logos is based on semantic analysis techniques using structural networks. It encodes Logos semantic types, which make it possible to define selectional restrictions based on syntactic patterns. The dictionaries are extensible (the Logos standard dictionary comprises 250 thematic dictionaries), and the system comes with an automatic lexicographic tool (Alex) and a semantic database (Semantha).

Systran Lexical Resources

Systran is a highly structured MT system whose translation process is based on repeated scanning of the terms in each sentence in order to establish acceptable relationships between forms. Using basic dictionaries, the system is able to define terms by analyzing morphemes (combining their grammatical, syntactic, semantic and prepositional composition).

It is a commercially available system offered with the following pairs:

English into French, German, Italian, Spanish, Polish and Dutch.
French into English, German, Italian and Dutch.

The figures supplied in Table 3.14 provide an indication of size and coverage.

Table 3.14: Systran

	All PoS
Number of Entries	English 95,000
Number of Entries	French 76,000
Number of Entries	German 135,000
Morpho-Syntax	yes
Synsets	no
Sense Indicators	no
Semantic Network	no
Semantic Features	yes
Multilingual Relations	yes
Argument Structure	yes
Domain Labels	yes
Register Labels	yes

Semantic Encoding

A first semantic encoding is used for syntactic parsing. Thus, verbs are marked as motion or direction verbs, and in turn encode the "semanto-syntactic" nature of their complements, for instance:

ABSUB = Verb normally takes an abstract subject
ANSUB = Verb normally takes an animate subject
COSUB = Verb normally takes a concrete subject
HUSUB = Verb normally takes a human subject

Nouns are marked too. The inventory of labels is:

CON = Concrete
ABS = Abstract
CT = Countable
MS = Mass
HU = Human
QUAN = Quantity
TP = Time Period
AN = Animate
AMB = Animate/inanimate ambiguity
GRP = Collective Noun

Adverbs are also characterized semantically:

TI = Time
PL = Place
MA = Manner
DEG = Degree
FREQ = Frequency
MODA = Modality
DIR = Direction
FUT = Future time

In addition, further semantic information is encoded as part of a complex expression. It is of two types: semantic primitives and terminology codes. The common attribute for both types is SEM.

Semantic primitives

These semantic categories are defined to give information on the concept behind the word. They are source-language-bound, and may be applied to any part of speech. Their main function is the selection of a proper target meaning. Semantic primitives are taxonomized, that is, each code is included in a tree structure called a taxon. There are 5 taxons:


     THINGS, PROCES, LOCATN, QUALITY, BEINGS

Each of the taxons is the root of a tree which branches off to a number of subordinate nodes. For instance:

THINGS: MATERL, PLANTS, INFORM, FDPROD, DEVICE
MATERL: CMBUST, CHELEM, CHCOMP
DEVICE: CONTNR, TRANSP
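The tree structure of a taxon can be sketched in Python using only the fragment listed above. The helper names path_to_taxon and subsumes are hypothetical, and the sketch makes no claim about how Systran stores or queries these codes.

```python
# Fragment of the THINGS taxon as listed above: parent -> subordinate nodes.
CHILDREN = {
    "THINGS": ["MATERL", "PLANTS", "INFORM", "FDPROD", "DEVICE"],
    "MATERL": ["CMBUST", "CHELEM", "CHCOMP"],
    "DEVICE": ["CONTNR", "TRANSP"],
}

# Invert the table so that each code points to its parent node.
PARENT = {child: parent
          for parent, kids in CHILDREN.items()
          for child in kids}


def path_to_taxon(code):
    """Return the chain of codes from the given code up to its root taxon."""
    path = [code]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def subsumes(general, specific):
    """True if 'general' is 'specific' itself or one of its ancestors."""
    return general in path_to_taxon(specific)


print(path_to_taxon("CMBUST"))
print(subsumes("THINGS", "TRANSP"), subsumes("MATERL", "TRANSP"))
```

A lookup of this kind is what allows a restriction stated on a general code such as THINGS to be met by a word carrying a more specific code such as TRANSP.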

Terminology codes

Terminology codes are defined to give information on the field a word is used in. They provide information of the source and correspond to the topical glossaries at target level. These codes are:

ADMIN = administrative
AGRIC = agriculture
BACTL = bacteriology
CHEMY = chemistry
CONST = construction
DPSCI = data processing
ECLNG = CEC language
ECONY = economy
JURIS = juridical
MEDIC = medical
MIBIO = microbiology
MILIT = military
TECHY = technical
TRADE = trade

Besides these terminological codes, subject field information is also supplied and is also related to the topical glossaries.

Comparison with Other Lexical Databases

Eurotra dictionaries cannot be considered a complete semantic database for use in real-world applications. However, they are valuable because the information they contain was agreed upon for all the languages involved in the project. Furthermore, the rich syntactic specification may serve as a good basis for integrating syntactic and semantic information. This is illustrated by the CAT-2 dictionaries, which have grown into a large multilingual database combining morphosyntactic information with translational information that encodes correspondences going beyond the word unit.

Relation to Notions of Lexical Semantics

Eurotra IS representation is a deep syntactic representation and not a semantic one. However, in a number of areas, attempts were made to provide an interlingual semantic solution to the translation problem. The areas which have been singled out for semantic analysis were those in which a morpho-syntactic approach proved to be insufficient to cope with the translation problems. These areas were mainly:

Tense and Aspect (§2.2)
Mood and Modality
Determination and Quantification (§2.7.4)
Negation
Aktionsart (§2.2)
Lexical Semantic Features (§2.7, 2.5.2)
Modification

Following Eurotra-D, CAT2 uses semantic relations as a basis for monolingual and bilingual disambiguation. In addition, the system suggests an extensive semantic encoding of nouns using hierarchical feature structures.

The semantic coding of nouns follows Cognitive Grammar principles [Zel88]. The semantic coding of argument roles follows Systemic Grammar [Ste88]. Support verb constructions follow the analysis of [Mes91].


EAGLES Central Secretariat eagles@ilc.cnr.it