Next: Experimental NLP lexicons Up: Lexical Semantic Resources Previous: Unified Medical Language System


Lexicons for Machine-Translation

In this section we discuss several computerized lexicons that have been developed for Machine Translation applications: Eurotra, Cat-2, Metal, Logos and Systran (see §4.1). They have a high degree of formalization as compared to traditional dictionaries but the information is specifically structured to solve translation problems.

Eurotra Lexical Resources

Eurotra is a transfer-based, syntax-driven MT system which deals with 9 languages (Danish, Dutch, German, Greek, English, French, Italian, Spanish and Portuguese). Monolingual and bilingual lexical resources were developed for all the languages involved, with similar size and coverage for each.

We will only supply figures for Spanish as an illustration in Table 3.10.

Table 3.10: EUROTRA - Spanish
                        All PoS  Nouns  Verbs  Adjectives  Adverbs  Other
Number of Entries          2193    941    444         465      269  yes
Number of Senses           2881   1322    740         396      270  yes
Morpho-Syntax               yes
Synsets                      no
Sense Indicators            yes
- Indicator Types             1
Semantic Network             no
Semantic Features           yes
- Features                   41
Multilingual Relations      yes
Argument Structure          yes
Selection Restrictions      yes
Domain Labels                no
Register Labels              no
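
The Other column of Table 3.10 shows "yes" where a count would be expected. As a rough sketch, and assuming that the All PoS figure is the sum of the five category columns (the table does not state this), the missing counts can be derived by subtraction. The function name `remainder` is only illustrative.

```python
# Figures copied from Table 3.10 (EUROTRA - Spanish).
entries = {"all": 2193, "nouns": 941, "verbs": 444, "adjectives": 465, "adverbs": 269}
senses = {"all": 2881, "nouns": 1322, "verbs": 740, "adjectives": 396, "adverbs": 270}

def remainder(row):
    """Total minus the four listed categories.

    Assumption: "all" is the sum of every category column, so the
    remainder is what the Other column would contain.
    """
    listed = sum(count for pos, count in row.items() if pos != "all")
    return row["all"] - listed

other_entries = remainder(entries)  # 2193 - 2119 = 74
other_senses = remainder(senses)    # 2881 - 2728 = 153
```

Under that assumption the Other column would hold 74 entries and 153 senses. These are derived values, not figures reported in the table.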

Eurotra dictionaries are organized according to a number of levels of representation: Eurotra Morphological Structure (EMS), Eurotra Constituent Structure (ECS), Eurotra Relational Structure (ERS) and Interface Structure (IS). The IS is the basis for transfer, and although it reflects deep syntactic information it is also the level where semantic information is present.

The Eurotra IS level is an elaboration of dependency systems in that every phrase is made up of a governor optionally followed by dependants of two types: arguments and modifiers. Arguments of a given governor are encoded in the lexicon. The relations between governors and their arguments are not explicitly stated. The set of arguments is:

arg1:       subject (experiencer/causer/agent)
arg2:       object (patient/theme/experiencer)
arg_2P:	    2nd participant (goal/receiver (non-theme))
arg_2E:	    2nd entity (goal/origin/place (non-theme))
arg_AS:	    secondary stative predication on subject		
arg_AO:	    secondary stative predication on object	
arg_Pe:	    dative perceiver with raising predicates
arg_ORIGIN: oblique		
arg_GOAL:   oblique
arg_MANNER: oblique
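
The governor/dependant organization described above can be sketched as a small data structure. This is a minimal illustration, not Eurotra's own representation: the class `ISPhrase`, its method names and the sample dependants are all hypothetical, and only the ten argument labels and their glosses are taken from the list above.

```python
from dataclasses import dataclass, field

# The ten argument labels listed above, with their glosses.
IS_ARGUMENT_LABELS = {
    "arg1": "subject (experiencer/causer/agent)",
    "arg2": "object (patient/theme/experiencer)",
    "arg_2P": "2nd participant (goal/receiver (non-theme))",
    "arg_2E": "2nd entity (goal/origin/place (non-theme))",
    "arg_AS": "secondary stative predication on subject",
    "arg_AO": "secondary stative predication on object",
    "arg_Pe": "dative perceiver with raising predicates",
    "arg_ORIGIN": "oblique",
    "arg_GOAL": "oblique",
    "arg_MANNER": "oblique",
}

@dataclass
class ISPhrase:
    """A governor with its two kinds of dependants (hypothetical class)."""
    governor: str
    arguments: dict = field(default_factory=dict)  # label -> dependant
    modifiers: list = field(default_factory=list)  # dependants without a label

    def add_argument(self, label: str, dependant: str) -> None:
        # Only labels from the inventory above are accepted as arguments.
        if label not in IS_ARGUMENT_LABELS:
            raise ValueError(f"unknown IS argument label: {label}")
        self.arguments[label] = dependant

    def add_modifier(self, dependant: str) -> None:
        self.modifiers.append(dependant)

# Illustrative phrase; the choice of dependants is an assumption.
phrase = ISPhrase("apply")
phrase.add_argument("arg1", "he")
phrase.add_argument("arg2", "the formula")
phrase.add_modifier("yesterday")
```

Keeping arguments in a label-keyed mapping and modifiers in a plain list mirrors the distinction in the text: arguments are licensed by the governor's lexical entry, modifiers are not.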

Not all labels have the same theoretical status nor correspond to the same level of depth in analysis. Thus,

Essentially, the semantic information encoded in E-dictionaries is used for disambiguation. This covers (i) structural ambiguity (i.e. the argument/modifier distinction, especially in the case of PP-attachment) and (ii) lexical ambiguity in lexical transfer, that is, collocations (restricted to verb support constructions), homonymy and polysemy (this is further explained in §4.1.4).

All information is encoded as Feature-Value pairs, in ASCII files. Here is an example:

absoluto_1 =

The information encoded depends on the category. For all categories, the category (cat=), lemma (e_lu=) and reading number (e_isrno=) are encoded. For nouns and verbs, the following is also encoded: deep syntactic argument structure (e_isframe) as explained above, strongly bound prepositions required by the lexical item for its arguments (e_pformargX), selectional restrictions for all the arguments (semargX=) and the semantic type of the lexical item (sem=).

The reading number refers to a meaning distinction that is usually also reflected in a difference in the encoding of the other attributes. In the case of centro (``center") the meaning distinction is recorded in the ``sem" attribute: coordinate (coord) vs. place (lug). In addition, the reading ``place" has no argument structure, while the reading ``coord" can have one argument (``e_isframe=arg1"), which has to be ``concrete" as opposed to ``abstract entity".

centro_1 =
centro_2 =
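
The two readings of centro can be approximated as feature-value mappings, based only on the prose description. This is a hedged sketch: the attribute names (cat, e_lu, e_isrno, e_isframe, sem, semarg1) come from the text, but the value spellings "n" and "concrete", the assignment of reading numbers to readings, and the helper `differing_attributes` are assumptions. It does not reproduce the original dictionary syntax.

```python
# Hypothetical rendering of the two readings of "centro".
# Assumptions: cat value "n", value spelling "concrete", and which
# reading carries which number. Attribute names are from the text.
centro_place = {
    "cat": "n",
    "e_lu": "centro",
    "e_isrno": "1",
    "sem": "lug",          # place reading: no argument structure
}
centro_coord = {
    "cat": "n",
    "e_lu": "centro",
    "e_isrno": "2",
    "sem": "coord",        # coordinate reading
    "e_isframe": "arg1",   # can take one argument
    "semarg1": "concrete", # selectional restriction on that argument
}

def differing_attributes(a, b):
    """Return {attribute: (value_in_a, value_in_b)} for attributes that differ.

    Attributes missing from one entry are reported with None.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

diff = differing_attributes(centro_place, centro_coord)
```

Here `diff` contains e_isrno, sem, e_isframe and semarg1, which illustrates how a difference in reading number goes together with differences in the other attributes.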

Other strictly monolingual information encoded for nouns is: gender, person, type of noun (``nform" and ``nclass"), whether it requires a specific verbal mood (exig_mood) when creating a subordinate clause, whether the noun is predicative (``e_predic"), information about relatives (wh and whmor), morphological derivation information (e_morphsrce, which refers to the morphological source, i.e. derivate...), and terminological identification (``term").

basar_1 =

As said before, information about strongly bound prepositions is encoded for all the arguments. Where the verb's preposition is weakly bound, 2 features corresponding to 2 possible complements may refer to a class of prepositions such as ``origin", ``goal", etc. As with nouns, selectional restrictions are encoded, but there is no semantic typing of the verb itself. Specific monolingual information is encoded in the following attributes: ``e_vtype" refers to the traditional main vs. auxiliary distinction, and ``erg" marks ergative verbs. The aspectual characterization of the verb is encoded in ``vfeat", with possible values stative and non-stative.

CAT-2 Lexical Resources

The CAT2 system, developed at IAI (Saarbruecken), is a direct descendant of Eurotra and was designed specifically for MT [Ste88], [Zel88], [Mes91]. The CAT2 system exploits linguistic information of different kinds: phrase structure, syntactic functional information and semantic information. The figures supplied in Table 3.11 provide an indication of size and coverage.

Table 3.11: CAT-2
  All PoS
Number of Entries German 20000
Number of Entries English 30000
Number of Entries French 30000
Number of Senses German 40000
Number of Senses English 50000
Number of Senses French 50000
Senses/Entry German 2
Senses/Entry English 1.6
Senses/Entry French 1.25
Morpho-Syntax yes
Synsets no
Sense Indicators no
Semantic Network no
Semantic Features yes
- Feature Types 60
Multilingual Relations yes
Argument Structure yes
- Semantic Roles 7
Semantic Frames no
Selection Restrictions yes
Domain Labels yes
- Domain Tokens 5000
Register Labels yes
- Register Tokens 4

Semantic information is essentially used for reducing syntactic ambiguity, disambiguation of lexical entries, semantic interpretation of prepositional phrases, support verb constructions, lexical transfer and calculation of tense and aspect.

An example of a verbal entry at the IS level is:

apply1  =
%% He applied the formula to the problem.

METAL Lexical Resources

Metal is a commercial MT system which is offered in English-German, English-Spanish, German-English, German-Spanish, Dutch-French, French-Dutch, French-English and German-French. It delivers monolingual and transfer system lexicons of up to 200,000 entries for a language pair, as indicated in Table 3.12. Terms are coded for morphological, syntactic, and semantic patterns, including specification of selectional restrictions. Metal offers a sophisticated subject-area code hierarchy.

Table 3.12: METAL
  All PoS
Number of Entries 200,000
Number of Senses  
Morpho-Syntax yes
Synsets no
Sense Indicators no
Semantic Network no
Semantic Features yes
- Feature Tokens 15/14
Multilingual Relations yes
Argument Structure yes
Selection Restrictions yes
Domain Labels yes
Register Labels yes

Argument Structure

A verbal frame consists of a list of roles. A role consists of a role identifier and a description of its possible syntactic fillers. These are somewhat surface-oriented, and role mapping in transfer is therefore performed explicitly.

Possible role values are:

$SUBJ	deep subject
$DOBJ	deep object
$IOBJ	the affected 
$POBJ	prepositional object
$SOBJ	sentential object
$SCOMP	attribute of subject
$OCOMP	attribute of object
$LOC	locative
$TMP	temporal
$MEA	measure
$MAN	manner
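
The frame structure and the explicit role mapping described above can be sketched as follows. This is not METAL's internal format: the class `Role`, the function `map_roles`, the filler names "NP" and "PP" and the sample frames are hypothetical. Only the role identifiers and their glosses are taken from the list above.

```python
from dataclasses import dataclass

# Role identifiers and glosses as listed above.
METAL_ROLES = {
    "$SUBJ": "deep subject",
    "$DOBJ": "deep object",
    "$IOBJ": "the affected",
    "$POBJ": "prepositional object",
    "$SOBJ": "sentential object",
    "$SCOMP": "attribute of subject",
    "$OCOMP": "attribute of object",
    "$LOC": "locative",
    "$TMP": "temporal",
    "$MEA": "measure",
    "$MAN": "manner",
}

@dataclass(frozen=True)
class Role:
    """A role identifier plus its possible syntactic fillers (hypothetical)."""
    identifier: str
    fillers: tuple  # filler names such as "NP" or "PP" are placeholders

    def __post_init__(self):
        if self.identifier not in METAL_ROLES:
            raise ValueError(f"unknown role identifier: {self.identifier}")

def map_roles(source_frame, target_frame, mapping):
    """Pair each source role with a target role via an explicit mapping table.

    Roles absent from `mapping` are assumed to map to the same identifier.
    """
    target_ids = {role.identifier for role in target_frame}
    pairs = {}
    for role in source_frame:
        target_id = mapping.get(role.identifier, role.identifier)
        if target_id not in target_ids:
            raise KeyError(f"no target role for {role.identifier}")
        pairs[role.identifier] = target_id
    return pairs

# Illustrative frames: a prepositional object in the source verb's frame
# corresponds to a deep object in the target verb's frame.
source = [Role("$SUBJ", ("NP",)), Role("$POBJ", ("PP",))]
target = [Role("$SUBJ", ("NP",)), Role("$DOBJ", ("NP",))]
pairs = map_roles(source, target, {"$POBJ": "$DOBJ"})
```

Because the fillers are surface-oriented, a source role cannot be assumed to line up with the same role in the target frame, which is why the mapping is passed in as an explicit table instead of being inferred.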

Lexical semantic features

METAL has a restricted set of lexical semantic features which essentially deal with (un)definiteness of NPs, Tense and Aspect. Semantic relations are treated under the syntactic assignment of syntactic roles.

Adjectives, nouns and adverbs are semantically classified. Semantic features (attribute/value pairs) include:

Logos Lexical Resources

Logos is a commercial high-end MT system which is offered in English-German, English-French, English-Spanish, English-Italian, English-Portuguese, German-English, German-French and German-Italian. Its lexical resources contain approximately 50,000 entries for English source and 100,000 for German source, plus an additional semantic rule database with approximately 15,000 rules for English source and 18,000 for German source, as indicated in Table 3.13.

Table 3.13: LOGOS
  All PoS
Number of Entries English 50,000
Number of Entries German 100,000
Morpho-Syntax yes
Synsets no
Sense Indicators yes
Semantic Network yes
Semantic Features yes
Multilingual Relations yes
Argument Structure yes
- Semantic Roles yes
Semantic Frames yes
Selection Restrictions yes
Domain Labels yes
Register Labels no

Logos is based on semantic analysis techniques using structural networks. Logos encodes Logos semantic types, which make it possible to define selectional restrictions based on syntactic patterns. Dictionaries are extensible (the Logos standard dictionary comprises 250 thematic dictionaries), and the system is supplied with an automatic lexicographic tool (Alex) and a semantic database (Semantha).

Systran Lexical Resources

Systran is a highly structured MT system whose translation process is based on repeated scanning of the terms in each sentence in order to establish acceptable relationships between forms. Using basic dictionaries, the system is able to define terms by analyzing morphemes (combining their grammatical, syntactic, semantic and prepositional composition).

It is a commercially available system offered with the following pairs:

The figures supplied in Table 3.14 provide an indication of size and coverage.

Table 3.14: Systran
  All PoS
Number of Entries English 95,000
Number of Entries French 76,000
Number of Entries German 135,000
Morpho-Syntax yes
Synsets no
Sense Indicators no
Semantic Network no
Semantic Features yes
Multilingual Relations yes
Argument Structure yes
Domain Labels yes
Register Labels yes

Semantic Encoding

A first semantic encoding is used for syntactic parsing. Thus, verbs are marked as motion or direction verbs, and in turn encode the ``semanto-syntactic" nature of their complements, for instance:

Nouns are marked too. The inventory of labels is:

Adverbs are also characterized semantically

Besides, more semantic information is also encoded as part of a complex expression. It comprises two types: semantic primitives and terminology codes. The common attribute for both types is SEM.

Semantic primitives

These semantic categories are defined to give information on the concept behind the word. They are source-language-bound, and may be applied to any part of speech. Their main function is the selection of a proper target meaning. Semantic primitives are taxonomized, that is: each code is included in a tree structure called taxon. There are 5 taxons:

Each of the taxons is the root of a tree which branches off to a number of subordinate nodes. For instance:

Terminology codes

Terminology codes are defined to give information on the field a word is used in. They provide information about the source and correspond to the topical glossaries at target level. These codes are:

Besides these terminology codes, subject field information is also supplied and is also related to the topical glossaries.

Comparison with Other Lexical Databases

Eurotra dictionaries cannot be considered a complete semantic database for use in real-world applications. However, they should be valued for the information they contain, which was agreed for all the languages involved in the project. Furthermore, the rich syntactic specification may serve as a good basis for integrating syntactic and semantic information. This is illustrated by the CAT-2 dictionaries, which have become a large multilingual database combining morphosyntactic information with translational information that encodes correspondences going beyond the word unit.

Relation to Notions of Lexical Semantics

Eurotra IS representation is a deep syntactic representation and not a semantic one. However, in a number of areas, attempts were made to provide an interlingual semantic solution to the translation problem. The areas which have been singled out for semantic analysis were those in which a morpho-syntactic approach proved to be insufficient to cope with the translation problems. These areas were mainly:

Following Eurotra-D, CAT2 uses semantic relations as a basis for monolingual and bilingual disambiguation. In addition, the system suggests an extensive semantic encoding of nouns using hierarchical feature structures.

The semantic coding of nouns follows Cognitive Grammar principles [Zel88]. The semantic coding of argument roles follows Systemic Grammar [Ste88]. Support verb constructions follow the analysis of [Mes91].

EAGLES Central Secretariat