Next: Information Retrieval Up: Areas of Application Previous: Areas of Application

Subsections

Machine Translation

Introduction

Our survey focusses on four types of systems:

Interlingual MT;
Knowledge-based MT (KBMT) ;
Language-based MT (LBMT), and
Example-based MT (EBMT).

Survey

Interlingual MT

As an illustrative example of this approach, we report on work done by Bonnie Dorr and the system PRINCITRAN [Dor93,Dor95a,Dor95b]. They use an automatic classification of verbs using a rich semantic typology based on Levin's work [Lev93]. They defined thematic grids based on a more fine-grained verbal classification. Lexical entries for SL and TL include thematic roles. Thematic roles are necessary for parsing, for building an interlingual semantic representation. For instance argument positions are associated with a wider variety of semantic roles, i.e. intransitives are not uniformly marked `ag' (agent), but may be marked `th' (theme), depending on the real logical role of the argument. Hypothesis used: semantically related verbs have similar, or identical thematic grids, i.e. verbs of the same semantic class have the same thematic specification. Dorr experimented with methods combining LDOCE codes (see §3.2) with Levin's verb classes.

KBMT

To be successful it requires a large amount of hand-coded lexical knowledge. The cost-effectiveness of this task can be alleviated by partial automation of the knowledge acquisition process for the build up of concept dictionaries, see [Lon95] with reference to the KANT system. Terminological coverage crucial. KANT is interlingua based. They identify a set of domain semantic roles related with prepositions, and then potentially ambiguous role assignment. So, the distinguished semantic roles are associated with specific syntactic patterns (these two steps are done manually). KBMT seeks to translate from the meaning of the source text derived with the help of a model of background knowledge (see [Car87]). The semantic analysis derived from the source string is augmented by information from the ontology and domain models. The success of KBMT depends on the following: having a good model of conceptual universals of the application domain across cultures and a linguistic model which can describe the mapping between linguistic form and meaning. The use of a model of background knowledge and the lack of a principled methodology for conceptual modelling present the major difficulty in this approach. Another way of using the background knowledge in MT is proposed by ATR's transfer-driven MT, where translation is performed at the linguistic level rather than the conceptual level as in KBMT.

LBMT

Language-based MT has two subtypes of systems: lexeme-oriented and grammar-oriented. The modern descendants of lexeme-oriented approaches assume that translation can be captured by lexical equivalence and units of translation are words and specific phrases (see [Tsu93]). The grammar-oriented approach considers the structural properties of the source text in disambiguation and translation but confines itself to intra-sentential structures. Its advantage over the lexeme approach can be seen in two respects: the concept of translation equivalence exists not only in terms of source strings but also in the structure of the source strings. The unit of translation is extended to the sentential scope. The problem with both types of LBMT is that they do not give much attention to the context of language use. The context can be textual, such as discourse organisation, social.

EBMT

Example-based machine translation proposes translation by examples collected from past translations. The examples are annotated with surface descriptions, e.g. specific strings, word class patterns, word dependencies (see [Sat92]), predicate frames (see [Jon96]) for the purpose of combining translated pieces. Transfer is effected on similarity in terms of these descriptions. The knowledge used for transfer is specific rather than general. The immediate criticism about specific knowledge representation is that its reliance on concrete cases limits the coverage of the knowledge model. EBMT builds its transfer on approximate reasoning i.e. there is no single correct translation, but degrees of acceptability among renditions. EBMT finds the translation of an s-string by its probability of translations calculated on the basis of a corpus of aligned bilingual texts. The best rendering of a source string is found by comparing it with examples of translation. The key words are compared by calculating their distance according to a thesaurus (see [Fur92a,Fur92b]).

Role of Lexical Semantics

Lexical ambiguity resolution is one of the main problems of MT. This is a problem for the analysis level and the transfer module. One way to tackle this problem is to take into account domain knowledge, i.e. a tree in computer science is an abstract structure rather than a large plant (thus translated as arborescence and not arbre in French. Nevertheless, even within the same domain, we encounter lexical ambiguities. Consider the following sentence from computer science: All of these expressions are typed. Does it mean that the expressions all have types or that they are all entered through the keyboard? We will examine further this issue in section 5.2.3.

Selectional restrictions are another way to resolving lexical ambiguity in MT. The role of lexical semantics is clear in this case for MT. That is words enter into certain syntactic relations only with words and phrases of particular semantic properties. For structural ambiguity resolution, we use subcategorization frames, semantic features and semantic roles. Semantic roles have been used in many MT systems who were influenced by a case grammar,valency, or dependency grammar formalism. For anaphora resolution, we use co-occurrence restrictions to see which antecedent is more plausible. Semantic features are used.

Interlingua systems have to distinguish between different concepts, thus the use of lexical semantics in terms of the use of conceptual structure is primordial. Interlingua systems have to refer to real world knowledge and to have a semantic representation fine-grained enough to capture every distinction in the language concerned. We refer the reader to the well known example from Japanese, where one has to distinguish between sixteen different 'WEAR' concepts, on the basis of knowledge about different types of apparel, social, cultural differences, etc. Semantic roles, such as deep-case roles may be provided in a logic formalism (loosely based on predicate logic). As an example, work by Alshawi et al. [Als89], use an interlingua representation based on Quasi-Logical Forms. Ontological information is used in EBMT, KBMT. In EBMT, as we discussed in the survey, the definition of appropriate similarity metrics, may be computed from the topology of a concept hierarchy in a thesaurus. EBMT uses examples annotated with word class patterns [Fur92a,Fur92b], predicate frames, word dependencies, for the purpose of combining translated pieces. In KBMT, they use background knowledge [Nir89] to translate the meaning of the source text. The semantic analysis derived is augmented by information from the ontology (of objects, event-types, relations, properties) and domain models. An example of lexical disambiguation in LBMT (Eurotra), follows in the subsection below.

Related Areas and Techniques

Terminological information for the construction of lexica and for lexical and structural ambiguity resolution is very important. Thus, techniques such a term extraction (see below) are very useful in all types of MT. EBMT paradigm relies on word clustering (similarity measures) techniques. EBMT and statistical-based MT use word sense disambiguation techniques, based on parallel aligned corpora.

Lexical Disambiguation in Eurotra

Eurotra's Interface Structure (IS) representation is not a semantic representation. However, in a number of areas, attempts were made to provide an interlingual semantic solution to the translation problem. Essentially the semantic information encoded in E-dictionaries is used for disambiguation purposes. These include (i) structural ambiguity, (ie. argument modifier distinction, essentially PP-attachtment) and (ii) lexical ambiguity in lexical transfer, that is collocations (restricted to verb support constructions) and polysemy.

Lexical ambiguity: Collocations (Verb Support)

Treatment of collocations within Eurotra is limited to the support verb constructions. Support verbs in Eurotra are associated to predicative nouns. A predicative noun is a noun which has an argument structure with, at least, a deep subject argument. A predicative noun can appear in a structure of the kind: Det-def N [SUBJ-GEN Na] (Prep Nb) (Prep Nc)

The SUBJ-GEN stands for the subjective genitive expressed in English by the preposition 'of' or as a preposed genitive:

: the attack of the enemies
: the enemies' attack

It is understood that for every predicative noun there is a non-empty set of of support verbs (Vsup) such that the following conditions hold:

the SUBJ-GEN of the predicative N is the subject of the correponding Vsup.

the enemies made an attack
The predicative noun may take a relative clause contaning the support verb:

the attack which the enemies made

Following [Gro81], in Eurotra the main function of the support verbs is to provide with tense, number and person information. Support verbs are understood to be semantically contentless. Also following Gross, in Eurotra support verbs are distinguished by means of their aktionsart. Here is the only case where a 'euroversal' semantic classification is followed:

: neutral: John HAS influence over Mary
: inchotive: John GAINS influence over Mary
: durative: John RETAINS influence over Mary
: iterative: John REPEATED his attacks against Mary
: terminative: John LOST influence over Mary

Support verb constructions are given a 'raising verb' representation. The entry for a predicative nouns such as 'influence' bears information about the kinds of support verbs it demands:


eg: 'influence':

{cat=n, predic=yes, is_frame=arg_12, pform_arg1=over, svneut=have,

svincho=gain, svdur=keep, avterm=lose, sviter=none}

Lexical Ambiguity: polysemy

Lexical Semantic Features (LSF) are present at IS because they are used to distinguish readings in analysis and in generation. There are two different approaches to LSFs, (i) Universal Semantic Feature theory, and (ii) Restricted Semantic Feature theory. The former provides with a set of universal hierarchically organized fine-grained features. The latter provides with a set of language specific features.

In the USF approach, attribute/value pairs must be assigned in identical way in all monolingual distionaries. Lexical transfer is performed from lexical unit to lexical unit with unification of all semantic features:


        eg: {lu=computer, human=no}     <=>     {lu=ordinateur, human=no}

            {lu=computer, human=yes}    <=>     {lu=elaboratore, human=yes}

Eurotra legislation follows the RSF approach. So, lexical semantic features are language specific. They are not part of the information that is transferred but they are an important part of the dictionary. Lexical semantic features are essentially monolingual and each language group is free to choose them. In Spanish, these include:


semarg1/2/3/4 = parti,med,tiem,coord,prop,cual,abt,rel,co_em,res,act,n_ag,pers,

      org,lug,no_hum,instr,masa,fu,nat,parte,todo,ent,sem,no_sem,abs,esc,

      no_parti, sit,est,no_prop,din,conc,anim,ag,agde,hum,no_lug,ind,

      no_anim,cont,art,nil.

There is one entry for every reading. Readings are distinguished by means of a identificational feature (isrno). Lexical transfer is performed from reading to reading by means of reading numbers. LSFs are not transferred. There may be different readings in all levels:


        at the ECS level:'walk'

        {lu=walk, ecsrno=1, cat=v...}

        {lu=walk, ecsrno=2, cat=n...}



        at the IS level: 'adopt'

        {lu=adopt, isrno=1, rsf_human_of_arg1=yes, rsf_human_of arg2=yes...}

        {lu=adopt, isrno=2, rsf_human_of_arg1=yes, rsf_human_of arg2=no...}

For the 'computer' example above, lexical transfer goes from reading to reading, despite the semantic features involved in each entry:


eg: {lu=computer, isrno=1, human=no}    <=> {lu=ordinateur, isrno=1, human=ho}

    {lu=computer, isrrno=2, human=yes}  <=> {lu=elaboratore, isrrno=1, human=yes}

A note on Eurotra-D:

Eurotra allows further subtyping IS with semantic relations (SR) [Ste88]. The 'Development of the EUROTRA-D System of Semantic relations' by Steiner, Eckret, Weck and Winter aims at automizing to a large extend the compilation of transfer dictionaries in order to reduce the work and to be more independent of bilingual knowledge. EUROTRA-D suggests a detailed monolingual classification of verbs according to a classification system based on syntactic and semantic criteria. The idea is that the resulting description of verbs makes automatic mapping from one lexeme in one language onto a lexeme in another language. The encoding of different readings of a verb is used to partly automize the compilation of transfer dictionaries.

Essentially, EUROTRA-D assumes four major semantic Predicate Types and a set of subtypes:

Relational: locational, associative, classificatory, identifying, existential
Mental:Precessor oriented, phenomenon oriented, two phenomena, phenomenon only
Communication: Sender+message, sender+receiver
Action: natural phenomenon, agent only, affected only,attribuant only, agent centered, affected centered

Each Predicate Type assigns a unique set of SRs to its subcategorized elements. Thus, for instance, the SRs assigned by Communication type predicates include: message, sender, receiver and promt.

SRs serve to distinguish readings, so two readings are different iff their associated sets of SRs are distinct in either, number, type or canonical order (the order of SRs is given by the order of their syntactic realization).

SRs are used for lexical disambiguation/transfer purposes. At IS representation, the ARG2 of move like verbs occur with different syntactic and semantic constituents:

: The director moved the firm
: The director moved to Europe
: The firm moved to Europe

For some languages, also the ARG1 has different semantic 'constituens':

: The director moved
: The firm moved

At IS level, we would have three readins which hardly distinguish examples above:

: move1= ...frame=ARG1,ARG2,ARG3
: move2= ...frame=ARG1,ARG2
: move3= ...frame=ARG1

Enriched IS representations allow to obtain a hihger number of readings:


move_1= frame=  ARG1(SR=3RDPARTYAGENT),ARG2(SR=ATTRIBUANT),

                ARG3=(SR=LOCATION)

                'the director moved the firm to Europe'



move_2= frame=  ARG1(SR=3RDPARTYAGENT),ARG2(SR=ATTRIBUANT),

                'the director moved the firm'



move_3= frame=  ARG1(SR=AGENT_ATTRIBUANT), ARG2(SR=LOCATION)

                'the director moved to Europe'



move_4= frame=  ARG1(SR=AFFECTED-ATTRIBUANT), ARG2(SR=LOCATION)

                'the firm moved to Europe'



move_5= frame=  ARG1(SR=AFFECTED-ATTRIBUANT)

                'the firm moved'

ET-10/75 Project: (collocations)

The ET-10/75 project centers on the use of Mel'cuk's Lexical Functions (LFs) as interlingua representations in MT dictionaries for a subset of collocations (adj-noun and verb-noun). The project is implemented in ALEP.

Collocations constitute an important challange for MT. There is no simple way of mapping collocations of one language onto equivalent expressions of another.

To cope with these mismatches we can (i) add information in bilingual dictionaries so that, for example, English 'pay' translates to Spanish 'prestar (eg, 'pay attention' 'prestar atenci=F3n). The problem here is that we have to specify in which contexts 'pay' and 'prestar' are equivalent. (ii) we can list collocations giving the corresponding expression in the target language. (iii) we can adopt an interlingua approach.

Following Mel'cuk, the ET-10/75 project suggests for a interlingua approach to collocations. Mel'cuk works in Meaning Text theory and provides with Extended Combinatory Dictionaries (Russian and French).

ET-10/75 has a wide notion of collocation. Collocations are compositional and semantically transparent (the meaning of the whole reflects the meaning of the parts), 'frequent', allow material in between (eg, 'pay little attention', cannot be syntactically or semantically predicted, may allow syntactic 'processes' (eg. passivization, extractions etc) ...

The kind of Collocations they deal about are (lexical collocations):

: N of N: flock of SHEEP
: Adj N: narrow SCOPE
: V N: commit MURDER

The syntactic relations in collocations involve head/modifier (N-of-N and Adj-N) or head/argument relations (V-N).

In both cases, the N is the base and selects the collocate. This means that for head/modifier collocations the head (N) selects the collocate/modifier and for head/argument collocations the argument (N) selects the head/collocate.

Lexical Functions (Mel'cuk)

LF are used to systematically describe certain semantic and collocational relations existing between lexemes. They apply at the deep syntactic level of the Meaning Text model.

A LF is a relation between an argument and a value, where the value is a linguistic expression (the 'collocate' in the examples above) which expresses the meaning which corresponds to the LF. (Mel'cuk suggests for 60 LFs).

LF can be 'semantically' classified:

: evaluative qualifiers: eg. Bon(cause)=worthy
: distributional quant: eg. Mult(sheep)=flock, Sing(rice)=grain
: involving preposition: eg. Loc(distance)=at
: involving V operators: eg. Oper(attention)=pay

LFs are further refined by using subscripts and superscripts and extended forming compound LFs.

ET-10/75 investigates to what extend LFs (the 'semantic' relations between base/collocates) 'follow' certain syntactic rules. It seems that for each kind of 'syntactic' collocation there is a specific set of LF so that certain LFs apply on certain collocations. They study whether certain LF only apply to particular syntactic categories and only outputs particular categories (for instance Loc outputs prepositions).

Arguments and values of LF are uninflected lexemes. (they stand out the advantage of dealing with lemmatized texts when looking for collocations and dealing with 'lemmatized' LFs in order to 'simplify' the system).

Translation of Collocations

When a collocation of one language corresponds to a collocation in the target language things are easy. Problems arise when there are mismatches:

lexicalization: collocation = single unit
ET-10/75 suggests for 'merged' LF: the application of a LF in one language corresponds to the application of its merged counterpart in the other (eg. Antibon(interpret)=misinterpret == Antibon(interpretieren)=falsch interpretieren).
non-default translation: one = many as in the English/Spanish examples:

bunch of grapes - racimo de uvas
bunch of bananas - pi=F1a de pl=E1tanos
bunch of key - manojo de llaves

In these cases a 'default' correspondance is not possible and pair correspondances are explicitly specified.
regular compound formation (eg, German resorts to compounds where English resorts to collocations). In ET-10/75 collocations and compounds are given the same semantic form. This allows approaching Collocation/Compounds mismatches are approached via 'merged' LF (eg, merged LF = non-merged LF.
collocation = non-collocation. No LF analysis is possible.
collocation = paraphrase mismatches are approached in terms of a 'Paraphrasing System'. The 'Paraphrasing System' which consists of: (i) a set of (60) lexical paraphrasing rules of the kind Causative = Decausative, contains = belongs ... so that when in one language there is no Causative constructions, decausative constructions is produced. (ii) syntactic paraphrasing rule: the ones which tale care of the 'changes' caused by lexical substitutions invoked by lexical PRs.

Next: Information Retrieval Up: Areas of Application Previous: Areas of Application

EAGLES Central Secretariat eagles@ilc.cnr.it