Bilingual Dictionaries

In the final section of this chapter, two traditional bilingual dictionaries are described that have been used in various projects to derive information for NLP or that are used directly in computerized tools. The Oxford-Hachette French Dictionary and the Van Dale Dutch-English dictionary merely illustrate the many other bilingual dictionaries that can be used in this way.

   
The bilingual Oxford Hachette French dictionary

The bilingual Oxford-Hachette French Dictionary (French-English) (OHFD) is intended for general use and is not specific to any domain. It includes most abbreviations, acronyms, many prefixes and suffixes, and some proper names of people and places in cases of unpredictable translation problems. It also includes semi-technical terms which might be found in standard texts, but omits highly domain-specific technical terms. It is designed to be used for production, translation, or comprehension, by native speakers of either English or French.
The electronic version of the OHFD is an SGML-tagged dictionary in which each element is tagged by function. For instance, there are tags to indicate part of speech, different meanings within a part of speech, pronunciation, usage, idiomatic expressions, domains, subject and object collocates, prepositions, etc. Unfortunately, translations are not tagged at all, which sometimes makes the dictionary difficult to parse.

Size

The English-French side has about 47539 entries (most compounds are entries in their own right), divided into 31061 nouns, 11089 adjectives, 5632 verbs, 2761 adverbs and 165 others. There are 407 top-level labels, a good part of which include a distinction between British English and American English. A label can be a single concept (Bio) or a combination of a concept and a language level (GB Jur).

The French-English side has 38944 entries (about 10,000 compounds, which are themselves part of entries), divided into 25415 nouns, 8399 adjectives, 4805 verbs, 1164 adverbs and 890 others. There are about 200 top-level labels.

Homographs

Homographs are represented in two ways:

Sense Counter

Monosemous words have no overt identifier of their single sense, as in:
mineralogy:...<hg><ps>n</ps></hg> minéralogie <gr>f</gr>.</se>
A sense counter is found only in the entries for polysemous words, as in:
migrant:...<s2 num=1><la>Sociol</la>... <s2 num=2><la>Zool</la>...
Senses are distinguished in considerable detail, although it should be remembered that in a bilingual dictionary the sense differentiation of the headword is often affected by target language (TL) equivalence. The original source language (SL) analysis of, for instance, the English word "column" would yield eight or nine senses, covering architectural, military and newspaper columns, as well as columns of figures and columns of smoke; with French as the TL, there is only one "sense" in the "column" entry, since every sense of the English word has the French equivalent "colonne".
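
The sense counter can be read off mechanically. The following is a minimal sketch, assuming that entries arrive as flat strings shaped like the two fragments quoted above; the function names are illustrative and not part of any OHFD tooling.

```python
import re
from typing import List

# Fragments copied from the entries quoted above; "..." marks elided material.
MINERALOGY = "mineralogy:...<hg><ps>n</ps></hg> minéralogie <gr>f</gr>.</se>"
MIGRANT = "migrant:...<s2 num=1><la>Sociol</la>... <s2 num=2><la>Zool</la>..."

# Matches the sense counter tag and captures its number.
SENSE_TAG = re.compile(r"<s2 num=(\d+)>")


def sense_numbers(entry: str) -> List[int]:
    """Return the explicit sense numbers; an empty list means no counter is present."""
    return [int(n) for n in SENSE_TAG.findall(entry)]


def sense_count(entry: str) -> int:
    """Treat an entry without any sense counter as having a single sense."""
    return max(len(sense_numbers(entry)), 1)
```

Under these assumptions, `sense_numbers(MIGRANT)` yields the two numbered senses, while the monosemous `MINERALOGY` fragment yields an empty list and is counted as one sense.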

Word Usage Labels

They include:

Cultural Equivalent

For certain culture-specific concepts, the source lemma does not have a direct semantic equivalent, but there is an analogous concept in the target language culture which serves as a translation, as in:
high school:...<la>US Sch</la> &appr. lycée <gr>m</gr>;...

Sense Indicators

The indicator may be a synonym or paraphrase in the form of a single word, as "information" or "data" in:
material:... (<ic>information, data</ic>) documentation...
or a phrase, as "become proficient in" in:
master: ...(<ic>learn, become proficient in or with</ic>) maîtriser <co>subject, language, controls, computers, theory, basics, complexities</co>;...
The OHFD also includes sense clue labels as additional information. The sense clue is usually a brief phrase, such as "of specified nature", "requiring solution" or "on agenda" in:
matter... <s2 num=1><la>gen</la> chose <gr>f</gr>; (<ic>of specified nature</ic>) affaire <gr>f</gr>; (<ic>requiring solution</ic>) problème <gr>m</gr>; (<ic>on agenda</ic>) point <gr>m</gr>;...
These may also be used to distinguish subsenses more finely within the same substituting indicator, like "in chess" and "in draughts" here:
 
man :...<s2 num=7><la>Games</la> (<ic>piece</ic>) (<ic>in chess</ic>) pièce <gr>f</gr>; (<ic>in draughts</ic>) pion <gr>m</gr></s2>; ...
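
Because translations carry no tag of their own, they have to be recovered from their position between tagged elements. The sketch below assumes that a translation is the untagged text between a parenthesised `<ic>` indicator and the following `<gr>` gender tag, as in the "matter" fragment quoted above; the pattern and function name are illustrative only and would need extending for real entries.

```python
import re
from typing import List, Tuple

# Sense 1 of "matter", copied from the fragment quoted above.
MATTER = (
    "<s2 num=1><la>gen</la> chose <gr>f</gr>; "
    "(<ic>of specified nature</ic>) affaire <gr>f</gr>; "
    "(<ic>requiring solution</ic>) problème <gr>m</gr>; "
    "(<ic>on agenda</ic>) point <gr>m</gr>;"
)

# (<ic>indicator</ic>) untagged-translation <gr>gender</gr>
INDICATED = re.compile(r"\(<ic>([^<]*)</ic>\)\s*([^<;(]+?)\s*<gr>(\w+)</gr>")


def indicator_translations(fragment: str) -> List[Tuple[str, str, str]]:
    """Return (sense indicator, translation, gender) triples found in a fragment."""
    return INDICATED.findall(fragment)
```

Note that the first translation, "chose", is introduced by a `<la>` label rather than an `<ic>` indicator, so this pattern deliberately does not return it.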

Subcategorisation

It includes indications of prepositional usage and of subject and object collocates for semantic complementation. The OHFD attempts to show all the structures necessary if the full semantic potential of the translation equivalent(s) is to be expressed grammatically. Examples of these structures ("through", "au moyen de") are to be seen in:
mediate:...diffuser <co>idea, cult</co> (<pp><sp>through</sp> au moyen de, par</pp>)...
Similar to this is information relating to the obligatory grammatical environment of the translation word, such as "(+ subj)" in:
marvel:... <ls>to &hw. that</ls> s'étonner de ce que (<gr>+ subj</gr>)...

Collocators

The type of collocator which may be offered depends on the word class of the headword; in the print dictionary the following types of collocators are used (the relationship is of course with one lexical unit, i.e. a single combination of lexical component and semantic component, a monosemous word or a polysemous word in one of its senses):

Collocations

Often, even within a single semantic sense, the lemma will translate differently depending on words which appear with it. For example, in:
accident: ...[<lc>figures</lc>, <lc>statistics</lc>] se rapportant aux accidents; [<lc>protection</lc>] contre les accidents;...
the lemma should be translated as "se rapportant aux accidents" when it appears with "statistics", but as "contre les accidents" when it appears with "protection".
Collocations (tagged by <lc> in the OHFD) should not be confused with compounds (multi-word lexemes, tagged <cw>), which include and translate the co-occurring words as part of the lemma, or with collocators (tagged <co>), which help to identify distinct senses of the lemma.
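
The way a collocation selects a translation can be sketched as a lookup table. This is a minimal sketch, assuming that each bracketed group of `<lc>` items is followed by its translation and a semicolon, as in the "accident" fragment quoted above; the names are illustrative.

```python
import re
from typing import Dict

# Copied from the "accident" entry quoted above; "..." marks elided material.
ACCIDENT = (
    "accident: ...[<lc>figures</lc>, <lc>statistics</lc>] se rapportant aux accidents; "
    "[<lc>protection</lc>] contre les accidents;..."
)

# A bracketed list of <lc> items, then the untagged translation up to the next ";".
GROUP = re.compile(r"\[((?:<lc>[^<]*</lc>(?:,\s*)?)+)\]\s*([^;\[]+);")
ITEM = re.compile(r"<lc>([^<]*)</lc>")


def collocation_table(fragment: str) -> Dict[str, str]:
    """Map each collocating word to the translation of the lemma used with it."""
    table: Dict[str, str] = {}
    for items, translation in GROUP.findall(fragment):
        for word in ITEM.findall(items):
            table[word] = translation.strip()
    return table
```

With this table, looking up "statistics" and "protection" gives the two different renderings of the lemma described above.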

Multi-Word Lexeme

Multi-word lexical units occurring as sublemma forms may generate almost all the lexicographic items that normally constitute a full dictionary entry, with the exception of a phonetic transcription. The three principal types of multi-word sublemmas are:
(a) compounds, as in:
mud:...<cw>mudbank</cw>... banc <gr>m</gr> de vase; ...<cw>mud bath</cw>...(<ic>for person, animal</ic>) bain <gr>m</gr>de boue;...
(b) phrasal verbs, as in:
miss:...<pvp><lp>&hw. out</lp> être lésé; ...
(c) idioms, as in:
miss:...<id><li>to &hw. the boat <u>ou</u> bus</li>&coll. rater le coche</id>;...
Multi-word lexemes may range from fixed phrases (e.g. "by and large") through idiomatic verb phrases ("raining cats and dogs") to whole sentences, such as proverbs or phatic phrases ("Have a nice day").

Gloss

This is given when there is no direct TL equivalent of the lemma, as in:
mid-terrace:...[<lc>house</lc>, <lc>property</lc>] <gl>situé au milieu d'un alignement de maisons identiques et contiguës</gl>...

  
Van Dale Bilingual Dutch-English

The Van Dale Bilingual Dictionaries are developed for native Dutch speakers. This means that the resources contain only very limited information on the Dutch words and much more information on the foreign-language target words. The Dutch-English dictionary is described here [Mar86].

The Van Dale Dutch-English is a traditional bilingual dictionary. It is structured using a tagged field structure, which makes it relatively easy to extract a computer-tractable database from it. However, the scope of the fields is not always explicit, and the values within the fields are often undifferentiated and consist of free text.

The entry structure is homograph-based, but homographs are distinguished only when the part of speech and/or the pronunciation differs. Sub-homographs are used when senses differ in major grammatical properties such as valency, countability, or predicative/attributive usage. The figures supplied in Table 3.17 provide an indication of size and coverage.

 
Table 3.17: Number of Entries, Senses and Translations in the Van Dale Dutch-English Dictionary
Entries 90925
Homographs 2967
Sub-homographs 6769
Senses 127024
Main Translations 145511
Secondary Translations 104181
Examples 111226
 

In addition to some grammatical information on the Dutch words and the English translations, the dictionary contains a large amount of semantic information restricting the senses and the translations:

Sense-indicators
(53368 tokens) specify the Dutch senses of polysemous entries. They contain fragments of the original definitions (often a genus word).
Biological gender marker
for English translations. It is needed to differentiate translations when the source and target languages have different words for the male and female of a species: 286 translations are labeled as male and 407 as female.
Usage labels for domain, style and register
apply to both Dutch senses and their English translations.
Dialect labels
for Dutch senses and their English translations.
Context markers
(23723 tokens, 16482 types). These are semantic constraints that differentiate the contexts of multiple translations and limit the scope of translations that have a narrower context than the Dutch source sense.

The usage labels and the domain labels are mostly stored in the same field, so they have to be separated by parsing. The usage labels form a limited, closed set of abbreviations and codes, whereas the domain labels are free text. For the main translations, about 400 different types of values occur.
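
Since the usage labels form a closed set, one plausible way to separate the two kinds of label is a simple membership test. The sketch below is an assumption throughout: the abbreviations in `USAGE_LABELS`, the comma separator and the sample field value are hypothetical placeholders, as the actual Van Dale codes and field syntax are not given here.

```python
from typing import List, Tuple

# HYPOTHETICAL closed set of usage abbreviations; the real Van Dale codes differ.
USAGE_LABELS = {"inf.", "form.", "vulg."}


def split_labels(field: str, sep: str = ",") -> Tuple[List[str], List[str]]:
    """Split a shared label field into (usage labels, domain labels).

    Anything found in the closed set counts as a usage label; every other
    non-empty value is treated as a free-text domain label.
    """
    usage: List[str] = []
    domain: List[str] = []
    for raw in field.split(sep):
        label = raw.strip()
        if not label:
            continue
        (usage if label in USAGE_LABELS else domain).append(label)
    return usage, domain
```

For a hypothetical field value such as `"inf., scheepvaart"`, the function would place "inf." with the usage labels and "scheepvaart" with the domain labels.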

The context markers and sense-indicators are in textual form (Dutch). Their interpretation ranges over domain labels, selection restrictions, semantic classifications and semantic properties. The difference in interpretation is, however, not coded but can partially be inferred, e.g.:

Finally, a great deal of collocational information is stored in the examples and their translations. The typical combination words are marked and distinguished by part of speech. If the combination is compositional, the corresponding meaning of the entry is given; an idiosyncratic collocation is marked as such. The examples and their translations can be seen as partially structured context specifications for the Dutch and English word pairs.

   
Relations to Notions of Lexical Semantics

Bilingual resources often contain information which disambiguates the usage of words in the target languages, but only insofar as it is necessary to select the correct alternative. The information takes the form of semantic classes, selection restrictions, register and domain labels or morpho-syntactic information, but it requires considerable processing to differentiate between them. Somewhat more sophisticated information is available in the form of the examples and the translations of the examples. The combinatoric constraints provide very useful information comparable to Mel'cuk's lexical semantic functions [Mel89].

   
LE Uses

Obviously, bilingual dictionaries are useful input for constructing Machine-Translation systems, although a lot of additional work has to be carried out to formalize and complete the information. In Acquilex, the Van Dale dictionary has nevertheless been used to automatically extract equivalence relations between English and Dutch word senses (see [Cop95b]).

Because its elements are tagged by function, the OHFD is a very convenient dictionary from which to retrieve information. It has been successfully used to design an intelligent dictionary lookup tool, Locolex, first developed in the framework of the COMPASS European project.

Furthermore, [Mak95] show that it is possible to achieve a high degree of sense disambiguation using the rich annotations in the examples and their translations in bilingual dictionaries.



EAGLES Central Secretariat eagles@ilc.cnr.it