
Bilingual Dictionaries

In the final section of this chapter, we describe two traditional bilingual dictionaries that have been used in various projects to derive information for NLP, or that are used directly in computerized tools. The bilingual Oxford-Hachette French Dictionary and the Van Dale Dutch-English dictionary are merely illustrative of the many other bilingual dictionaries that can be used in this way.

The bilingual Oxford Hachette French dictionary

The bilingual Oxford-Hachette French Dictionary (French-English) (OHFD) is intended for general use and is not specific to any domain. It includes most abbreviations and acronyms, many prefixes and suffixes, and some proper names of people and places in cases of unpredictable translation problems. It also includes semi-technical terms which might be found in standard texts, but omits highly domain-specific technical terms. It is designed to be used for production, translation, or comprehension, by native speakers of either English or French.
The electronic version of the OHFD is an SGML-tagged dictionary in which each element is tagged by function. For instance, there are tags to indicate part of speech, different meanings within a part of speech, pronunciation, usage, idiomatic expressions, domains, subject and object collocates, prepositions, etc. Unfortunately, translations are not tagged at all, which sometimes makes the dictionary difficult to parse.
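As a rough illustration of what "tagged by function" means for processing, the following Python sketch pulls the tagged elements out of a fragment and treats whatever text remains as candidate translations, since translations carry no tag of their own. The fragment is the "matrix" excerpt quoted later in this section. The function names are hypothetical, the regular expression handles only simple non-nested elements, and a real SGML file would need a proper DTD-aware parser.

```python
import re

# Fragment copied from the "matrix" example in this section ("..." elisions kept).
ENTRY = (
    "matrix:...<la>Anat</la>, <la>Comput</la>, <la>Ling</la>, <la>Math</la>, "
    "<la>Print</la>, <la>Tech</la> matrice <gr>f</gr>; "
    "<la>Miner</la> gangue <gr>f</gr>..."
)

# Matches one simple element such as <la>Anat</la>; nesting is not handled.
ELEMENT = re.compile(r"<(\w+)>(.*?)</\1>")


def tagged_elements(fragment):
    """Return (tag, content) pairs in document order."""
    return ELEMENT.findall(fragment)


def untagged_candidates(fragment):
    """Return the text left once the headword and all tagged elements are
    removed; in this fragment these are the (untagged) translations."""
    _, _, body = fragment.partition(":")
    rest = ELEMENT.sub(" ", body)
    pieces = re.split(r"\.\.\.|[;,]", rest)
    return [p.strip() for p in pieces if p.strip()]


if __name__ == "__main__":
    print(tagged_elements(ENTRY))
    print(untagged_candidates(ENTRY))
```

Recovering translations by subtraction in this way works on a clean fragment, but it is fragile in general, which is consistent with the parsing difficulty noted above.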

Size

The English-French side has about 47539 entries (most compounds are entries in their own right), divided into 31061 nouns, 11089 adjectives, 5632 verbs, 2761 adverbs and 165 others. There are 407 top-level labels, a good part of which include a distinction between British English and American English. A label can be a single concept (Bio) or a combination of a concept and language level (GB Jur).

The French-English side has 38944 entries (about 10000 compounds are listed within the entries themselves), divided into 25415 nouns, 8399 adjectives, 4805 verbs, 1164 adverbs and 890 others. There are about 200 top-level labels.

Homographs

Homographs are represented in two ways:

As separate entries. (We shall henceforth use "homograph" to denote these separate entries.) In this case, a number is included with the headword lemma to distinguish it from its homographs, as in:
row<hm>1</hm>...<ph>r@U</ph>...(<ic>line</ic>)...
row<hm>2</hm>...<ph>raU</ph>... (<ic>dispute</ic>)...;
As major subdivisions within a single entry. (We shall henceforth use "grammatical category" to denote such subdivisions.) In the OHFD, these are labelled with roman numerals, as in:
row<hm>1</hm>...<ph>r@U</ph>...
<s1 num=I><ps>n</ps> ...(<ic>line</ic>)...
<s1 num=III><ps>vi</ps> <ann><la>Naut</la>,
<la>Sport</la>...
The basis for deciding when to assign separate homograph entries to identically spelt words is a difference of pronunciation, not semantic or etymological independence. In addition, function words are given a separate entry from homographs even of the same pronunciation, as in:
mine<hm>1</hm>...<ph>maIn</ph>...<fi>le mien</fi>, <fi>la mienne</fi>...
mine<hm>2</hm>...<ph>maIn</ph>...<la>Mining</la> mine <gr>f</gr>...
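The homograph convention can be checked mechanically. The sketch below, which assumes entry lines shaped like the excerpts above (headword, `<hm>` number, then a `<ph>` transcription somewhere later), indexes homographs by headword and reports whether they differ in pronunciation. The function names are illustrative only.

```python
import re

# Headword, homograph number, and the first pronunciation that follows.
HOMOGRAPH = re.compile(
    r"^(?P<lemma>[^<]+)<hm>(?P<num>\d+)</hm>.*?<ph>(?P<ph>[^<]+)</ph>"
)

# Lines copied from the excerpts in this section.
LINES = [
    "row<hm>1</hm>...<ph>r@U</ph>...(<ic>line</ic>)...",
    "row<hm>2</hm>...<ph>raU</ph>... (<ic>dispute</ic>)...;",
    "mine<hm>1</hm>...<ph>maIn</ph>...<fi>le mien</fi>, <fi>la mienne</fi>...",
    "mine<hm>2</hm>...<ph>maIn</ph>...<la>Mining</la> mine <gr>f</gr>...",
]


def homograph_index(lines):
    """Map each headword to a list of (homograph number, pronunciation)."""
    index = {}
    for line in lines:
        m = HOMOGRAPH.match(line)
        if m:
            index.setdefault(m["lemma"], []).append((int(m["num"]), m["ph"]))
    return index


def differs_in_pronunciation(index):
    """True where the homographs of a headword have distinct pronunciations."""
    return {lemma: len({ph for _, ph in hs}) > 1 for lemma, hs in index.items()}


if __name__ == "__main__":
    print(differs_in_pronunciation(homograph_index(LINES)))
```

On these four lines, "row" is split by pronunciation while "mine" is not, which matches the function-word exception described above.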

Sense Counter

Monosemous words have no overt identifier of their single sense, as in:
mineralogy:...<hg><ps>n</ps></hg> minéralogie <gr>f</gr>.</se>
A sense counter is found only in the entries for polysemous words, as in:
migrant:...<s2 num=1><la>Sociol</la>... <s2 num=2><la>Zool</la>...
Senses are distinguished in considerable detail, although it should be remembered that in a bilingual dictionary the sense differentiation of the headword is often affected by target language (TL) equivalence. The original source language (SL) analysis of, for instance, the English word "column" would yield eight or nine senses, covering architectural, military and newspaper columns, as well as columns of figures and columns of smoke; with French as the TL, there is only one "sense" in the "column" entry, since every sense of the English word has the French equivalent "colonne".
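A minimal sketch of how the sense counter might be read off an entry, assuming senses are introduced by `<s2 num=N>` as in the excerpts above and that an entry with no counter has exactly one sense. The excerpts are elided, so the counts apply to the fragments shown and not necessarily to the full entries.

```python
import re

SENSE = re.compile(r"<s2 num=(\d+)>")

# Fragments copied from this section ("..." elisions kept).
MIGRANT = "migrant:...<s2 num=1><la>Sociol</la>... <s2 num=2><la>Zool</la>..."
MINERALOGY = "mineralogy:...<hg><ps>n</ps></hg> minéralogie <gr>f</gr>.</se>"


def sense_numbers(fragment):
    """Return the explicit sense numbers found in the fragment."""
    return [int(n) for n in SENSE.findall(fragment)]


def sense_count(fragment):
    """Count senses, treating an entry without a counter as monosemous."""
    return max(1, len(sense_numbers(fragment)))


if __name__ == "__main__":
    print(sense_count(MIGRANT), sense_count(MINERALOGY))
```

Because sense division follows TL equivalence, such counts describe the bilingual entry and should not be read as a count of SL senses.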

Word Usage Labels

They include:

Stylistic Label: An example of a style label is "littér" (literary) in
milling:...<la>littér</la> <co>crowd</co> grouillant...
Other items may be marked as belonging to administrative, or technical, or poetic language.
The OHFD also marks formal terms as "fml" (French "sout" for "soutenu"). However, in order to avoid terms such as "colloquial" or "familiar", which are open to individual interpretation, the OHFD has devised a system of symbols showing points on a scale from (roughly) "informal" to "informal, may be taboo".
The presence of these labels, which may be attached to SL and to TL items, allows both the decoder and the encoder to align words of similar style in the two languages.
Diatechnical Label: Semantic domain labels are exceedingly frequent in the OHFD, as may be seen in:
matrix:...<la>Anat</la>, <la>Comput</la>, <la>Ling</la>, <la>Math</la>,
<la>Print</la>, <la>Tech</la> matrice <gr>f</gr>; <la>Miner</la> gangue
<gr>f</gr>...
Moreover, the dictionary tapes actually contain a more comprehensive coverage of semantic domain labels than appears in the printed text. Whenever the lexicographers believed that the lemma was frequently (or exclusively) used when a particular subject was under discussion, they marked the lexical unit with an appropriate semantic domain marker.
Diatopic Label: This type of labelling is used in the print dictionary to mark such regional varieties as Belgian French, Swiss German and American English, as "US" in
math:... <la>US</la> =<xr><x>maths</x></xr>...
or "GB" in
bedroom:...<le>a two &hw. flat <la>GB</la> ou apartment</le> un trois pièces...
Diachronic Label: This type of labelling allows words or usages to be marked as "old-fashioned", "obsolete" or "archaic", etc. In the following, "mae west" is marked as old-fashioned:
mae west:...&dated....
This marker's presence (it may be attached to SL and to TL items) allows both the decoder and the encoder to align words of similar currency in the two languages.
Evaluative Label: This warning label is used to indicate a favourable ("appreciative") or unfavourable ("pejorative") attitude in the speaker or writer, as "(pej)" in:
macho:...<la>pej</la>macho;...
where it shows that to describe someone as "macho" is not a compliment (in English at least). Its presence (it may be attached to SL and to TL items) allows both the decoder and the encoder to align words indicating similar attitudes in the two languages.
Frequency Label: These indicate the uncommon or rare word forms, as in:
stride:...<s1 num=III><ps>vtr</ps> (<gr>prét</gr> <fs>strode</fs>,
<gr>pp</gr> <la>rare</la> <fs>stridden</fs>)...
Figuration Label: These indicate figurative, literal, or proverbial uses as in:
mist:...<la>fig</la> (<ic>of tears</ic>) voile...
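All of these label types share the single `<la>` tag in the excerpts, so a user of the data has to sort them by value. The sketch below shows one way this might be done. The lookup table is hypothetical: it contains only labels that appear in the examples of this section, assigned to the label types under which they are discussed, and is in no way the OHFD's own inventory. Combined labels such as "US Sch" are split on whitespace.

```python
# Hypothetical lookup built from labels quoted in this section only.
LABEL_TYPES = {
    "littér": "stylistic", "fml": "stylistic", "sout": "stylistic",
    "Anat": "diatechnical", "Comput": "diatechnical", "Ling": "diatechnical",
    "Math": "diatechnical", "Print": "diatechnical", "Tech": "diatechnical",
    "Miner": "diatechnical", "Sch": "diatechnical", "Jur": "diatechnical",
    "US": "diatopic", "GB": "diatopic",
    "pej": "evaluative",
    "rare": "frequency",
    "fig": "figuration",
}


def classify_label(label):
    """Split a (possibly combined) label and classify each part."""
    return [(part, LABEL_TYPES.get(part, "unknown")) for part in label.split()]


if __name__ == "__main__":
    for raw in ("littér", "US Sch", "GB Jur", "xyz"):
        print(raw, classify_label(raw))
```

Returning "unknown" for unlisted values keeps the sketch honest about how incomplete the table is. Diachronic marking does not appear here because the excerpt above encodes it as an entity (`&dated.`) and not as a `<la>` element.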

Cultural Equivalent

For certain culture-specific concepts, the source lemma does not have a direct semantic equivalent, but there is an analogous concept in the target language culture which serves as a translation, as in:
high school:...<la>US Sch</la> &appr. lycée <gr>m</gr>;...

Sense Indicators

The indicator may be a synonym or paraphrase in the form of a single word, as "information" or "data" in:
material:... (<ic>information, data</ic>) documentation...
or a phrase, as "become proficient in" in:
master: ...(<ic>learn, become proficient in or with</ic>) maîtriser <co>subject, language, controls, computers, theory, basics, complexities</co>;...
The OHFD also includes sense clue labels as additional information. The sense clue is usually a brief phrase, as "of specified nature" or "requiring solution" or "on agenda" in:
matter... <s2 num=1><la>gen</la> chose <gr>f</gr>; (<ic>of specified nature</ic>) affaire <gr>f</gr>; (<ic>requiring solution</ic>) problème <gr>m</gr>; (<ic>on agenda</ic>) point <gr>m</gr>;...
These may also be used to distinguish subsenses more finely within the same substituting indicator, like "in chess" and "in draughts" here:

man :...<s2 num=7><la>Games</la> (<ic>piece</ic>) (<ic>in chess</ic>) pièce <gr>f</gr>; (<ic>in draughts</ic>) pion <gr>m</gr></s2>; ...
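The sense indicators above can be used as lookup keys for translations. The following sketch assumes the pattern visible in the "matter" and "man" excerpts: an `<ic>` element in parentheses, followed directly by the untagged translation and a `<gr>` gender tag. The function names are illustrative, and the pattern will miss entries that deviate from this shape.

```python
import re

# "(<ic>clue</ic>) translation <gr>gender</gr>"; the translation may not
# contain "(", so an indicator followed by another indicator is skipped.
CLUE = re.compile(r"\(<ic>([^<]+)</ic>\)\s*([^<;(]+?)\s*<gr>(\w+)</gr>")

# Fragments copied from this section ("..." elisions kept).
MATTER = (
    "matter... <s2 num=1><la>gen</la> chose <gr>f</gr>; "
    "(<ic>of specified nature</ic>) affaire <gr>f</gr>; "
    "(<ic>requiring solution</ic>) problème <gr>m</gr>; "
    "(<ic>on agenda</ic>) point <gr>m</gr>;..."
)
MAN = (
    "man :...<s2 num=7><la>Games</la> (<ic>piece</ic>) "
    "(<ic>in chess</ic>) pièce <gr>f</gr>; "
    "(<ic>in draughts</ic>) pion <gr>m</gr></s2>; ..."
)


def clue_translations(fragment):
    """Return (sense clue, translation, gender) triples."""
    return CLUE.findall(fragment)


def translation_for(fragment, clue):
    """Look up the translation attached to a given sense clue, if any."""
    table = {c: t for c, t, _ in clue_translations(fragment)}
    return table.get(clue)


if __name__ == "__main__":
    print(clue_translations(MATTER))
    print(translation_for(MAN, "in draughts"))
```

In the "man" fragment the general indicator "piece" has no translation of its own, so only the two finer clues are returned.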

Subcategorisation

Subcategorisation information includes indications of prepositional usage and of subject and object collocates for semantic complementation. The OHFD attempts to show all the structures necessary if the full semantic potential of the translation equivalent(s) is to be expressed grammatically. Examples of these structures ("through, au moyen de") are to be seen in:
mediate:...diffuser <co>idea, cult</co> (<pp><sp>through</sp> au moyen de, par</pp>)...
Similar to this is information relating to the obligatory grammatical environment of the translation word, such as "(+ subj)" in:
marvel:... <ls>to &hw. that</ls> s'étonner de ce que (<gr>+ subj</gr>)...

Collocators

The type of collocator which may be offered depends on the word class of the headword; in the print dictionary the following types of collocators are used (the relationship is of course with one lexical unit, i.e. a single combination of lexical component and semantic component, a monosemous word or a polysemous word in one of its senses):

Verb headwords have as collocators nouns that are typical subjects of the verb as in:
merge:...<co>roads, rivers</co> se rejoindre; ...,
or nouns that are typical objects of the verb as in
merge: ...<lo>to &hw. sth into ou with sth</lo> incorporer qch en qch
<co>company, group</co>...
Adjective headwords have as collocators nouns that typically are modified by the adjective as in
messy:...(<ic>untidy</ic>) <co>house, room</co> en désordre;
Noun headwords have as collocators one of the following:
- (for deverbal nouns) nouns that are the object of the cognate verb, as in
 management: ... (<ic>of business, company, hotel</ic>) gestion ...
 or that are the subject of the cognate verb, as in
 maturation: ...(<ic>of whisky, wine</ic>) vieillissement ...;
- (for nouns that are nominalisations of adjectives) nouns modified by the cognate adjective, as in
 mildness:...(<ic>of protest</ic>) modération;
- (for concrete nouns with real-world referents) nouns to which the headword is related by meronymy, or nouns naming objects that stand in some other real-world relationship to the referent of the headword noun, as in
 mug:...(<ic>for beer</ic>) chope...
- (for nouns used to modify other nouns) nouns typical of the semantic set(s) thus modified, as in
 mango:...(<ic>tree</ic>) manguier ... [<lc>grove</lc>] de manguiers;
Adverb headwords have as collocators typical verbs and/or adjectives modified by the adverb, as in
 marvellously:...<co>sing, get on</co> à merveille; <co>clever, painted</co> merveilleusement;

Collocations

Often, even within a single semantic sense, the lemma will translate differently depending on words which appear with it. For example, in:

accident: ...[<lc>figures</lc>, <lc>statistics</lc>] se rapportant aux accidents; [<lc>protection</lc>] contre les accidents;...

the lemma should be translated as "se rapportant aux accidents" when it appears with "statistics", but as "contre les accidents" when it appears with "protection".
Collocations (tagged <lc> in the OHFD) should not be confused either with compounds (multi-word lexemes, tagged <cw>), which include and translate the co-occurring words as part of the lemma, or with collocators (tagged <co>), which help to identify distinct senses of the lemma.
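The collocation-dependent choice of translation described for "accident" can be sketched as a lookup table keyed by the co-occurring word. The code assumes the bracketed shape of the excerpt, `[<lc>…</lc>, …] translation;`, and uses hypothetical function names.

```python
import re

# One bracketed group of <lc> collocates followed by its translation.
GROUP = re.compile(
    r"\[(?P<lcs>(?:<lc>[^<]+</lc>,?\s*)+)\]\s*(?P<tr>[^;\[]+);"
)
LC = re.compile(r"<lc>([^<]+)</lc>")

# Fragment copied from this section ("..." elisions kept).
ACCIDENT = (
    "accident: ...[<lc>figures</lc>, <lc>statistics</lc>] se rapportant aux "
    "accidents; [<lc>protection</lc>] contre les accidents;..."
)


def collocation_table(fragment):
    """Map each collocate to the translation of the lemma in its company."""
    table = {}
    for m in GROUP.finditer(fragment):
        for word in LC.findall(m["lcs"]):
            table[word] = m["tr"].strip()
    return table


if __name__ == "__main__":
    table = collocation_table(ACCIDENT)
    print(table["statistics"])
    print(table["protection"])
```

A lookup of this kind covers only the collocates the lexicographers chose to list. Generalising from "statistics" to semantically similar nouns is a separate problem that the dictionary does not solve by itself.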

Multi-Word Lexeme

Multi-word lexical units occurring as sublemma forms may generate almost all the lexicographic items that normally constitute a full dictionary entry, with the exception of a phonetic transcription. The three principal types of multi-word sublemmas are:
(a) compounds, as in:
mud:...<cw>mudbank</cw>... banc <gr>m</gr> de vase; ...<cw>mud bath</cw>...(<ic>for person, animal</ic>) bain <gr>m</gr>de boue;...
(b) phrasal verbs, as in:
miss:...<pvp><lp>&hw. out</lp> être lésé; ...
(c) idioms, as in:
miss:...<id><li>to &hw. the boat ou bus</li>&coll. rater le coche</id>;...
Multi-word lexemes may range from fixed phrases (e.g. "by and large") through idiomatic verb phrases ("raining cats and dogs") to whole sentences, such as proverbs or phatic phrases ("Have a nice day").

Gloss

This is given when there is no direct TL equivalent of the lemma, as in:
mid-terrace:...[<lc>house</lc>, <lc>property</lc>] <gl>situé au milieu d'un alignement de maisons identiques et contiguës</gl>...

Van Dale Bilingual Dutch-English

The Van Dale Bilingual Dictionaries are developed for native Dutch speakers. This means that the resources contain only very limited information on the Dutch words and much more information on the foreign-language target words. The Dutch-English dictionary is described here [Mar86].

The Van Dale Dutch-English is a traditional bilingual dictionary. It is structured using a tagged field structure, which makes it relatively easy to extract a computer-tractable database from it. However, the scope of the fields is not always explicit, and the values within the fields are often undifferentiated free text.

The entry structure is homograph-based, but homographs are distinguished only when the part of speech and/or the pronunciation differs. Sub-homographs are used when senses differ in major grammatical properties such as valency, countability, or predicative/attributive usage. The figures supplied in Table 3.17 provide an indication of size and coverage.

Table 3.17: Number of Entries, Senses and Translations in the Van Dale Dutch-English Dictionary

Entries	90925
Homographs	2967
Sub-homographs	6769
Senses	127024
Main Translations	145511
Secondary Translations	104181
Examples	111226
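To make the figures in Table 3.17 easier to interpret, the short calculation below derives a few averages from them. It assumes that the categories can be compared directly (for instance, that "Senses" counts all senses across all entries). The table itself does not state this, so the ratios are indicative only.

```python
# Counts copied from Table 3.17.
COUNTS = {
    "Entries": 90925,
    "Homographs": 2967,
    "Sub-homographs": 6769,
    "Senses": 127024,
    "Main Translations": 145511,
    "Secondary Translations": 104181,
    "Examples": 111226,
}

# Average number of senses per entry.
senses_per_entry = COUNTS["Senses"] / COUNTS["Entries"]

# Average number of translations (main plus secondary) per sense.
translations_per_sense = (
    COUNTS["Main Translations"] + COUNTS["Secondary Translations"]
) / COUNTS["Senses"]

# Average number of examples per entry.
examples_per_entry = COUNTS["Examples"] / COUNTS["Entries"]

if __name__ == "__main__":
    print(f"senses per entry:       {senses_per_entry:.2f}")
    print(f"translations per sense: {translations_per_sense:.2f}")
    print(f"examples per entry:     {examples_per_entry:.2f}")
```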

In addition to some grammatical information on the Dutch words and the English translations, the dictionary contains a large amount of semantic information restricting the senses and the translations:

Sense-indicators: (53368 tokens) specify the Dutch senses of polysemous entries. These contain bits and pieces from original definitions (often a genus word).
Biological gender marker: for English translations. This is necessary to differentiate translations when the source and target language have different words for the male and female of a species: 286 translations are labelled as male, 407 translations as female.
Usage labels for domain, style and register: these apply to both Dutch senses and their English translations.
Dialect labels: for Dutch senses and their English translations.
Context markers: (23723 tokens, 16482 types). These are semantic constraints that differentiate the contexts of multiple translations and limit the scope of translations having a narrower context than the Dutch source sense.

The usage labels and the domain labels are mostly stored in the same field. Differentiation has to be done by some parsing. The usage labels form a limited closed set of abbreviations and codes, the domain labels are free text. For the main-translations about 400 different types of values occur.

The context markers and sense-indicators are in textual form (Dutch). They may be interpreted as domain labels, selection restrictions, semantic classifications or semantic properties. The difference in interpretation is not coded, but it can partially be inferred, e.g.:

a noun used as a constraint or sense-indicator for a verb or adjective is mostly a selection restriction;
a noun used as a constraint for a noun is a classification;
an adjective constraining a noun is a feature;
an adverb constraining a verb indicates a more specific manner.
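The four heuristics above amount to a small decision table keyed on the part of speech of the marker and of the word it constrains. The sketch below encodes them directly. It assumes that both parts of speech are already known (the marker text would first have to be tagged, which the dictionary does not do), and the function name and category strings are illustrative. Since the text says the noun-verb and noun-adjective case holds "mostly", the output is a default guess and not a certain classification.

```python
# (part of speech of marker, part of speech of constrained word) -> reading
RULES = {
    ("noun", "verb"): "selection restriction",
    ("noun", "adjective"): "selection restriction",
    ("noun", "noun"): "classification",
    ("adjective", "noun"): "feature",
    ("adverb", "verb"): "manner",
}


def interpret_marker(marker_pos, head_pos):
    """Guess how a context marker or sense-indicator should be read."""
    return RULES.get((marker_pos, head_pos), "undetermined")


if __name__ == "__main__":
    print(interpret_marker("noun", "verb"))
    print(interpret_marker("adjective", "noun"))
    print(interpret_marker("verb", "noun"))
```

Combinations not covered by the list fall through to "undetermined", reflecting the fact that the interpretation can only partially be inferred.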

Finally, a lot of collocational information is stored in the examples and their translations. The typical combination words are marked and distinguished per part of speech. If the combination is compositional, the corresponding meaning of the entry is given; an idiosyncratic collocation is flagged with a mark. The examples and their translations can be seen as a partially structured context specification for the Dutch and English word pairs.

Relations to Notions of Lexical Semantics

Bilingual resources often contain information which disambiguates the usage of words in the target language, but only insofar as it is necessary to select the correct alternative. The information takes the form of semantic classes, selection restrictions, register and domain labels or morpho-syntactic information, but it requires considerable processing to differentiate between them. Somewhat more sophisticated information is available in the form of the examples and the translations of the examples. The combinatoric constraints provide very useful information comparable to Mel'čuk's lexical semantic functions [Mel89].

LE Uses

Obviously, bilingual dictionaries are useful input for constructing Machine-Translation systems, although a lot of additional work has to be carried out to formalize and complete the information. In Acquilex, a bilingual dictionary has nevertheless been used to automatically extract equivalence relations between English and Dutch word senses (see [Cop95b]).

Because its elements are tagged by function, the OHFD is a very convenient dictionary to retrieve information from. It has been successfully used to design an intelligent dictionary lookup tool, Locolex, first developed in the framework of the COMPASS European project.

Furthermore, [Mak95] show that it is possible to achieve a high degree of sense disambiguation using the rich annotations in the examples and their translations in bilingual dictionaries.


EAGLES Central Secretariat eagles@ilc.cnr.it