Next: Semantic Requirements for NL
Up: Lexical Semantic Resources
Previous: Experimental NLP lexicons
Bilingual Dictionaries
In the final section of this chapter, two traditional bilingual
dictionaries will be described that have been used in various projects
to derive information for NLP or that are directly used in
computerized tools. The bilingual Oxford-Hachette French Dictionary and the Van
Dale Dutch-English dictionary merely illustrate the many other
bilingual dictionaries that can be used in this way.
The bilingual Oxford Hachette French dictionary
The bilingual Oxford-Hachette French Dictionary (French-English)
(OHFD) is intended for general use and is not specific to any domain.
It includes most abbreviations, acronyms, many prefixes and suffixes,
and some proper names of people and places in cases of unpredictable
translation problems. It also includes semi-technical terms which
might be found in standard texts, but omits highly domain-specific
technical terms. It is designed to be used for production,
translation, or comprehension, by native speakers of either English or
French.
The electronic version of the OHFD is an SGML-tagged
dictionary in which each element is tagged by function. For
instance, there are tags to indicate part of speech, different meanings
within a part of speech, pronunciation, usage, idiomatic expressions,
domains, subject and object collocates, prepositions,
etc. Unfortunately, translations are not tagged at all, which sometimes
makes the dictionary difficult to parse.
The English-French side has about 47539 entries (most of the
compounds are entries by themselves) which are divided into: 31061
nouns, 11089 adjectives, 5632 verbs, 2761 adverbs and 165 others. There
are 407 top-level labels, a good part of which include a distinction
between British English and American English. A label can be just one
concept (Bio), or a combination of a concept and a language level (GB
Jur).
The French-English side has 38944 entries (about 10,000 compounds
appear within other entries rather than as entries by themselves) which
are divided into: 25415 nouns, 8399 adjectives, 4805 verbs, 1164 adverbs
and 890 others. There are about 200 top-level labels.
Homographs are represented in two ways:
- As separate entries. (We shall henceforth use "homograph" to denote these
separate entries.) In this case, a number is included with the
headword lemma to distinguish it from its homographs, as in:
row<hm>1</hm>...<ph>r@U</ph>...(<ic>line</ic>)...
row<hm>2</hm>...<ph>raU</ph>... (<ic>dispute</ic>)...;
- As major subdivisions within a single entry. (We shall henceforth use
"grammatical category" to denote such subdivisions.) In the OHFD, these
are labelled with roman numerals, as in:
row<hm>1</hm>...<ph>r@U</ph>...
<s1 num=I><ps>n</ps> ...(<ic>line</ic>)...
<s1 num=III><ps>vi</ps> <ann><la>Naut</la>,
<la>Sport</la>...
The basis for deciding when to assign separate homograph entries to identically
spelt words is a difference of pronunciation, not semantic or etymological
independence. In addition, function words are given a separate entry
from homographs even of the same pronunciation, as in:
mine<hm>1</hm>...<ph>maIn</ph>...<fi>le mien</fi>, <fi>la mienne</fi>...
mine<hm>2</hm>...<ph>maIn</ph>...<la>Mining</la> mine <gr>f</gr>...
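The homograph criterion described above can be checked mechanically. The following is a minimal sketch, assuming that each homograph line begins with the headword followed by an <hm> number and contains a <ph> element, as in the fragments shown; the function names are hypothetical and the regular expression has not been tested against the full dictionary files.

```python
import re
from collections import defaultdict

# Fragments copied from the examples above; "..." marks elided material.
LINES = [
    "row<hm>1</hm>...<ph>r@U</ph>...(<ic>line</ic>)...",
    "row<hm>2</hm>...<ph>raU</ph>... (<ic>dispute</ic>)...;",
    "mine<hm>1</hm>...<ph>maIn</ph>...<fi>le mien</fi>, <fi>la mienne</fi>...",
    "mine<hm>2</hm>...<ph>maIn</ph>...<la>Mining</la> mine <gr>f</gr>...",
]

# Headword, homograph number, then the first <ph> element on the line.
HOMOGRAPH = re.compile(
    r"^(?P<lemma>[^<]+)<hm>(?P<num>\d+)</hm>.*?<ph>(?P<ph>[^<]*)</ph>"
)

def group_homographs(lines):
    """Map each headword to a list of (homograph number, pronunciation)."""
    groups = defaultdict(list)
    for line in lines:
        m = HOMOGRAPH.match(line)
        if m:
            lemma = m.group("lemma").strip()
            groups[lemma].append((int(m.group("num")), m.group("ph")))
    return dict(groups)

def differ_in_pronunciation(homographs):
    """True if the homographs of one headword have more than one pronunciation."""
    return len({ph for _, ph in homographs}) > 1

groups = group_homographs(LINES)
for lemma, homs in groups.items():
    print(lemma, homs, differ_in_pronunciation(homs))
```

On these fragments, the two "row" entries differ in pronunciation, whereas the two "mine" entries do not; the latter are separated because one of them is a function word.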
Monosemous words have no overt identifier of their single sense, as in:
mineralogy:...<hg><ps>n</ps></hg> minéralogie <gr>f</gr>.</se>
A sense number is found only in the entries for polysemous words, as in:
migrant:...<s2 num=1><la>Sociol</la>... <s2 num=2><la>Zool</la>...
Senses are distinguished in considerable detail, although it should be remembered that in a bilingual
dictionary the sense differentiation of the headword is often affected by target language
(TL) equivalence. The original source language (SL) analysis of, for instance, the English word
'column' would yield eight or nine senses, covering architectural, military
and newspaper columns, as well as columns of figures and columns of
smoke; with French as the TL, there is only one 'sense' in the 'column'
entry, since every sense of the English word has the French equivalent
'colonne'.
The labels used in the OHFD include:
- Stylistic Label:
An example of a style label is "littér" (literary) in
milling:...<la>littér</la> <co>crowd</co> grouillant...
Other items may be marked as belonging to administrative, or technical, or
poetic language.
The OHFD also marks formal terms as "fml" (French "sout" for "soutenu").
However, in order to avoid terms such as "colloquial" or "familiar", which
are open to individual interpretation, the OHFD has devised a
system of symbols showing points on a scale from (roughly) "informal" to "informal, may be taboo".
The presence of these labels, which may be attached to SL and to TL items,
allows both the decoder and the encoder to align words of similar style in
the two languages.
- Diatechnical Label:
Semantic domain labels are exceedingly frequent in the OHFD, as may be
seen in:
matrix:...<la>Anat</la>, <la>Comput</la>, <la>Ling</la>, <la>Math</la>,
<la>Print</la>, <la>Tech</la> matrice <gr>f</gr>; <la>Miner</la> gangue
<gr>f</gr>...
Moreover, the dictionary tapes actually contain a
more comprehensive coverage of semantic domain labels than appears in
the printed text. Whenever the lexicographers believed that the lemma
was frequently (or exclusively) used when a particular subject was under
discussion, they marked the lexical unit with an appropriate semantic
domain marker.
- Diatopic Label:
This type of labelling is used in the print dictionary to mark such regional
varieties as Belgian French, Swiss German and American English, as "US"
in
math:... <la>US</la> =<xr><x>maths</x></xr>...
or "GB" in
bedroom:...<le>a two &hw. flat <la>GB</la> <u>ou</u> apartment</le> un
trois pièces...
- Diachronic Label:
This type of labelling allows words or usages to be marked as "old-fashioned",
"obsolete" or "archaic", etc. In the following, "mae west" is marked as
old-fashioned:
mae west:...&dated....
This marker's presence (it may be attached to SL and to TL items) allows both
the decoder and the encoder to align words of similar currency in the two
languages.
- Evaluative Label:
This warning label is used to indicate a favourable ("appreciative") or
unfavourable ("pejorative") attitude in the speaker or writer, as "(pej)" in:
macho:...<la>pej</la>macho;...
where it shows that to describe someone as "macho" is not a compliment
(in English at least). Its presence (it may be attached to SL and to TL
items) allows both the decoder and the encoder to align words indicating
similar attitudes in the two languages.
- Frequency Label:
These indicate uncommon or rare word forms, as in:
stride:...<s1 num=III><ps>vtr</ps> (<gr>prét</gr> <fs>strode</fs>,
<gr>pp</gr> <la>rare</la> <fs>stridden</fs>)...
- Figuration Label:
These indicate figurative, literal, or proverbial uses as in:
mist:...<la>fig</la> (<ic>of tears</ic>) voile...
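Since all of these labels share the <la> tag, they can be pulled out of an entry with a single pattern. The following is a minimal sketch, assuming that labels never contain nested tags; the helper names are hypothetical, and the set of regional codes is limited to the two ("US", "GB") that appear in the examples here, not the dictionary's full inventory.

```python
import re

LABEL = re.compile(r"<la>([^<]*)</la>")

# Assumption: only the two regional codes shown in the text are listed here.
REGIONAL = {"US", "GB"}

def labels(fragment):
    """Return the contents of every <la> element, in order of appearance."""
    return LABEL.findall(fragment)

def split_label(label):
    """Split a combined label such as 'US Sch' into (regional codes, other parts)."""
    parts = label.split()
    regional = [p for p in parts if p in REGIONAL]
    other = [p for p in parts if p not in REGIONAL]
    return regional, other

# The "matrix" fragment from the diatechnical label example, joined onto one line.
matrix = ("matrix:...<la>Anat</la>, <la>Comput</la>, <la>Ling</la>, <la>Math</la>, "
          "<la>Print</la>, <la>Tech</la> matrice <gr>f</gr>; <la>Miner</la> gangue "
          "<gr>f</gr>...")

print(labels(matrix))
print(split_label("US Sch"))
print(split_label("GB Jur"))
```

Telling a domain label from a stylistic or evaluative one would additionally require the closed lists of label codes, which are not reproduced in this section.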
For certain culture-specific concepts, the source lemma does not have a direct
semantic equivalent, but there is an analogous concept in the target
language culture which serves as a translation, as in:
high school:...<la>US Sch</la> &appr. lycée <gr>m</gr>;...
The indicator may be a synonym or paraphrase in the form of a single word, as
"information" or "data" in:
material:... (<ic>information, data</ic>) documentation...
or a phrase, as "become proficient in" in:
master: ...(<ic>learn, become proficient in or with</ic>) maîtriser
<co>subject, language, controls, computers, theory, basics,
complexities</co>;...
The OHFD also includes sense clue labels as additional information.
The sense clue is usually a brief phrase, as "of specified nature" or "requiring
solution" or "on agenda" in:
matter... <s2 num=1><la>gen</la> chose <gr>f</gr>; (<ic>of specified
nature</ic>) affaire <gr>f</gr>; (<ic>requiring solution</ic>) problème
<gr>m</gr>; (<ic>on agenda</ic>) point <gr>m</gr>;...
These may also be used to more finely distinguish subsenses within the same
substituting indicator, like "in chess" and "in draughts" here:
man :...<s2 num=7><la>Games</la> (<ic>piece</ic>) (<ic>in chess</ic>)
pièce <gr>f</gr>; (<ic>in draughts</ic>) pion <gr>m</gr></s2>; ...
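Because translations themselves are untagged, pairing an <ic> indicator with its translation has to rely on position. The following is a minimal sketch, assuming that the translation is the text between the closing parenthesis of the indicator and the next semicolon, with grammatical markers (<gr>) removed; the function name is hypothetical. Stacked indicators, as in the "man" entry, are not handled.

```python
import re

# An <ic> indicator in parentheses, followed by everything up to the next ";".
PIECE = re.compile(r"\(<ic>(?P<clue>[^<]*)</ic>\)\s*(?P<rest>[^;]*)")
GR = re.compile(r"<gr>[^<]*</gr>")   # grammatical markers such as <gr>f</gr>
TAG = re.compile(r"<[^>]+>")         # any remaining tag

def sense_clues(fragment):
    """Return (indicator, translation) pairs found in an entry fragment."""
    pairs = []
    for m in PIECE.finditer(fragment):
        rest = GR.sub("", m.group("rest"))   # drop gender markers with their content
        rest = TAG.sub("", rest)             # drop other tags, keep their text
        pairs.append((m.group("clue"), " ".join(rest.split())))
    return pairs

# The "matter" fragment from the example above, joined onto one line.
matter = ("matter... <s2 num=1><la>gen</la> chose <gr>f</gr>; (<ic>of specified "
          "nature</ic>) affaire <gr>f</gr>; (<ic>requiring solution</ic>) problème "
          "<gr>m</gr>; (<ic>on agenda</ic>) point <gr>m</gr>;...")

for clue, translation in sense_clues(matter):
    print(f"{clue!r} -> {translation!r}")
```

The general translation "chose", which carries a label but no indicator, is deliberately left out by this pattern.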
The OHFD also indicates prepositional usage and gives subject and object collocates for semantic complementation.
The OHFD attempts to show all the
structures necessary if the full semantic potential of the translation equivalent(s) is
to be expressed grammatically. Examples of these structures ("through, au
moyen de") are to be seen in:
mediate:...diffuser <co>idea, cult</co> (<pp><sp>through</sp> au moyen de, par</pp>)...
Similar to this type of information is information relating to the obligatory
grammatical environment of the translation word, such as "(+ subj)" in
marvel:... <ls>to &hw. that</ls> s'étonner de ce que (<gr>+ subj</gr>)...
The type of collocator which may be offered depends on the word class of the
headword; in the print dictionary the following types of collocators are
used (the relationship is of course with one lexical unit, i.e. a single
combination of lexical component and semantic component, a
monosemous word or a polysemous word in one of its senses):
- Verb headwords have as collocators nouns that are typical subjects of the
verb as in:
merge:...<co>roads, rivers</co> se rejoindre; ...,
or nouns that are typical objects of the verb as in
merge: ...<lo>to &hw. sth into <u>ou</u> with sth</lo> incorporer qch en qch
<co>company, group</co>...
- Adjective headwords have as collocators nouns that typically
are modified by the adjective as in
messy:...(<ic>untidy</ic>) <co>house, room</co> en désordre;
- Noun headwords have as collocators one of the following:
- (for deverbal nouns) nouns that are the object of the cognate verb as in
management: ... (<ic>of business, company, hotel</ic>)
gestion ... or that are the subject of the cognate verb as in
maturation: ...(<ic>of whisky, wine</ic>) vieillissement ...;
- (for nouns that are nominalisations of adjectives) nouns modified by the
cognate adjective as in
mildness:...(<ic>of protest</ic>) modération;
- (for concrete nouns with real-world referents) nouns to which the
headword is related by meronymy or nouns naming objects that stand in
some other real-world relationship to the object that is the referent of the
headword noun as in
mug:...(<ic>for beer</ic>) chope...
- (for nouns used to modify other nouns) the collocators given are
typical of the semantic set(s) thus modified as in
mango:...(<ic>tree</ic>) manguier ... [<lc>grove</lc>] de manguiers;
- for adverbs: typical verbs and/or adjectives modified by the adverb
as in
marvellously:...<co>sing, get on</co> à merveille;
<co>clever, painted</co> merveilleusement;
Often, even within a single semantic sense, the lemma will translate
differently depending on words which appear with it. For example,
in:
accident: ...[<lc>figures</lc>,
<lc>statistics</lc>] se rapportant aux accidents;
[<lc>protection</lc>] contre les accidents;...
the lemma should be translated as "se rapportant aux accidents" when
it appears with "statistics", but as "contre les accidents" when it
appears with "protection".
Collocations (tagged by <lc> in the
OHFD) should not be confused either with compounds (multi-word
lexemes, tagged <cw>), which include and translate the
co-occurring words as part of the lemma, or with collocators (tagged
<co>), which help to identify distinct senses of the lemma.
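The collocation-dependent translations in the "accident" example can be turned into a lookup table. The following is a minimal sketch, assuming that each group of <lc> collocates is enclosed in square brackets and that its translation runs up to the next semicolon; the function name is hypothetical.

```python
import re

# One bracketed group of <lc> collocates followed by its translation.
GROUP = re.compile(r"\[(?P<lcs>[^\]]*)\]\s*(?P<tr>[^;\[]*)")
LC = re.compile(r"<lc>([^<]*)</lc>")

def collocate_translations(fragment):
    """Map each <lc> collocate to the translation that follows its bracket."""
    table = {}
    for m in GROUP.finditer(fragment):
        translation = " ".join(m.group("tr").split())
        for collocate in LC.findall(m.group("lcs")):
            table[collocate] = translation
    return table

# The "accident" fragment from the example above, joined onto one line.
accident = ("accident: ...[<lc>figures</lc>, <lc>statistics</lc>] "
            "se rapportant aux accidents; "
            "[<lc>protection</lc>] contre les accidents;...")

table = collocate_translations(accident)
print(table["statistics"])
print(table["protection"])
```

A lookup of this kind only covers the collocates listed in the entry; how to generalise from "statistics" to semantically similar nouns is left open.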
Multi-word lexical units occurring as sublemma forms may generate almost all
the lexicographic items that normally constitute a full dictionary entry,
with the exception of a phonetic transcription. The three principal types
of multi-word sublemmas are:
(a) compounds, as in:
mud:...<cw>mudbank</cw>... banc <gr>m</gr> de vase; ...<cw>mud
bath</cw>...(<ic>for person, animal</ic>) bain <gr>m</gr>de boue;...
(b) phrasal verbs, as in:
miss:...<pvp><lp>&hw. out</lp> être lésé; ...
(c) idioms, as in:
miss:...<id><li>to &hw. the boat <u>ou</u> bus</li>&coll. rater le
coche</id>;...
Multi-word lexemes may range from fixed phrases (e.g. "by and large") through
idiomatic verb phrases ("raining cats and dogs") to whole sentences, such
as proverbs or phatic phrases ("Have a nice day").
A gloss (tagged <gl>) is given when there is no direct TL equivalent of the lemma, as in:
mid-terrace:...[<lc>house</lc>, <lc>property</lc>] <gl>situé au milieu d'un
alignement de maisons identiques et contiguës</gl>...
Van Dale Bilingual Dutch-English
The Van Dale Bilingual Dictionaries are developed for native Dutch
speakers. This means that the resources contain only very limited
information on the Dutch words and much more information on the
foreign-language target words. The Dutch-English dictionary is
described here [Mar86].
The Van Dale Dutch-English is a traditional bilingual dictionary. It
is structured using a tagged field structure, which makes it
relatively easy to extract a computer-tractable database from
it. However, the scope of the fields is not always explicit, and the
values within the fields are often undifferentiated and consist of
free text.
The entry structure is homograph-based, but homographs are
distinguished only when the part of speech and/or the
pronunciation differs. Sub-homographs are used when senses differ in major
grammatical properties such as valency, countability and
predicative/attributive usage. The figures supplied in
Table 3.17 provide an indication of size and coverage.
Table 3.17: Number of Entries, Senses and Translations in the Van Dale
Dutch-English Dictionary
Entries                | 90925
Homographs             | 2967
Sub-homographs         | 6769
Senses                 | 127024
Main Translations      | 145511
Secondary Translations | 104181
Examples               | 111226
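A few rough averages can be derived from the figures in Table 3.17. The following is a minimal sketch; it assumes that the counts are directly comparable (for instance, that every sense belongs to exactly one counted entry), which the table itself does not state, so the ratios are only indicative.

```python
# Figures copied from Table 3.17.
counts = {
    "entries": 90925,
    "homographs": 2967,
    "sub_homographs": 6769,
    "senses": 127024,
    "main_translations": 145511,
    "secondary_translations": 104181,
    "examples": 111226,
}

def ratio(numerator, denominator):
    """Average number of `numerator` items per `denominator` item, to 2 decimals."""
    return round(counts[numerator] / counts[denominator], 2)

print("senses per entry:", ratio("senses", "entries"))
print("main translations per sense:", ratio("main_translations", "senses"))
print("secondary translations per sense:", ratio("secondary_translations", "senses"))
print("examples per entry:", ratio("examples", "entries"))
```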
In addition to some grammatical information on the Dutch words and the
English translations, the dictionary contains a large amount of
semantic information restricting the senses and the translations:
- Sense-indicators (53368 tokens), which specify the Dutch senses of
polysemous entries. These contain bits and pieces from the
original definitions (often a genus word).
- Biological gender markers for English translations. These are
necessary to differentiate translations when the source and target
languages have different words for male or female species: 286
translations are labelled as male, 407 translations as female.
- Usage labels for domain, style and register. These apply to both
Dutch senses and their English translations.
- Dialect labels for Dutch senses and their English translations.
- Context markers (23723 tokens, 16482 types). These are
semantic constraints that differentiate the contexts of multiple
translations and limit the scope of translations having a narrower
context than the Dutch source sense.
The usage labels and the domain labels are mostly stored in the same
field, so they have to be differentiated by parsing. The usage
labels form a limited closed set of abbreviations and codes, whereas the
domain labels are free text. For the main translations about 400
different types of values occur.
The context markers and sense-indicators are in textual form
(Dutch). They can be interpreted variously as domain labels, selection
restrictions, semantic classifications or semantic properties. The
difference in interpretation is not coded, however, but can partially be
inferred, e.g.:
- a noun used as a constraint or sense-indicator for a verb or adjective is mostly a
selection restriction;
- a noun used as a constraint for a noun is a classification;
- an adjective constraining a noun is a feature;
- an adverb constraining a verb indicates a more specific manner.
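The inference rules listed above amount to a small lookup on the parts of speech involved. The following is a minimal sketch, assuming that the part of speech of both the marker and the headword is already known (for instance from a tagger or from the entry itself); the function name and the part-of-speech names are hypothetical. Since the rules hold only "mostly", the result should be treated as a default rather than a certainty.

```python
# (part of speech of the marker, part of speech of the headword) -> interpretation
RULES = {
    ("noun", "verb"): "selection restriction",
    ("noun", "adjective"): "selection restriction",
    ("noun", "noun"): "classification",
    ("adjective", "noun"): "feature",
    ("adverb", "verb"): "manner",
}

def interpret_marker(marker_pos, headword_pos):
    """Guess how a context marker or sense-indicator should be interpreted.

    Returns "unknown" for combinations not covered by the rules in the text.
    """
    return RULES.get((marker_pos, headword_pos), "unknown")

print(interpret_marker("noun", "verb"))       # selection restriction
print(interpret_marker("noun", "noun"))       # classification
print(interpret_marker("adjective", "noun"))  # feature
print(interpret_marker("adverb", "noun"))     # unknown
```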
Finally, a lot of collocational information is stored in the examples
and their translations. The typical combination words are marked and
distinguished per part of speech. If the combination is compositional,
the correlating meaning of the entry is given; idiosyncratic
collocations are marked as such. The examples and their
translations can be seen as partially structured context specifications
for the Dutch and English word pairs.
Relations to Notions of Lexical Semantics
Bilingual resources often contain information which disambiguates the
usage of words in the target languages, but only in so far as it is
necessary to select the correct alternative. The information takes the
form of semantic classes, selection restrictions, register and domain
labels or morpho-syntactic information, but it requires considerable
processing to differentiate between them. Somewhat more sophisticated
information is available in the form of the examples and the
translations of the examples. The combinatoric constraints provide
very useful information comparable to Mel'cuk's lexical semantic
functions [Mel89].
LE Uses
Obviously, bilingual dictionaries are useful input for constructing
machine translation systems, although a lot of additional work has to
be carried out to formalize and complete the information. In Acquilex,
a bilingual dictionary has nevertheless been used to automatically extract
equivalence relations between English and Dutch word senses (see
[Cop95b]).
Because its elements are tagged by function, the OHFD is a very convenient
dictionary from which to retrieve information. The Oxford-Hachette has been
successfully used to design an intelligent dictionary lookup tool, Locolex,
first developed in the framework of the COMPASS European project.
Furthermore, [Mak95] show that it is possible to achieve a high
degree of sense disambiguation using the rich annotations in the
examples and their translations in bilingual dictionaries.
EAGLES Central Secretariat eagles@ilc.cnr.it