
Multiword Recognition and Extraction

Introduction

This section examines approaches to multiword recognition and extraction, and to automatic term recognition (ATR). We examine linguistic and statistical approaches to ATR. There are no purely statistical approaches to ATR; the statistical techniques come instead from the areas of collocation extraction and information retrieval (IR). The second and third sections examine collocation extraction and indexing, the sub-area of IR that relates to ATR, and how they influence multiword ATR.

Survey of Approaches

Linguistic Approaches

Researchers on multiword ATR seem to agree that multiword terms are mainly noun phrases, but their opinions differ on the type of noun phrases they actually extract. In the overview that follows, most systems rely on syntactic criteria and do not use any morphological processes. An exception is Damerau's work [Dam93].

Justeson and Katz [Jus95] work on noun phrases, mostly noun compounds, including compound adjectives and verbs, albeit in very small proportions. They use the following regular expression for the extraction of noun phrases:

((Adj | Noun)+ | ((Adj | Noun)* (Noun Prep)?) (Adj | Noun)*) Noun    (5.1)

They incorporate the preposition of, but show that when of is included in the regular expression there is a significant drop in precision, too large to be justified by the possible gains in recall. Their system does not allow any term modification.
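The pattern in (5.1) can be tried out on tagged text with an ordinary regular expression engine. The following is a minimal sketch, not the implementation of [Jus95]: the tag names (ADJ, NOUN, PREP), the one-letter encoding and the minimum length of two words are assumptions made for illustration. Since the first alternative of (5.1) is a special case of the second, a single expression is enough.

```python
import re

# Assumed tag names; each tag is mapped to one letter so that the pattern
# can be matched against a plain string of tag codes.
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PREP": "P"}

# (Adj|Noun)* (Noun Prep)? (Adj|Noun)* Noun, which also covers (Adj|Noun)+ Noun.
PATTERN = re.compile(r"[AN]*(?:NP)?[AN]*N")

def candidate_phrases(tagged, min_len=2):
    """Return candidate noun phrases from a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    phrases = []
    for match in PATTERN.finditer(codes):
        if match.end() - match.start() >= min_len:  # skip single nouns
            words = tagged[match.start():match.end()]
            phrases.append(" ".join(word for word, _ in words))
    return phrases

sentence = [("the", "DET"), ("central", "ADJ"), ("processing", "NOUN"),
            ("unit", "NOUN"), ("is", "VERB"), ("fast", "ADJ")]
print(candidate_phrases(sentence))
```

Dropping the optional (Noun Prep) group from the expression gives the variant without the preposition.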

Daille et al. [Dai94] also concentrate on noun phrases. Term formation patterns for the base Multi-Word Unit (base-MWU) consist mainly of 2 elements (nouns, adjectives, verbs or adverbs). The patterns for English are:

1. Adj Noun
2. Noun Noun

while for French

1. Noun Adj
2. Noun Noun
3. Noun de (det) Noun
4. Noun prep (det) Noun

They suggest that MWU of length 3 or more are mostly built from base-MWU using one of the following operations:

1. overcomposition, ([side lobe] regrowth)
2. modification, (interfering [earth (-) station])
3. coordination, (packet assembly/disassembly)

However, their current work deals with base-MWU only.

Bourigault [Bou92] also deals with noun phrases, mainly consisting of adjectives and nouns, that can contain prepositions (usually de and à) and hardly any conjugated verbs. He argues that terminological units obey specific rules of syntactic formation. His system does not extract terms only: its output also contains units that are not terms.

In [Dag94a], the extracted noun phrases consist of one or more nouns that do not belong to a stop list. A stop list is also used by [Dam93]. Damerau uses morphological analysis for inflectional normalisation.

Statistical Approaches

The most commonly used statistic is the frequency of occurrence of the potential multiword term [Jus95,Dag94a,Dai94,Eng94,Per91].

[Dai94] investigate further statistical scores, since frequency of occurrence alone would not retrieve infrequent terms. Several scores have been tested, among which are the ones itemized below, where:

(w₁,w₂) is the pair of words,

a is the frequency of occurrence of both w₁ and w₂,

b is the frequency of occurrence of w₁ only,

c is the frequency of occurrence of w₂ only, and

d is the frequency of occurrence of pairs containing neither w₁ nor w₂.

Mutual Information, first defined by [Fan61] and introduced again by [Chu90] as the association ratio, for measuring the degree of cohesiveness of two words

(5.2)

Mutual Information tends to extract frozen compounds from the general language.

φ² coefficient, introduced by [Gal91] for the concordance of parallel texts

(5.3)

Loglike coefficient, introduced by [Dun93]

Diversity, first introduced by [Sha48], and later proposed for the extraction of frozen compounds or collocations.

As no combination of scores improved the results, they used the Loglike criterion.
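Scores of this kind can be computed directly from the four counts a, b, c and d. The sketch below assumes the standard textbook definitions of pointwise mutual information, the φ² coefficient and the log-likelihood ratio over a 2x2 contingency table; these may differ in detail (for instance in normalisation) from the exact forms used in [Dai94]. The function name is hypothetical.

```python
import math

def association_scores(a, b, c, d):
    """Textbook association scores for a word pair from its 2x2 counts.

    a: pairs with both words, b: first word only, c: second word only,
    d: pairs with neither word. Requires a > 0 and non-empty margins.
    """
    n = a + b + c + d

    # Pointwise mutual information: log2 of P(w1,w2) / (P(w1) P(w2)).
    mutual_information = math.log2(a * n / ((a + b) * (a + c)))

    # Phi-squared: squared correlation of the two indicator variables.
    phi2 = (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

    # Log-likelihood ratio (G squared), with 0 * log 0 taken as 0.
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0

    loglike = 2 * (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
                   - xlogx(a + b) - xlogx(a + c)
                   - xlogx(b + d) - xlogx(c + d)
                   + xlogx(n))
    return mutual_information, phi2, loglike

print(association_scores(10, 0, 0, 90))
```

With these definitions all three scores are zero when the two words are independent, and φ² reaches 1 when the words only ever occur together.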

[VanE94] uses statistics for finding the pairs of terms from the source and target language. The translations of the terms of the source language are ranked according to the following measure

(5.4)

where s_l is the source language term, t_l a target language term, and freq_local and freq_global are the local and global frequencies respectively.

[Dam93] uses the difference of two association ratios, one for a corpus consisting of various subdomains and one for a subcorpus of a specific domain.

(5.5)

The probabilities P are estimated by the frequencies normalised by the size of the corpus; t stands for the total corpus and s for the subject sub-corpus.
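The idea of comparing an association ratio in the subject sub-corpus with the same ratio in the total corpus can be illustrated as follows. This is only a sketch under assumptions: the association ratio is taken to be log2 of P(x,y) / (P(x) P(y)), the difference is taken as sub-corpus minus total corpus, and the function names and counts are invented for the example.

```python
import math

def association_ratio(f_xy, f_x, f_y, size):
    """log2 of P(x,y) / (P(x) P(y)), with probabilities estimated as
    frequencies normalised by the corpus size."""
    p_xy, p_x, p_y = f_xy / size, f_x / size, f_y / size
    return math.log2(p_xy / (p_x * p_y))

def domain_score(sub_counts, total_counts):
    """Difference of the two ratios; counts are (f_xy, f_x, f_y, size)."""
    return association_ratio(*sub_counts) - association_ratio(*total_counts)

# Hypothetical counts: the pair is more strongly associated in the sub-corpus.
print(domain_score((8, 10, 10, 1000), (10, 200, 200, 100000)))
```

Under these assumptions a positive score indicates a pair that is more cohesive within the domain than in the corpus as a whole.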

Relevant notions of lexical semantics

Multi-word recognition and compounding can benefit from the use of semantic relations between a head and its modifier(s). It is, nevertheless, an enormous problem to identify the types of semantic relationship that occur, since they have to be inferred. The possible interpretations between constituents may be constrained if we are working in a sublanguage [Fin80,Fin86].

For term interpretation, semantic relations and taxonomic information are of relevance. For term extraction, collocational information has been used. Recently, there has been a move towards term sense disambiguation, which uses techniques close to word clustering (see section 5.1.2, especially taxonomy-based semantic similarity). Automatic term sense disambiguation can be based on the identification of relevant contextual information (see section 5.2.3, Frantzi & Ananiadou). General information about semantic roles from the corpus can be combined with domain-specific information about semantic categories from a specialised thesaurus [May98]. Term sense disambiguation differs from word sense disambiguation in that the majority of technical terms are multiword. The ambiguity of terms is generally not caused by different senses of the individual components of the term, but by different senses of the term as a whole. The different meanings of the term may be linked to different domains, but they may equally be present within a specific domain.

NLP applications using Multiword Recognition/Extraction

LEXTER

LEXTER [Bou92] takes as input a corpus tagged with a part-of-speech tagger. The corpus consists of 1700 texts from the Research Development Division of Electricité de France, with a total of 1,200,000 words. LEXTER works in two stages. During analysis, the maximum-length noun phrases are extracted, taking into consideration potential ``terminological frontiers'' (pronouns, conjunctions, conjugated verbs, etc.). Bourigault suggests that more complicated rules can be added to find the boundaries (e.g. between noun phrases related to the same verb or noun). Some of these rules were suggested intuitively and, after being tested for validity on the corpus, were added to LEXTER.
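The frontier-based analysis stage can be pictured as cutting the tagged text at every frontier word and keeping what lies in between. The sketch below illustrates only that general idea and is not LEXTER's rule set: the tag names and the choice of frontier categories are assumptions, and the example is in English for readability.

```python
from itertools import groupby

# Assumed tags for categories that cannot occur inside a terminological unit.
FRONTIER_TAGS = {"PRON", "CONJ", "VERB", "PUNCT"}

def maximal_chunks(tagged):
    """Split a list of (word, tag) pairs at frontier words and return the
    maximal stretches of text found between them."""
    chunks = []
    for is_frontier, group in groupby(tagged, key=lambda wt: wt[1] in FRONTIER_TAGS):
        if not is_frontier:
            chunks.append(" ".join(word for word, _ in group))
    return chunks

sentence = [("it", "PRON"), ("controls", "VERB"), ("power", "NOUN"),
            ("supply", "NOUN"), ("and", "CONJ"), ("cooling", "NOUN"),
            ("circuit", "NOUN")]
print(maximal_chunks(sentence))
```

Defining what a unit cannot contain, rather than what it must look like, is what distinguishes this style of extraction from the pattern-based approaches described earlier.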

The second stage, parsing, extracts substrings from the noun phrases extracted in the previous stage, as additional likely terminological units. These are extracted according to their position within the maximum-length noun phrases.

[Bou92] argues that a complete syntactic analysis is not necessary and that a surface grammatical analysis is sufficient.

TERMIGHT

TERMIGHT [Dag94b] has been designed as a tool for the extraction of terms for human and machine translation. It consists of a monolingual and a bilingual part.

As a tool, it seems to be more concerned with issues like speed and ease of use. The text is tagged and terms are extracted according to a regular expression and a stop list.

Termight has a high recall. This is partly expected, since there is no threshold on the frequency of occurrence of the candidate terms, and partly surprising, since many terms are expected to contain adjectives, which are not treated in the current version of Termight.

As for the bilingual part, TERMIGHT identifies the candidate translations of a term on the basis of word alignment. The candidate translations for each source term are displayed, sorted according to their frequency as translations of the source term.

Daille et al., 1994

Daille et al. work on English and French corpora [Dai94], each consisting of 200,000 words from the field of telecommunications. Only 2-word terms are considered. They are extracted according to morpho-syntactic criteria that allow variations on terms. All the variations of a term are collected in a list attached to that term. A statistical score (likelihood ratio) is to be applied to the extracted candidate terms as an additional filter.

Justeson and Katz, 1995

No parsing or tagging is used by [Jus95], because of the error rates of such tools. Instead, a lexical database of about 100,000 entries is used to assign a part of speech to each word, after (basic) morphological analysis is applied as needed. When the part of speech is assigned, preference is given to nouns, then adjectives, then prepositions.
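The preference order used when a word has several possible parts of speech amounts to a simple lookup. The following sketch assumes a lexicon that maps each word form to a set of tags; the lexicon contents, the tag names and the function name are hypothetical, and morphological analysis is left out.

```python
# Order of preference when a word is ambiguous: noun, then adjective,
# then preposition (the only categories the pattern needs).
PREFERENCE = ("NOUN", "ADJ", "PREP")

def assign_tag(word, lexicon):
    """Return the preferred tag for a word, or None if the lexicon lists
    none of the three categories for it."""
    candidates = lexicon.get(word.lower(), ())
    for tag in PREFERENCE:
        if tag in candidates:
            return tag
    return None

# Toy lexicon for illustration only.
lexicon = {"light": {"ADJ", "NOUN", "VERB"}, "of": {"PREP"}, "run": {"VERB"}}
print([assign_tag(w, lexicon) for w in ("Light", "of", "run")])
```

Preferring nouns and adjectives in this way makes more word sequences match the noun phrase pattern, which fits the emphasis on recall.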

The approach favours recall over precision, unless a large improvement in precision can be gained with a small loss in recall. This is the case when the preposition is excluded from the regular expression.

Van der Eijk, 1994

Van der Eijk's work is on bilingual terminology (between Dutch and English) [VanE94]. The texts are aligned at sentence level (segments of one or more sentences), and after being tagged, the noun phrases of the form are extracted.

The following tables are created: a table that holds the global frequencies of the target language terms and source language terms, and a table that holds the local frequencies of the target language terms. The candidate translation terms are ranked according to the measure in 5.4, where t_l stands for translation terms and s_l for source terms. The score should be greater than 1 for the target term to be extracted as a translation of the source term. The assumption is that the translated term is likely to be more frequent in the target text segments aligned to the source text segments that contain the source term than in the entire target text.
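The comparison of local and global frequencies can be sketched as below. The sketch assumes that both frequencies are relative frequencies per aligned segment, which is one reading under which a threshold of 1 is meaningful; the data layout, the function name and the Dutch-English toy data are invented for the illustration.

```python
from collections import Counter

def rank_translations(source_term, segments):
    """Rank target terms as candidate translations of source_term.

    segments: list of (source_terms, target_terms) pairs, one pair of
    term sets per aligned segment.
    """
    global_counts = Counter()   # target term -> segments containing it
    local_counts = Counter()    # same, restricted to segments with source_term
    n_local = 0
    for source_terms, target_terms in segments:
        global_counts.update(set(target_terms))
        if source_term in source_terms:
            n_local += 1
            local_counts.update(set(target_terms))
    if n_local == 0:
        return []
    n_global = len(segments)
    scores = {t: (local_counts[t] / n_local) / (global_counts[t] / n_global)
              for t in local_counts}
    # Keep only terms that are relatively more frequent locally than globally.
    kept = [(t, s) for t, s in scores.items() if s > 1]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))

segments = [({"pomp"}, {"pump", "system"}), ({"pomp"}, {"pump"}),
            ({"klep"}, {"valve", "system"}), ({"klep"}, {"valve", "system"})]
print(rank_translations("pomp", segments))
```

In the toy data, system is filtered out for pomp because it is no more frequent in the segments aligned with pomp than in the target text as a whole.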

TERMINO

TERMINO adopts a morphosyntactic approach. The morphological analyser performs stemming and part-of-speech tagging. The syntactic part consists of the parser and the synapsy detector. The parser resolves the remaining lexical ambiguity and gives the syntactic structure. A synapsy is a ``polylexical unit of syntactic origin forming the nucleus of a noun phrase'' ([Dav90]:145). It comprises a noun head that may be preceded by an adjectival phrase and/or followed by an adjectival phrase or prepositional phrase complement.

The synapsy detector consists of two parts. The first part, the synapsy builder, is activated each time a noun phrase is encountered by the parser. At this stage the head of the noun phrase is assigned a syntactic structure. The second part, the synapsy comparator, applies empirical criteria to filter out some of the noise. These criteria include frequency and category, as well as stop lists for the adjectival modifiers and the position of the head.

Term Identification using Contextual Cues-Frantzi & Ananiadou

Until now, the information (linguistic and statistical) used for the extraction of terms was `internal', i.e. coming from the candidate term itself. Here we see how `external' information, derived from the context of the candidate term, can be incorporated. It is embedded in the C-value method for automatic term recognition, in the form of weights constructed from statistical characteristics of the context of the candidate term. The environment of words has previously been used for the construction of thesauri [Gre94]. In that case, words that share the same context are viewed as synonymous. In our case, ``extended word units can be freely modified while multiword terms cannot'' [Sag90]. We therefore say that terms of the same domain share some common context: the form ``shows'' of the verb ``to show'' in medical domains is almost always followed by a term. So, if we know that ``to show'' is such a verb (one that appears with terms), we can increase the likelihood of a string being a term if it appears with this verb.

The verb of the previous example carries information within the medical domain. There are cases where a particular environment that carries such information can be found in more than one domain, like the form ``is called'' of the verb ``to call'', which is often involved in definitions of terms in various domains. Our claim is that context, since it carries such information, should be involved in the procedure for the extraction of terms. We incorporate context information into the approach of Frantzi & Ananiadou [Fra96a] for the extraction of multiword terms in a fully automatic way. The corpus used is tagged. From the tagged corpus, the n-grams matching the following regular expression are extracted: (Noun|Adjective)⁺Noun. The choice of the regular expression affects the precision and recall of the extraction. Our choice is a compromise between the two. For these n-grams, C-value, a statistical measure for the extraction of terms that is based on the frequency of occurrence and is ``sensitive'' to nested terms, is evaluated [Fra96a].

According to [Fra96b], the C-value integrates the parameter of the length of the n-gram. The length was used as a parameter when C-value was applied to the extraction of collocations [Fra96b]. Its weight is weakened as shown in 5.6, where:

a is the examined n-gram,

|a| the length, in terms of number of words, of a,

f(a) the frequency of a in the corpus,

b_i the candidate extracted terms that contain a,

c(a) the number of those candidate terms.

(5.6)

The output of the application of C-value on these n-grams is a list of potential terms. The higher the C-value of an n-gram, the more likely it is to be a term.
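How the quantities |a|, f(a), b_i and c(a) interact can be illustrated with a small computation. The sketch below assumes one commonly published form of the measure, in which the frequency of a nested n-gram is reduced by the average frequency of the candidates that contain it and the result is scaled by log2 |a|. It is an illustration under that assumption, not a transcription of 5.6, and the frequencies are invented.

```python
import math

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs inside the strictly longer
    tuple `longer` as a contiguous subsequence."""
    k = len(shorter)
    return len(longer) > k and any(
        longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

def c_value(freq):
    """freq maps each candidate n-gram (a tuple of words) to its corpus
    frequency f(a); returns a dict of scores."""
    scores = {}
    for a, f_a in freq.items():
        # Frequencies f(b_i) of the candidates that contain a; there are c(a) of them.
        nesting = [f_b for b, f_b in freq.items() if contains(b, a)]
        adjusted = f_a - sum(nesting) / len(nesting) if nesting else f_a
        scores[a] = math.log2(len(a)) * adjusted
    return scores

freq = {("basal", "cell", "carcinoma"): 5, ("cell", "carcinoma"): 8}
for ngram, score in sorted(c_value(freq).items(), key=lambda p: -p[1]):
    print(" ".join(ngram), round(score, 2))
```

In this toy example cell carcinoma is more frequent than basal cell carcinoma but is ranked lower, because five of its eight occurrences are accounted for by the longer candidate.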

From that list, the higher ranked terms are considered for the context evaluation. By context, we mean the verbs, adjectives, nouns, that appear with the candidate term. We attach a weight to those verbs, adjectives, nouns. Three parameters are considered for the evaluation of these weights: the number of candidate terms the word (verb, adjective, noun) appeared with, its frequency as a context word, and its total frequency in the corpus. The above are combined as shown in 5.7, where:

w is the noun/verb/adjective to be assigned a weight,

n the total number of candidate terms considered,

t(w) the number of candidate terms the word w appears with,

ft(w) w's total frequency appearing with candidate terms,

f(w) w's total frequency in the corpus.

(5.7)
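A small computation shows how the three parameters can be turned into a weight for each context word. The combination used here, the average of t(w)/n and ft(w)/f(w), is an assumption chosen so that both ratios lie between 0 and 1; it should not be read as the definitive form of 5.7. The data and names are invented.

```python
from collections import Counter

def context_weights(term_contexts, corpus_freq):
    """term_contexts maps each top-ranked candidate term to the list of
    context words (verbs, adjectives, nouns) seen with it, with repeats.
    corpus_freq maps each word to its total corpus frequency f(w)."""
    n = len(term_contexts)          # number of candidate terms considered
    t = Counter()                   # t(w): number of terms w appears with
    ft = Counter()                  # ft(w): frequency of w as a context word
    for words in term_contexts.values():
        ft.update(words)
        t.update(set(words))
    # Assumed combination: mean of the two ratios.
    return {w: 0.5 * (t[w] / n + ft[w] / corpus_freq[w]) for w in t}

term_contexts = {"basal cell carcinoma": ["shows", "shows", "large"],
                 "cell wall": ["shows"]}
corpus_freq = {"shows": 4, "large": 10}
print(context_weights(term_contexts, corpus_freq))
```

A word such as shows, which appears with every candidate and rarely elsewhere, receives a weight close to 1, while a general word such as large receives a low weight.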

For each of the n-grams of the previously created list, its context words, (verbs, adjectives or nouns) are extracted. These context words have from the previous stage a weight assigned to them (that can be 0 if the word was not met when the weights were assigned). The sum of these weights will give the context weight wei(a) for each n-gram, as shown in 5.8 where C_a is the context of the n-gram a.

(5.8)

The n-grams will be re-ranked as shown in 5.9 where:

a is the examined n-gram,

C-value(a), the previously calculated C-value of a,

wei(a), the context weight for a,

N, the size of the corpus in terms of number of words.

(5.9)
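The re-ranking step combines the C-value of each n-gram with its context weight. The sketch below is one plausible reading and rests on assumptions that are not taken from the text: the context weight is the sum of the weights of the distinct context words plus an offset of 1 (so that n-grams with no known context words keep their C-value), and the product is scaled by 1/log N. All names and figures are illustrative.

```python
import math

def context_weight(context_words, weights, offset=1.0):
    """wei(a): sum of the weights of the distinct context words of a;
    words that were not assigned a weight count as 0."""
    return offset + sum(weights.get(w, 0.0) for w in set(context_words))

def rerank(c_values, contexts, weights, corpus_size):
    """Return (n-gram, score) pairs sorted by the combined score."""
    scale = 1.0 / math.log(corpus_size)
    scored = {a: scale * c * context_weight(contexts.get(a, ()), weights)
              for a, c in c_values.items()}
    return sorted(scored.items(), key=lambda pair: -pair[1])

c_values = {"cell carcinoma": 3.0, "large cell": 3.0}
contexts = {"cell carcinoma": ["shows"], "large cell": ["the"]}
weights = {"shows": 0.875}
print(rerank(c_values, contexts, weights, corpus_size=1000))
```

Two n-grams with the same C-value are separated by their contexts: the one that occurs with a highly weighted context word moves up the list.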

Rank Xerox Multiword lexeme recognition and extraction

To recognize and extract multi-word lexemes (MWLs) we use finite state technology [Kar93,Kar92], which provides an efficient and fast implementation environment.

MWL recognition

Multi-word expressions cannot be properly understood - or even translated - if they are not recognized as complex lexical units. We call such expressions multi-word lexemes (MWL). These include idioms (to rack one's brains over sth), proverbial sayings (birds of a feather flock together), phrasal verbs (to come up with), lexical and grammatical collocations (to make love, with regard to), compounds (on-line dictionary).

Some MWLs always occur in exactly the same form and can therefore be easily recognised by their lexical pattern. This is the case for expressions like footloose and fancy free or out of the blue. However, most MWLs allow different types of variation and modification. To be able to recognize such MWLs in a text, occurrences deviating from the standard or base form of the MWL have to be identified, e.g. different inflections, word orderings and modified uses. For example, in casser sa pipe (to kick the bucket), no plural is possible for the noun, the verb cannot be replaced by its near-synonym briser, nor can the phrase be passivised without losing its idiomatic meaning. Yet, the verb itself can be inflected.

Simple string matching methods are too weak to identify MWLs because most of them are not completely fixed. Besides, the variations they can undergo are, in most cases, lexicographically not well defined. A dictionary entry usually provides the reader with one form of the expression - not necessarily the base or canonical form -, giving no details about allowed variations, except sometimes lexical variants. This type of missing information can be stated with local grammar rules which have more general expressiveness than traditional descriptions.

Local grammar rules describe restrictions of MWLs compared to general rules by implicitly stating allowed variations of the MWL compared to the default case of a completely fixed MWL. In the default case, all restrictions apply, i.e. no variation at all is allowed, and the MWL is represented by the surface form of all lexical components in a fixed order. Violations of standard grammatical rules, e.g. missing constituents or agreement violations, need not be stated explicitly, though if necessary they can be expressed to distinguish the idiomatic from a literal use of the lexical pattern.

To write the local grammar rules we use the two-level formalism IDAREX (IDioms As Regular EXpressions), developed as part of the FSC finite state compiler at Rank Xerox Research Centre. The local grammar rules we write are restricted to cover at most sentence-length patterns. They are formulated as generally as possible, allowing for overgeneration. Although more specific and restrictive rules could be written, this is unnecessary because we assume that there is no ill-formed input. Indeed, it does not matter if the rules allow more variations than the ones that will actually appear in texts, as long as idiomatic and literal uses can be distinguished. For instance, as long as we are not concerned with the semantic representation of MWLs, the local grammar rule for the French expression peser dans la balance accepts semantically correct phrases such as peser lourd dans la balance or peser énormément dans la balance, but also the semantically ill-formed peser *ardemment dans la balance.

More generally, local grammar rules are also useful for syntactic parsing, e.g. by describing complex adverbials such as dates (Le lundi 21 août au matin) or any other expressions that do not follow the general syntax. In many cases the syntactic parser would just fail because it would not be able to analyse properly the multi-word expression embedded in a larger phrase. For instance in German, the general syntax states that a determiner should precede any count noun. This rule is infringed in the MWL von Haus aus (originally).

Regarding the techniques we use, the two-level morphological approach based on finite state technology, together with the IDAREX formalism, has the advantage of providing us with a compact representation. We can define general variables, such as ``any adverb'' (ADV), or more specific morphological variables, such as ``only verbs in the third person singular'' (Vsg3). This relieves the lexicographer of the burden of explicitly listing all the possible forms. Functional variables provide a means to formulate generalizations about patterns that can occur for a whole class of MWLs. Besides, the two levels enable us to express facts either with the surface form or with the lexical form. Therefore, when we want to say that a given form is fixed, we just have to use the surface form without bothering with all the features on the lexical side.
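The distinction between stating a fact on the surface side and on the lexical side can be illustrated without any finite state machinery. The sketch below is not IDAREX syntax and does not use the Xerox tools: it simply assumes that each token is available as a (surface form, lemma) pair and that a pattern says, for each position, which of the two levels must match. The pattern encodes casser sa pipe with a freely inflected verb and a fixed sa pipe.

```python
def matches_mwl(tokens, pattern):
    """tokens: list of (surface, lemma) pairs.
    pattern: list of (level, value) pairs, level being "surface" or "lemma".
    Returns True if the pattern matches a contiguous stretch of tokens."""
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all((surface if level == "surface" else lemma) == value
               for (surface, lemma), (level, value) in zip(window, pattern)):
            return True
    return False

# The verb may be inflected (lexical level); the rest is fixed (surface level).
CASSER_SA_PIPE = [("lemma", "casser"), ("surface", "sa"), ("surface", "pipe")]

idiomatic = [("il", "il"), ("cassa", "casser"), ("sa", "son"), ("pipe", "pipe")]
literal = [("ils", "il"), ("cassent", "casser"), ("leurs", "leur"), ("pipes", "pipe")]
print(matches_mwl(idiomatic, CASSER_SA_PIPE), matches_mwl(literal, CASSER_SA_PIPE))
```

A real local grammar would also allow inserted material such as adverbs; the sketch only shows how fixing the surface form of some components and the lemma of others separates the idiomatic pattern from a literal use.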

In this technology, operations like addition, intersection, subtraction and composition are allowed on the networks generated from regular expressions. Although we have not used this possibility in our work on local grammars yet, it is very powerful. For instance, if we are concerned about the semantics of a MWL and want to be more restrictive with the rules, we can build new regular expressions and subtract the resulting networks from the one we already built. Such additional regular expressions would, for example, express facts about the compatibility of semantic classes of adjectives and nouns.

MWL extraction

Much of the terminology found in a corpus is composed of noun phrases. One extension of our NLP suite is a noun phrase extraction step which can follow part-of-speech tagging [Sch96]. In order to perform this step, transducers have been compiled from finite-state expressions which are basically grammar rules describing the contour and patterns of noun phrases for each language for which a lexicon and tagger are made. The patterns can include surface forms as well as part-of-speech tags. When these transducers are applied to tagged text, noun phrase boundaries are inserted. For example, consider the Dutch text:

De reparatie- en afstelprocedures zijn bedoeld ter ondersteuning voor zowel de volledig gediplomeerde monteur als de monteur met minder ervaring. (The repair and adjustment procedures are meant to aid the fitter who has completed his degree work as well as the less experienced fitter.)

After part-of-speech tagging, the noun phrase transducers will recognize and isolate the following noun phrases: reparatie- en afstelprocedures, ondersteuning, volledig gediplomeerde monteur, monteur and ervaring. The current noun phrase mark-up was designed basically for terminology extraction from technical manuals. It covers relatively simple noun phrase detection, i.e. some constructions such as relative clauses are not included.

Because one can easily add a new regular expression to handle more constructions, more elaborate patterns including verbs can be extracted. The same automatic means have been used to extract collocations from corpora, in particular, support verbs for nominalizations. In English, an example of proper support verb choice is one makes a declaration and not one does a declaration. Make is said to support the nominalization declaration which carries the semantic weight of the phrase. We used NLP suites followed by syntactic pattern matching slightly more complicated than the noun phrase extractors of the previous section, in order to extract verbal categorization patterns for around 100 nominalizations of communication verbs in English and French [Gre96].

Similar approaches are used to identify more specific items such as dates and proper names. They use a combination of regular expressions as described above and specific lexical resources including, for instance, semantic information.

Related Areas and Techniques

Term recognition is relevant to machine translation (for all types of MT), text summarization and classification, monolingual and multilingual information retrieval, and information extraction. Related techniques are word clustering and word sense disambiguation.


EAGLES Central Secretariat eagles@ilc.cnr.it