
   
Word Clustering

   
Introduction

Word clustering is a technique for partitioning sets of words into subsets of semantically similar words; it is increasingly used in NLP tasks ranging from word sense and structural disambiguation to information retrieval and filtering.

In the literature, two main types of similarity have been used, which can be characterised as follows:

1.
paradigmatic, or substitutional, similarity: two words that are paradigmatically similar may be substituted for one another in a particular context. For example, in the context I read the book, the word book can be replaced by magazine with no violation of the semantic well-formedness of the sentence, and therefore the two words can be said to be paradigmatically similar;
2.
syntagmatic similarity: two words that are syntagmatically similar significantly occur together in text. For instance, cut and knife are syntagmatically similar since they typically co-occur within the same context.
Hereafter we will refer to type 1. similarity as ``semantic similarity'' tout court while referring to type 2. similarity more loosely as ``semantic relatedness''. As we will see, the two aspects are obviously interconnected (and in fact they are often not kept apart in the literature) although we find it useful for classificatory purposes to maintain the distinction.

Both types of similarity, computed through different methods, are used in the framework of a wide range of NLP applications.

   
Survey of Approaches

The methods and techniques for word clustering that we will consider here will be grouped according to the kind of similarity they take into account:

   
Semantic similarity

Typically, semantic similarity is computed either on the basis of taxonomical relationships such as hyperonymy and synonymy or through distributional evidence. The former approach presupposes prior availability of independent hierarchically structured sources of lexico-semantic information such as WordNet [Mil90a]. With the latter approach, the semantic similarity between two words W1 and W2 is computed on the basis of the extent to which their typical contexts of use overlap. The notion of taxonomy-based semantic similarity crucially hinges on words' membership of more general semantic classes. By contrast, distributionally-based semantic similarity rests on the assumption that words entering into the same syntagmatic relation with other words can be seen as semantically similar, although in this case the similarity may be grounded on some covert properties orthogonal to their general semantic superclass.

Taxonomy-based semantic similarity

In this section, methods which assess semantic similarity with respect to hierarchically structured lexical resources will be discussed.

In Rada et al. [Rad89] and Lee et al. [Lee93], the assessment of semantic similarity is done with respect to hyperonymy (IS-A) links. Under this approach, semantic similarity is evaluated in terms of the distance between the nodes corresponding to the items being compared in a taxonomy: the shorter the path from one node to another, the more similar they are. Given multiple paths, the shortest path is taken as involving a stronger similarity.
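To make the idea concrete, here is a minimal, self-contained sketch of path-length similarity over a toy IS-A hierarchy (the taxonomy, the function names and the 1/(1 + length) scoring convention are our own illustrative choices, not taken from [Rad89]): the fewer links separating two nodes, the higher the score.

from collections import deque

# Toy IS-A taxonomy: each entry maps a word/concept to its hypernym (parent).
is_a = {
    "clerk": "employee",
    "salesperson": "employee",
    "employee": "worker",
    "worker": "person",
    "person": "organism",
    "dog": "animal",
    "animal": "organism",
}

def neighbours(node):
    """Treat IS-A links as undirected edges for path-length purposes."""
    nbrs = set()
    if node in is_a:
        nbrs.add(is_a[node])
    nbrs.update(child for child, hyper in is_a.items() if hyper == node)
    return nbrs

def path_length(w1, w2):
    """Breadth-first search for the shortest path (in links) between two nodes."""
    frontier, seen = deque([(w1, 0)]), {w1}
    while frontier:
        node, dist = frontier.popleft()
        if node == w2:
            return dist
        for nbr in neighbours(node):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return None  # the two nodes are not connected

def path_similarity(w1, w2):
    """Shorter path => higher similarity; 1 / (1 + length) is one common convention."""
    length = path_length(w1, w2)
    return 0.0 if length is None else 1.0 / (1.0 + length)

print(path_similarity("clerk", "salesperson"))  # short path, high similarity
print(path_similarity("clerk", "dog"))          # longer path, lower similarity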

Yet, several problems are widely acknowledged with this approach. First, link types other than hyperonymy can usefully be exploited to establish semantic similarity. For instance, Nagao [Nag92] uses both hyperonymy and synonymy links to compute semantic similarity, where a higher similarity score is associated with synonymy; with hyperonymy, similarity is calculated along the same lines as Rada et al. [Rad89], i.e. on the basis of the length of the path connecting the compared words. Other scholars have attempted to widen the range of relationships on the basis of which semantic similarity can be computed; see, among others, [Nir93b], who also use morphological information and antonyms.

Another problem usually raised in connection with the path-length similarity method concerns the underlying assumption that links in a taxonomy represent uniform distances, which is not always the case. In real taxonomies, the ``distance'' covered by individual taxonomic links varies, due to the fact that, for instance, certain sub-taxonomies are much denser than others. To overcome the problem of varying link distances, alternative ways to evaluate semantic similarity in a taxonomy have been put forward.

Agirre and Rigau [Agi96] propose a semantic similarity measure which, besides the length of the path, is sensitive to the depth of the nodes in the hierarchy (concepts lower down in the hierarchy count as closer) and to the density of concepts in the subhierarchy under comparison (concepts in a denser region count as relatively closer than those in a sparser one). This similarity measure is referred to as the ``conceptual density'' measure.

Resnik [Res95] defines a taxonomic similarity measure which dispenses with the path length approach and is based on the notion of information content. Under his view, semantic similarity between two words is represented by the entropy value of the most informative concept subsuming the two in a semantic taxonomy, the WordNet lexical database (see §3.4.2) in the case at hand. For example, (all senses of) the nouns clerk and salesperson in WordNet are connected to the first sense of the nouns employee, worker, person so as to indicate that clerk and salesperson are a kind of employee which is a kind of worker which in turn is a kind of person. In this case, the semantic similarity between the words clerk and salesperson would correspond to the entropy value of employee which is the most informative (i.e. most specific) concept shared by the two words.

The information content (or entropy) of a concept c -- which in WordNet corresponds to a set of synonymous word senses such as fire_v_4, dismiss_v_4, terminate_v_4, sack_v_2 -- is formally defined as $IC(c) = -\log p(c)$. The probability of a concept c is obtained for each choice of text corpus or corpora collection K by dividing the frequency of c in K by the total number $N_p$ of words observed in K which have the same part of speech p as the word senses in c:

$p(c) = \frac{freq_K(c)}{N_p}$

The frequency of a concept is calculated by counting the occurrences of all words which are potential instances of (i.e. subsumed by) the concept; these include words which have the same orthography and part of speech as the synonyms defining the concept itself or any of its subordinates. Equivalently, each time a word $W_p$ is encountered in K, the count of each concept $c_p$ subsuming $W_p$ (in any of its senses) is increased by one:

$freq_K(c_p) = \sum_{W_p \in words(c_p)} count_K(W_p)$

where $words(c_p)$ is the set of words subsumed by $c_p$. The semantic similarity between two words $W1_p$, $W2_p$ is expressed as the entropy value of the most informative concept $c_p$ which subsumes both $W1_p$ and $W2_p$, as shown below:

$sim(W1_p, W2_p) = \max_{c_p \in S(W1_p, W2_p)} \left[ -\log p(c_p) \right]$

where $S(W1_p, W2_p)$ is the set of concepts subsuming both $W1_p$ and $W2_p$.

The specific senses of W1p, W2p under which semantic similarity holds are determined with respect to the subsumption relation linking cp with W1p, W2p. Suppose for example that in calculating the semantic similarity of the two verbs fire, dismiss using the WordNet lexical database we find that the most informative subsuming concept is represented by the synonym set containing the word sense remove_v_2. We will then know that the senses of fire, dismiss under which the similarity holds are fire_v_4 and dismiss_v_4, as these are the only instances of the verbs fire and dismiss subsumed by remove_v_2 in the WordNet hierarchy.

Such an approach, which combines a taxonomic structure with empirical probability estimates, provides a way of tailoring a static knowledge structure such as a semantic taxonomy to specific texts.
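The sketch below illustrates the computation described above on a toy WordNet-like fragment (the taxonomy, the counts and all names are invented for illustration; a real implementation would draw concepts from WordNet and frequencies from a corpus K): word counts are propagated up to every subsuming concept, converted into probabilities and information content, and the similarity of two words is the information content of their most informative common subsumer.

import math

# Toy noun taxonomy: concept -> parent concept (None marks the root).
parent = {
    "person": None,
    "worker": "person",
    "employee": "worker",
    "clerk": "employee",
    "salesperson": "employee",
}

# Invented corpus counts for words lexicalising each concept.
word_counts = {"clerk": 40, "salesperson": 25, "employee": 100, "worker": 150, "person": 500}

def subsumers(c):
    """The concept itself plus all of its superordinates."""
    chain = []
    while c is not None:
        chain.append(c)
        c = parent[c]
    return chain

# freq(c): an occurrence of a word also counts for every concept subsuming it.
freq = {c: 0 for c in parent}
for word, n in word_counts.items():
    for c in subsumers(word):
        freq[c] += n

total = sum(word_counts.values())                     # N_p: all observed nouns
ic = {c: -math.log(freq[c] / total) for c in parent}  # information content -log p(c)

def resnik_similarity(w1, w2):
    """Information content of the most informative concept subsuming both words."""
    common = set(subsumers(w1)) & set(subsumers(w2))
    return max(ic[c] for c in common)

print(resnik_similarity("clerk", "salesperson"))  # IC of 'employee'
print(resnik_similarity("clerk", "person"))       # IC of 'person' (the root, so 0.0)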

Distributionally-based semantic similarity

The thrust of distributionally-based semantic similarity rests on the idea that the semantic content of a target word T can to a large extent be characterised in terms of (a description of) the words T co-occurs with in a corpus. Two target words are then considered to be semantically similar if they co-occur with a critical set of common words. Approaches differ in terms of i) how the notion of distribution is formally characterised and ii) what distance metric is adopted to assess the proximity of two distributions.

In the work reported by [Bro91], each target word Ti is characterised by the words that immediately follow it in a text. More formally, the context of Ti is denoted by a vector C(Ti) defined as follows:

$C(T_i) = \langle c_{i1}, c_{i2}, \ldots, c_{iV} \rangle$

where $c_{ij}$ counts how often wj follows Ti in a reference corpus and V is the size of the vocabulary. For any two such vectors C(Ti) and C(Tk), [Bro91] define a distance metric measuring their pairwise distance in terms of (minimal loss of) Average Mutual Information (AMI), where AMI is the value obtained by averaging the mutual information of individual word pairs. The idea at the basis of their clustering method is to find groups in which the loss of average mutual information is small. In general, the loss is smaller when the members of the group have similar vectors.
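A minimal sketch of this representation is given below (toy corpus; for readability we compare the vectors with cosine similarity, used here merely as a stand-in for the AMI-loss criterion actually employed by [Bro91]):

import math
from collections import Counter, defaultdict

corpus = ("the clerk sold the book . the salesperson sold the magazine . "
          "the dog chased the cat .").split()

# C(T_i): for each target word, counts of the words immediately following it.
context = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    context[current][following] += 1

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

# Words with similar following-word distributions would end up in the same cluster.
print(cosine(context["clerk"], context["salesperson"]))  # both are followed by 'sold'
print(cosine(context["clerk"], context["dog"]))          # no overlap in following words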

In related work, words are clustered not in terms of frequency counts for the word immediately following T, but rather by looking at a text window of about one thousand words surrounding T. The problem with this and similar approaches is that they are considerably greedy without being completely reliable. Since it is hard to know a priori which words to pull out of the context of a target word in order to best characterise its sense contextually, blind algorithms are deployed that take into account the largest possible environment around T. Even so, such approaches can fail to take into account all relevant contextual evidence, since [Gal92a] show that relevant words may be found as far away as 1000 words from the target word.

An attempt to characterise word usage on a sounder linguistic basis is reported by [Per92], who work on examples of the verb-object relation found in a corpus. The vector associated with a target noun nk to characterise its sense contains the (conditional) distribution of the verbs for which it serves as direct object in the corpus. The metric used in this study is relative entropy, and the resulting classes are represented as a tree structure of groups.
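The sketch below shows the core ingredients of such an approach on toy data (the smoothing choice is ours, and the clustering step proper is omitted): for each noun, the conditional distribution of the verbs taking it as direct object is estimated from (verb, object) pairs, and two nouns are compared via relative entropy (Kullback-Leibler divergence).

import math
from collections import Counter, defaultdict

# Toy (verb, direct object) pairs as they might be extracted from a parsed corpus.
pairs = [("read", "book"), ("read", "magazine"), ("buy", "book"), ("buy", "magazine"),
         ("read", "book"), ("eat", "apple"), ("buy", "apple"), ("eat", "apple")]

verbs = {v for v, _ in pairs}
verbs_for = defaultdict(Counter)
for verb, noun in pairs:
    verbs_for[noun][verb] += 1

def conditional(noun, smoothing=1e-6):
    """P(verb | noun), lightly smoothed so that the divergence stays finite."""
    counts = verbs_for[noun]
    total = sum(counts.values()) + smoothing * len(verbs)
    return {v: (counts[v] + smoothing) / total for v in verbs}

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) between two verb distributions."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)

# Nouns taking similar sets of verbs have a low divergence.
print(relative_entropy(conditional("book"), conditional("magazine")))  # small
print(relative_entropy(conditional("book"), conditional("apple")))     # larger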

A further development of this approach is reported in Grefenstette, who characterises the target word in terms of more than one syntactic relation, including subject, object, indirect-object and modifier. For each target noun Ti the resulting representation is a vector C(Ti) of counts for every other word wj and each of the four possible syntactic relations between Ti and wj. [Sek92] apply a simplified version of the same technique to triples with the following structure:

[Headword, MODIFICATION, Modifier]

The triples are automatically extracted from a shallow-parsed corpus. Accordingly, two target words are taken to be semantically close if they occur as modifiers of the same headword. The authors, unfortunately, do not use this technique to construct word classes.
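As an illustration of the underlying idea (the data and the Dice coefficient are our own choices, not taken from [Sek92]), the sketch below indexes modifiers by the headwords they modify and scores two modifiers by the overlap of their headword sets:

from collections import defaultdict

# Toy [Headword, MODIFICATION, Modifier] triples from a shallow-parsed corpus.
triples = [("engine", "MOD", "diesel"), ("engine", "MOD", "petrol"),
           ("car", "MOD", "diesel"), ("car", "MOD", "petrol"),
           ("car", "MOD", "red"), ("wine", "MOD", "red")]

headwords_of = defaultdict(set)
for head, _, modifier in triples:
    headwords_of[modifier].add(head)

def dice(mod1, mod2):
    """Dice coefficient over the sets of headwords the two modifiers attach to."""
    h1, h2 = headwords_of[mod1], headwords_of[mod2]
    if not h1 or not h2:
        return 0.0
    return 2 * len(h1 & h2) / (len(h1) + len(h2))

print(dice("diesel", "petrol"))  # modify the same headwords: very close
print(dice("diesel", "red"))     # share only 'car': less close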

[Fed96,Fed97] make the same assumption as [Sek92] but apply it to examples of verb-object and verb-subject relations. Distributionally-based semantic similarity is computed on both terms of these relations. Thus, verbs are taken to be semantically similar if they share some words serving as direct object or subject. By the same token, nouns are semantically similar if they serve as subject and/or object for a shared subset of verbs. A notion of semantic proportional analogy is developed which rests on both types of word clustering: this notion gives, for a given target word in context, its closest semantic analogue(s). Unlike all approaches considered so far, whereby word clusters are defined once and for all by averaging out counts for words in context, semantic proportional analogy varies depending on the context. Thus, the same word T selects a different closest analogue depending on whether T is considered in the company of the object Oi or the object Oj. This technique is reported to be robust enough to overcome both noisy and sparse data problems.

   
Semantic relatedness

As we saw above, two words can be considered to be semantically related if they typically co-occur within the same context. In the section on distributionally-based semantic similarity, we showed that word co-occurrence patterns are instrumental in identifying clusterings of semantically similar words. In this section, word co-occurrence patterns will be considered in their own right, together with their use for the resolution of a number of NLP problems.

In the literature, various scholars have suggested that ambiguity resolution (e.g. prepositional phrase attachment) is strongly influenced by the lexical content of specific words ([For82,Mar80], to mention only a few). Yet, for this assumption to be tested in practice, the necessary information about lexical associations first had to be acquired. Recently, an upsurge of interest in corpus-based studies has led a number of scholars to collect lexical co-occurrence data from large textual corpora; among them - to mention only a few - [Chu89,Cal90,Hin93,Sma93,Rib94,Tsu92]. The collected patterns were successfully used for different disambiguation purposes. For instance, [Hin93] show how structural ambiguities such as prepositional phrase attachment can be resolved on the basis of the relative strength of association of words, estimated from their distribution in an automatically parsed corpus. Similar results are reported in [Tsu92].
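A much simplified sketch of this kind of disambiguation is given below (the counts are invented and the association score is a crude smoothed log probability, not the estimator actually used by [Hin93]): a prepositional phrase is attached to whichever of the verb or the noun is more strongly associated with the preposition in a parsed training corpus.

import math
from collections import Counter

# Invented counts of how often a word is modified by a PP headed by a preposition,
# together with the total number of occurrences of each word.
prep_counts = Counter({("eat", "with"): 30, ("pizza", "with"): 5,
                       ("see", "with"): 20, ("telescope", "with"): 0})
word_counts = Counter({"eat": 200, "pizza": 100, "see": 300, "telescope": 20})

def association(word, prep, smoothing=0.5):
    """Smoothed log probability that 'word' is modified by a PP headed by 'prep'."""
    return math.log((prep_counts[(word, prep)] + smoothing) / (word_counts[word] + smoothing))

def attach(verb, noun, prep):
    """Attach the PP to whichever of verb / noun is more strongly associated with prep."""
    return "verbal" if association(verb, prep) > association(noun, prep) else "nominal"

print(attach("eat", "pizza", "with"))      # 'eat ... with' wins on these counts: verbal
print(attach("see", "telescope", "with"))  # 'see ... with' wins on these counts: verbal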

   
Relevant notions of lexical semantics

All types of word clustering reviewed so far have been widely studied from the theoretical point of view in the field of lexical semantics where it is commonly assumed that the semantic properties of a lexical item are fully reflected in the relations it contracts in actual and potential linguistic contexts, namely on the syntagmatic and paradigmatic axes.

Sense relations are of two fundamental types: paradigmatic and syntagmatic. Paradigmatic relations such as synonymy, hyperonymy/hyponymy, antonymy or meronymy occupy focal positions in discussions of lexical semantics (see, for instance, [Cru86]). They reflect the way reality is categorised, subcategorised and graded along specific dimensions of lexical variation. By contrast, syntagmatic aspects of lexical meaning form a less prominent topic of lexical semantics which in the literature is generally referred to as co-occurrence restrictions. Two different kinds of co-occurrence restrictions are commonly distinguished: selection restrictions, which are defined in very general terms as ``any of various semantic constraints reflected in the ability of lexical items to combine in syntactic structures'' [Tra94]; collocational restrictions, which are ``unusually idiosyncratic or language-specific'' (ibidem) restrictions.

Nowadays, a new trend in the lexical semantics literature is emerging which questions the watertight classification of meanings presupposed by traditional approaches to lexical semantics. Under this view, word senses are defined in terms of clear, typical, central cases (called ``prototypes'') rather than in terms of their boundaries which very often appear as unclear and undefinable (see [Cru94]).

   
NLP applications using word clustering techniques

The various word clustering techniques described so far have a large number of potentially important applications, including word sense and structural disambiguation (e.g. prepositional phrase attachment) as well as information retrieval and filtering.


