
Proper Noun Recognition and Classification

Introduction

Recognising and classifying proper nouns involves identifying which strings in a text name individuals and which classes these individuals fall into. Typical name classes include organisations, persons, locations, dates and monetary amounts; further classes can include book and movie titles, product names, restaurant and hotel names, ship names, etc. The task is made difficult by several factors: the unpredictable length of names (company names can be twelve or more words long); ambiguity between name classes (Ford can be a company, a person, or a location); embedding, where, for example, a location name occurs within an organisation name; variant forms of the same name; and the unreliability of capitalisation as a cue, e.g. in English headlines, and throughout German, where all nouns are capitalised.

Being able to recognise and classify proper names correctly clearly has relevance for a number of application areas:

precision in information retrieval (IR) systems should increase if multiword names are treated as unitary terms and if variant forms can be linked;
information extraction (IE) systems rely heavily on proper noun (PN) recognition and classification, as the MUC-6 results have shown [MUC6].
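
The first point can be illustrated with a minimal sketch, assuming a toy normalisation scheme: variant forms of a multiword name are mapped to a single index term by dropping punctuation and corporate designators. The function name `canonical_term` and the designator list are hypothetical, invented here for illustration, and are not taken from any system discussed in this section.

```python
# Hypothetical corporate designators, invented for illustration only.
DESIGNATORS = {"ltd", "inc", "corp", "co", "plc"}


def canonical_term(name):
    """Map a name to one index term, dropping designators and punctuation."""
    # Lowercase each word and strip trailing full stops and commas.
    words = [w.strip(".,").lower() for w in name.split()]
    # Keep only the words that are not designators.
    core = [w for w in words if w and w not in DESIGNATORS]
    # Join the remaining words so the name is indexed as a single term.
    return "_".join(core)
```

Under these assumptions, "Ford Motor Co." and "Ford Motor" receive the same index term. The sketch does not link shortened variants such as "Ford" to "Ford Motor"; that requires evidence beyond string normalisation.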

Survey of Approaches

The MUC-6 and MUC-7 Named Entity (NE) Task and Multilingual Entity Task (MET) alone have stimulated the creation of more than thirty name recognition systems for a variety of languages (English, Spanish, Chinese, Japanese and Thai), and doubtless many more systems exist as well. The approaches adopted in these systems range from the more or less purely statistical (e.g. the use of Hidden Markov or Maximum Entropy models, BBN) to the more or less purely rule-based (Sheff/Mitre), as well as a number of hybrid approaches which mix statistical with rule-based techniques (LTG/NYU). All systems, however, need to make use of two types of evidence, which McDonald calls name internal evidence and name external evidence. Name internal evidence comes from the name string itself, for example a designator such as "Ltd." or a title such as "Mr."; name external evidence comes from the context in which the name occurs.
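
The two types of evidence can be sketched as follows, taking internal evidence to be cues inside the name string and external evidence to be cues in the surrounding context. This is a minimal rule-based illustration, not the method of any of the systems named above; the cue lists, the function name `classify` and the choice to consult internal evidence first are all assumptions made for the example.

```python
import re

# Hypothetical cue lists, invented for illustration only.
INTERNAL_CUES = {
    "organisation": {"Ltd", "Inc", "Corp", "plc"},
    "person": {"Mr", "Mrs", "Dr"},
}
EXTERNAL_CUES = {
    "organisation": {"shares", "subsidiary"},
    "person": {"said", "chairman"},
    "location": {"near", "located"},
}


def classify(name, left_context, right_context):
    """Return (label, evidence_kind) for a candidate name string."""
    # Internal evidence: cue words that occur inside the name itself.
    name_tokens = set(re.findall(r"\w+", name))
    for label, cues in INTERNAL_CUES.items():
        if name_tokens & cues:
            return label, "internal"
    # External evidence: cue words in the text on either side of the name.
    context = set(re.findall(r"\w+", f"{left_context} {right_context}".lower()))
    for label, cues in EXTERNAL_CUES.items():
        if context & cues:
            return label, "external"
    return "unknown", "none"
```

With these toy cues, the ambiguous string "Ford" is labelled an organisation in "shares in Ford rose", a person in "Ford said", and a location in "he lives near Ford", while "Ford Motor Co Ltd" is labelled from internal evidence alone.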

Relevant Notions of Lexical Semantics

Most PN recognition systems use lexical resources such as gazetteers (lists of names), but these are necessarily incomplete, as new names are constantly coming into existence. Further techniques must therefore be used, including syntactic analysis and semantic classification based on verb and preposition complement roles.
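
A minimal sketch of this fallback strategy, under simplifying assumptions: a candidate name is first looked up in a gazetteer, and if it is missing, a class is guessed from the word that governs it. Here the immediately preceding token stands in for real syntactic analysis, and candidates are found by capitalisation, which the introduction notes is an unreliable cue. The tables `GAZETTEER` and `COMPLEMENT_ROLE` and the function `tag_names` are hypothetical and invented for illustration.

```python
# Hypothetical toy resources, invented for illustration only.
GAZETTEER = {
    ("New", "York"): "location",
    ("Ford",): "organisation",
}
# Class suggested for the complement of a preposition or the object of a verb.
COMPLEMENT_ROLE = {
    "in": "location",
    "acquired": "organisation",
}


def tag_names(tokens):
    """Label capitalised token runs via the gazetteer, then via the preceding word."""
    results = []
    i = 0
    while i < len(tokens):
        if not tokens[i][:1].isupper():
            i += 1
            continue
        # Extend the candidate over the whole run of capitalised tokens.
        j = i
        while j < len(tokens) and tokens[j][:1].isupper():
            j += 1
        candidate = tuple(tokens[i:j])
        label = GAZETTEER.get(candidate)
        # Fallback: the preceding token is a crude stand-in for the governing word.
        if label is None and i > 0:
            label = COMPLEMENT_ROLE.get(tokens[i - 1])
        results.append((" ".join(candidate), label or "unknown"))
        i = j
    return results
```

For the token sequence "shares of Ford rose in Dagenham", "Ford" is found in the gazetteer, while "Dagenham" is absent and is labelled a location only because it is the complement of "in".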

EAGLES Central Secretariat eagles@ilc.cnr.it