Next: Text Summarization Up: Areas of Application Previous: Information Retrieval


Information Extraction


Information extraction (IE) systems extract certain predefined information about entities and relationships between entities from a natural language text and place this information into a structured database record or template. For example, an IE system might scan business newswire texts for announcements of management succession events (retirements, appointments, promotions, etc.), extract the names of the participating companies and individuals, the post involved, the vacancy reason, and so on.
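As an illustration of what such a template might look like, here is a minimal sketch of a structured record for the management succession example. The class name, slot names and sample values are assumptions made for this sketch; they are not taken from any particular IE system or from the MUC template definitions.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class SuccessionEvent:
    """Hypothetical template for one management succession event."""
    company: Optional[str] = None
    person: Optional[str] = None
    post: Optional[str] = None
    vacancy_reason: Optional[str] = None


def unfilled_slots(event: SuccessionEvent) -> list:
    """Return the names of the slots the system could not fill."""
    return [f.name for f in fields(event) if getattr(event, f.name) is None]


# Invented example: a template as it might be filled from the sentence
# "John Smith will retire as chairman of Acme Widgets."
event = SuccessionEvent(
    company="Acme Widgets",
    person="John Smith",
    post="chairman",
    vacancy_reason="retirement",
)
record = asdict(event)  # plain dict, ready to store as a database row

partial = SuccessionEvent(person="Jane Doe")
missing = unfilled_slots(partial)
```

The point of the fixed slot inventory is that the system only looks for the predefined information; anything else in the text is ignored, and slots for which no evidence is found simply remain empty.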

In IE there is an open question concerning the utility of generic lexical resources (e.g. WordNet, discussed in §3.4.2) versus application-dependent resources. To date, limited application-specific resources have proved more effective in applied systems: most IE systems have been designed to work in very restricted domains (e.g. in the MUC competitions), and generic resources tend to be too general to work well in narrow, tightly defined technical domains. Application-specific resources are typically hand-crafted by system developers for the domain. However, there is an increasing requirement to move IE systems to new domains rapidly and to allow end users to configure IE systems for their own purposes without the support of experts. The use (or re-use) of extensive existing resources could provide one solution to these problems, provided the resources are adequate to the task.


The roots of IE can be found in work by N. Sager on information formats [Sag81] and in the work of the Yale school on scripts and story comprehension which, from an IE perspective, culminated in DeJong's FRUMP system [Dej82]. A number of commercial news skimming systems emerged in the 1980s (e.g. the Jasper [1] and Scisor [Jac90a] systems), but it was really with the advent of the DARPA Message Understanding Conferences (MUCs), starting in 1987, that attention focused on IE as an independent and promising NL application area [MUC3,MUC4,MUC5,MUC6,MUC7]. Currently, some of the IE systems developed through the DARPA programme are in use by the US government, and component technology from the MUC evaluations, specifically named entity recognition software, has become commercially available (see §5.4). A number of European research projects are currently attempting to develop and deploy IE systems in a variety of domains (e.g. the FACILE, AVENTINUS, ECRAN and TREE projects - see [LE98]).

Approaches in the late 1980s and early 1990s attempted to use sophisticated techniques grounded in computational linguistic theory, such as full parsing, translation to logical form, and abductive theorem proving for discourse interpretation (e.g. in MUC-3 SRI's TACITUS system, NYU's Proteus system). Starting with SRI's FASTUS system in MUC-4, however, there has been a significant shift away from deep analysis systems towards systems that do shallower analysis - systems that distinguish information extraction from full text understanding and aim only for the former. These systems (see [MUC6] for examples) are characterised by the use of finite-state phrase finding techniques and, in many cases, direct mapping from surface form to template without any intermediate `semantic' representation. Such systems have been quite successful.
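The shallow approach can be sketched in a few lines. The following is a toy illustration only, assuming two invented surface patterns written as regular expressions (which are finite-state); it is not the design of FASTUS or of any MUC system, and the phrase definitions, slot names and sample text are all hypothetical.

```python
import re

# Hypothetical finite-state phrase definitions (deliberately crude).
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"            # two or more capitalised words
POST = r"[a-z]+(?: [a-z]+)*"                      # run of lowercase words
COMPANY = r"[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*"   # one or more capitalised words

# Each surface pattern maps directly to template slots via named groups;
# no parse tree or logical form is built in between.
PATTERNS = [
    (re.compile(rf"(?P<person>{NAME}) was appointed (?P<post>{POST}) of (?P<company>{COMPANY})"),
     "appointment"),
    (re.compile(rf"(?P<person>{NAME}) retired as (?P<post>{POST}) of (?P<company>{COMPANY})"),
     "retirement"),
]


def extract(text):
    """Return one filled template (a dict) per pattern match in the text."""
    templates = []
    for pattern, event_type in PATTERNS:
        for match in pattern.finditer(text):
            template = match.groupdict()
            template["event"] = event_type
            templates.append(template)
    return templates


sample = ("Jane Doe was appointed chief executive of Acme Widgets. "
          "John Smith retired as chairman of Acme Widgets last week.")
templates = extract(sample)
```

Text that matches no pattern is simply skipped, which is what separates this style of extraction from full text understanding. The cost is that every new way of phrasing an event needs its own pattern.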

For more extensive reviews of information extraction see [Cow96,Gai98].

Role of Lexical Semantics

The main sorts of lexical information required by an IE system are:

Much of this information is not provided by WordNet-like resources, or at least not at the appropriate degree of specificity (and indeed virtually no use is made of existing general lexical semantic resources by any of the MUC systems). Tools for adapting generic lexicons to the actual needs of specific domains are necessary, as application domains often show deviations with respect to the normal use of language in terms of:

Using generic resources to analyse texts in restricted domains also raises the problem of how the domain description available to, or needed by, the system (i.e. the knowledge base used to reason about the extracted information) relates to the generic lexical semantic definitions supplied by those resources. Partial overlaps can be found, but the domain-specific description is likely to be more precisely defined and more reliable.

As a result of these difficulties with existing generic resources, IE system builders have tended to handcraft resources for each application domain, or have looked at techniques for automatically or semi-automatically constructing lexicons of various sorts from (possibly annotated) texts in the domain. See, e.g. [Ril93a] and [Kru95].
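To give a flavour of what semi-automatic lexicon construction from annotated text can involve, the sketch below counts which word immediately precedes each annotated slot filler and proposes the frequent ones as candidate trigger words for a human to review. This is a simplified heuristic chosen for illustration; it is not the method of [Ril93a] or [Kru95], and the inline markup scheme, function name and three-sentence corpus are assumptions.

```python
import re
from collections import Counter

# Assumed annotation scheme: slot fillers marked inline as <slot>filler</slot>.
TAGGED = re.compile(r"<(?P<slot>\w+)>(?P<filler>.*?)</(?P=slot)>")
MARKUP = re.compile(r"</?\w+>")


def candidate_triggers(annotated_sentences):
    """Count (preceding word, slot) pairs over an annotated corpus."""
    counts = Counter()
    for sentence in annotated_sentences:
        for match in TAGGED.finditer(sentence):
            # Text to the left of the filler, with earlier markup removed.
            left_context = MARKUP.sub("", sentence[:match.start()]).split()
            if left_context:
                counts[(left_context[-1].lower(), match.group("slot"))] += 1
    return counts


# Invented toy corpus in the management succession domain.
corpus = [
    "<person>John Smith</person> retired as <post>chairman</post> .",
    "<person>Jane Doe</person> resigned as <post>director</post> .",
    "<company>Acme</company> named <person>Ann Lee</person> as <post>president</post> .",
]
triggers = candidate_triggers(corpus)
```

On this toy corpus the pair ("as", "post") is seen three times and ("named", "person") once. A developer would then keep, discard or generalise such candidates by hand, which is why the process is semi-automatic instead of fully automatic.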

Related Areas and Techniques

Related areas of research are: Information Retrieval (§4.2), Summarization (§4.4) and Text Classification.

EAGLES Central Secretariat