Information Extraction

Introduction

Information extraction (IE) systems extract certain predefined information about entities and relationships between entities from a natural language text and place this information into a structured database record or template. For example, an IE system might scan business newswire texts for announcements of management succession events (retirements, appointments, promotions, etc.), extract the names of the participating companies and individuals, the post involved, the vacancy reason, and so on.
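As a rough illustration of what such a template might look like, the sketch below models one management succession record in Python. The class name `SuccessionEvent` and its field names are assumptions chosen to mirror the slots listed above; they are not taken from any MUC template definition.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SuccessionEvent:
    """One filled template for a management succession event (hypothetical layout)."""
    person: Optional[str] = None
    organisation: Optional[str] = None
    post: Optional[str] = None
    vacancy_reason: Optional[str] = None  # e.g. "retirement", "promotion"


# A record an IE system might produce for a sentence such as
# "Jones retired as president of Foo Corp."
event = SuccessionEvent(
    person="Jones",
    organisation="Foo Corp",
    post="president",
    vacancy_reason="retirement",
)

# Flatten the template into a plain dictionary, ready to store as a database row.
record = asdict(event)
```

Slots default to `None` so that a partially filled template, which is common when the text does not state every detail, is still a valid record.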

In IE there is an open question concerning the utility of generic lexical resources (e.g. WordNet, discussed in §3.4.2) versus application-dependent resources. So far, limited application-specific resources have proved more effective in applied systems, because most IE systems have been designed to work in very restricted domains (e.g. in the MUC competitions) and generic resources tend to be too general to work well in narrow, tightly defined technical domains. Application-specific resources are typically handcrafted by system developers for the domain. However, there is an increasing requirement to move IE systems to new domains rapidly and to allow end users to configure IE systems for their own purposes without the support of experts. The use (or re-use) of extensive existing resources could provide one solution to these problems, as long as the resources are adequate to the task.

Survey

The roots of IE work can be found in work by N. Sager on information formats [Sag81] and in the work of the Yale school on scripts and story comprehension which, from an IE perspective, culminated in DeJong's FRUMP system [Dej82]. A number of commercial news skimming systems emerged in the 1980s (e.g. the Jasper [1] and Scisor [Jac90a] systems), but it was really with the advent of the DARPA Message Understanding Conferences (MUCs), starting in 1987, that attention focused on IE as an independent and promising NL application area [MUC3,MUC4,MUC5,MUC6,MUC7]. Currently, some of the IE systems developed through the DARPA programme are in use by the US government, and component technology from the MUC evaluations, specifically name recognition software, has become commercially available (see §5.4). A number of European research projects are currently attempting to develop and deploy IE systems in a variety of domains (e.g. the FACILE, AVENTINUS, ECRAN and TREE projects - see [LE98]).

Approaches in the late 1980s and early 1990s attempted to use sophisticated techniques grounded in computational linguistic theory, such as full parsing, translation to logical form, and abductive theorem proving for discourse interpretation (e.g. in MUC-3, SRI's TACITUS system and NYU's Proteus system). Starting with SRI's FASTUS system in MUC-4, however, there has been a significant shift away from deep analysis systems towards systems that do shallower analysis - systems that distinguish information extraction from full text understanding and aim only for the former. These systems (see [MUC6] for examples) are characterised by the use of finite-state phrase finding techniques and, in many cases, direct surface form to template mapping without any intermediate 'semantic' representation. Such systems have been quite successful.

For more extensive reviews of information extraction see [Cow96,Gai98].

Role of Lexical Semantics

The main sorts of lexical information required by an IE system are:

subcategorization/selectional restriction frames (for verbs, nouns and adjectives) - many IE systems have domain-specific lexicons that store, for example, subcat patterns for highly domain-relevant verbs in a form that lets a pattern matching engine use the patterns directly and then map the matched patterns straight into a template (e.g. PERSON retired as POSITION of ORGANISATION);
information related to nominalization (e.g. produce/production) or adjectivization (e.g. pagare/pagabile);
syno-, hyper-, and hyponymic lexical information to carry out coreference - coreference is necessary for IE since information to be extracted about entities is frequently distributed across sentences and linked by means of pronouns (e.g. to extract the position of Jones from the sentences "Jones stepped down yesterday. He was president of Foo Corp." we need to perform pronominal coreference);
lexical information supporting proper name and multiterm identification (§5.4 and §5.2).
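The first and third items above can be sketched together: surface patterns stored in a domain lexicon are matched directly against each sentence and mapped into a template, and a pronoun in subject position is resolved to the most recently mentioned person. This is a minimal sketch under strong assumptions, not the method of any MUC system. The pattern set, the names `LEAVE`, `POST`, `PRONOUNS` and `extract`, the naive sentence splitter, and the "most recent person" heuristic are all hypothetical.

```python
import re

# Hypothetical domain lexicon: surface patterns for domain-relevant verbs.
LEAVE = re.compile(r"^(?P<person>[A-Z]\w*) (?:stepped down|retired)")
POST = re.compile(
    r"^(?P<person>[A-Z]\w*) was (?P<post>[a-z ]+) of (?P<organisation>[A-Z][\w ]+)"
)

# Subject pronouns handled by this sketch (an assumption, far from complete).
PRONOUNS = {"He", "She"}


def extract(text):
    """Fill one template from a short text, resolving subject pronouns naively."""
    template = {}
    last_person = None
    # Naive sentence split: whitespace that follows a full stop.
    for sentence in re.split(r"(?<=\.)\s+", text.strip()):
        for pattern in (LEAVE, POST):
            match = pattern.match(sentence)
            if not match:
                continue
            fields = match.groupdict()
            if fields["person"] in PRONOUNS:
                # Pronominal coreference: reuse the most recent named person.
                fields["person"] = last_person
            else:
                last_person = fields["person"]
            template.update(fields)
    return template


result = extract("Jones stepped down yesterday. He was president of Foo Corp.")
```

Here `result` holds `person`, `post` and `organisation` slots, with "He" replaced by "Jones". A realistic system would need gender and number agreement, a proper sentence splitter, and lexical (synonym and hypernym) knowledge to link descriptions such as "the executive" back to "Jones".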

Much of this information is not provided by WordNet-like resources, or at least not at the appropriate degree of specificity (and indeed virtually no use is made of existing general lexical semantic resources by any of the MUC systems). Tools for adapting generic lexicons to the actual needs of specific domains are necessary, as application domains often show deviations with respect to the normal use of language in terms of:

the kind of subcategorization frame: the frame may change according to specific uses; for example the verb indicare in standard Italian is a normal intransitive verb, while in the financial domain it takes an additional argument, related to the value and introduced by the preposition a (e.g. i titoli sono stati indicati al 2%); role restrictions can also change;
meaning: very often words assume additional meanings in specific domains; for example the verb indicare in Italian means "to point", but in the financial domain it is used to introduce prices for listed shares;
familiarity: the verb to index, for example, is considered rare in standard English (it is not even listed in WordNet), but is very familiar in finance and computer science.
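One simple way to picture such an adaptation tool is as an overlay in which domain entries extend, rather than replace, the generic ones. The sketch below assumes a toy lexicon format (a dictionary with `senses` and `arguments` lists per word); the function name `adapt`, the format, and the sample entries, including the gloss given for "index", are illustrative assumptions only.

```python
# Toy generic lexicon (hypothetical format): word -> senses and extra arguments.
GENERIC = {
    "indicare": {"senses": ["to point"], "arguments": []},
}

# Toy financial-domain additions (illustrative entries only).
FINANCE = {
    "indicare": {
        "senses": ["to quote a price for listed shares"],
        "arguments": ["a + VALUE"],  # e.g. "indicati al 2%"
    },
    "index": {"senses": ["to tie a value to a price index"]},
}


def adapt(generic, domain):
    """Return a new lexicon in which domain entries extend the generic ones."""
    # Copy the generic lexicon so that the original is left untouched.
    adapted = {
        word: {key: list(values) for key, values in entry.items()}
        for word, entry in generic.items()
    }
    for word, extra in domain.items():
        # Words missing from the generic lexicon get a fresh, empty entry.
        entry = adapted.setdefault(word, {"senses": [], "arguments": []})
        for key, values in extra.items():
            slot = entry.setdefault(key, [])
            for value in values:
                if value not in slot:
                    slot.append(value)
    return adapted


finance_lexicon = adapt(GENERIC, FINANCE)
```

After the call, "indicare" carries both its generic and its financial sense plus the extra argument, and "index" appears even though the generic lexicon lacks it. A fuller tool would also need to reorder senses by domain familiarity and adjust role restrictions.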

The use of generic resources in analysing texts in restricted domains also introduces the problem of the relation between the domain description available or needed by the system (in order to reason about the extracted information; i.e., the knowledge base) and the generic lexical semantic definition given by the generic resources. Partial overlaps can be found, but the domain specific description is likely to be more precisely defined and reliable.

As a result of these difficulties with existing generic resources, IE system builders have tended to handcraft resources for each application domain, or have looked at techniques for automatically or semi-automatically constructing lexicons of various sorts from (possibly annotated) texts in the domain. See, e.g. [Ril93a] and [Kru95].

Related Areas and Techniques

Related areas of research are: Information Retrieval (§4.2), Summarization (§4.4) and Text Classification.


EAGLES Central Secretariat eagles@ilc.cnr.it