
Text Summarization

Introduction

With the proliferation of online textual resources, an increasingly pressing need has arisen to improve online access to textual information. This requirement has been partly addressed through the development of tools aiming at the automatic selection of document fragments which are best suited to provide a summary of the document with possible reference to the user's interests. Text summarization has thus rapidly become a very topical research area.

Survey

Most of the work on summarization carried out to date is geared towards the extraction of significant text fragments from a document and can be classified into two broad categories:

domain dependent approaches where a priori knowledge of the discourse domain and text structure (e.g. weather, financial, medical) is exploited to achieve high quality summaries, and
domain independent approaches where statistical techniques (e.g. vector space indexing models) as well as linguistic techniques (e.g. lexical cohesion) are employed to identify key passages and sentences of the document.

Considerably less effort has been devoted to "text condensation" treatments where NLP approaches to text analysis and generation are used to deliver summary information on the basis of interpreted text [McK95].

Domain Dependent Approaches

Several domain dependent approaches to summarization use Information Extraction techniques ([Leh81,Ril93]) in order to identify the most important information within a document. Work in this area also includes techniques for Report Generation ([Kit86]) and Event Summarization ([May93]) from specialized databases.

Domain Independent Approaches

Most domain-independent approaches use statistical techniques, often in combination with robust/shallow language technologies, to extract salient document fragments. The statistical techniques used are similar to those employed in Information Retrieval and include vector space models, term frequency and inverse document frequency ([Pai90,Rau94,Sal97]). The language technologies employed vary from lexical cohesion techniques ([Hoe91,Bar97]) to robust anaphora resolution ([Bog97]).
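To make the statistical side concrete, the following is a minimal sketch of one possible vector space ranking. It is not the method of any of the cited systems. It assumes that each sentence is treated as a small "document", that terms are weighted by term frequency times a smoothed inverse document frequency, and that a sentence is salient when its vector is close (by cosine similarity) to the sum of all sentence vectors. The function names and the tokenizer are illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def tfidf_vectors(sentences):
    """Build one sparse tf-idf vector (a dict) per sentence."""
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    # Document frequency: in how many sentences does each term occur?
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Assumption: "+ 1.0" smoothing keeps terms that occur in every
        # sentence from receiving a zero weight.
        vectors.append(
            {t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sentences(sentences):
    """Return sentence indices, most similar to the whole text first."""
    vectors = tfidf_vectors(sentences)
    centroid = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] += weight
    scores = [cosine(vec, centroid) for vec in vectors]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

An extract would then be formed by taking the first few indices returned by `rank_sentences` and presenting the corresponding sentences in their original order. Other weighting and scoring choices are equally plausible.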

Role of Lexical Semantics

In many text extraction approaches, the essential step in abridging a text is to select a portion of the text which is most representative in that it contains as many of the key concepts defining the text as possible (textual relevance). This selection must also take into consideration the degree of textual connectivity among sentences so as to minimize the danger of producing summaries which contain poorly linked sentences. Good lexical semantic information can help achieve better results in the assessment of textual relevance and connectivity.

For example, computing lexical cohesion for all pair-wise sentence combinations in a text provides an effective way of assessing textual relevance and connectivity in parallel [Hoe91]. A simple way of computing lexical cohesion for a pair of sentences is to count the non-stop words (i.e. words other than closed-class items such as determiners and prepositions) which occur in both sentences. Sentences which contain a greater number of shared non-stop words are more likely to provide a better abridgement of the original text for two reasons:

the more often a word with high informational content occurs in a text, the more topical and germane to the text the word is likely to be, and
the more words two sentences share, the more connected they are likely to be.
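The pair-wise counting described above can be sketched as follows. This is an illustration under simplifying assumptions, not a reproduction of the procedure in [Hoe91]: the stop list is a tiny hand-written sample, words are compared by exact lowercase form, and each sentence is scored by the total number of non-stop words it shares with all other sentences.

```python
import re
from itertools import combinations

# Assumption: a very small sample stop list, for illustration only.
STOP_WORDS = {
    "a", "an", "the", "of", "in", "on", "to", "and", "or", "is", "are",
    "was", "were", "it", "that", "this", "with", "for", "by", "as", "at",
}


def content_words(sentence):
    """Return the set of non-stop word forms in a sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {w for w in words if w not in STOP_WORDS}


def cohesion_scores(sentences):
    """Score each sentence by the non-stop words it shares with the others."""
    word_sets = [content_words(s) for s in sentences]
    scores = [0] * len(sentences)
    for i, j in combinations(range(len(sentences)), 2):
        shared = len(word_sets[i] & word_sets[j])
        # A shared word links both sentences of the pair.
        scores[i] += shared
        scores[j] += shared
    return scores


sentences = [
    "The bank raised interest rates.",
    "Higher interest rates hurt the bank.",
    "It rained on Tuesday.",
]
print(cohesion_scores(sentences))  # [3, 3, 0]
```

The first two sentences share "bank", "interest" and "rates", so both score 3, while the third shares no non-stop word with either and scores 0. High-scoring sentences are candidates for the extract.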

The assessment of lexical cohesion between text units can be improved and enriched by using semantic relations such as synonymy and hyp(er)onymy [Hoe91,Mor91,Hir97,Bar97], as well as semantic annotations such as subject domains [SanFCb], in addition to simple orthographic identity.
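As a rough illustration of how semantic relations can add links that orthographic identity misses, the sketch below maps word forms to a shared concept label before comparing sentences. The `LEXICON` dictionary is a hypothetical, hand-built stand-in for a real lexical resource (a thesaurus or a subject-domain annotation), and the function names are likewise assumptions made for this example.

```python
import re

# Assumption: a very small sample stop list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "was"}

# Hypothetical lexicon: synonyms and hyponyms map to one concept label.
LEXICON = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "physician": "doctor",
}


def normalise(sentence, lexicon=None):
    """Return the non-stop words of a sentence, mapped through the lexicon."""
    lexicon = lexicon or {}
    words = re.findall(r"[a-z]+", sentence.lower())
    return {lexicon.get(w, w) for w in words if w not in STOP_WORDS}


def link_count(first, second, lexicon=None):
    """Count the words or concepts that two sentences have in common."""
    return len(normalise(first, lexicon) & normalise(second, lexicon))


a = "The car stopped."
b = "An automobile was parked."
print(link_count(a, b))           # 0: no identical word forms
print(link_count(a, b, LEXICON))  # 1: both mention a vehicle
```

With orthographic identity alone the two sentences appear unconnected. Once "car" and "automobile" are mapped to the same concept, one cohesive link is found.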

Related Areas and Techniques

Related areas of research are: Information Retrieval, Information Extraction and Text Classification.


EAGLES Central Secretariat eagles@ilc.cnr.it