

Text Summarization


With the proliferation of online textual resources, an increasingly pressing need has arisen to improve online access to textual information. This requirement has been partly addressed through the development of tools that automatically select the document fragments best suited to summarize the document, possibly with reference to the user's interests. Text summarization has thus rapidly become a very topical research area.


Most of the work on summarization carried out to date is geared towards the extraction of significant text fragments from a document and can be classified into two broad categories: domain dependent and domain independent approaches. Considerably less effort has been devoted to ``text condensation'' treatments where NLP approaches to text analysis and generation are used to deliver summary information on the basis of interpreted text [McK95].

Domain Dependent Approaches

Several domain dependent approaches to summarization use Information Extraction techniques ([Leh81,Ril93]) in order to identify the most important information within a document. Work in this area also includes techniques for Report Generation ([Kit86]) and Event Summarization ([May93]) from specialized databases.

Domain Independent Approaches

Most domain-independent approaches use statistical techniques, often in combination with robust/shallow language technologies, to extract salient document fragments. The statistical techniques used are similar to those employed in Information Retrieval and include: vector space models, term frequency and inverse document frequency ([Pai90,Rau94,Sal97]). The language technologies employed vary from lexical cohesion techniques ([Hoe91,Bar97]) to robust anaphora resolution ([Bog97]).
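A minimal sketch of the kind of statistical scoring mentioned above, assuming each sentence is treated as a "document" for inverse document frequency purposes (the function name and this simplification are illustrative, not taken from any of the cited systems):

```python
import math
from collections import Counter

def tf_idf_scores(sentences):
    """Score each sentence by the summed tf-idf weight of its words.

    A simplified vector-space-style salience measure: words that are
    frequent in a sentence but rare across the text score highest.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences each word occurs.
    df = Counter()
    for tokens in tokenized:
        for word in set(tokens):
            df[word] += 1
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = sum(count * math.log(n / df[word])
                    for word, count in tf.items())
        scores.append(score)
    return scores
```

An extractive summarizer along these lines would then return the top-ranked sentences in their original order.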

Role of Lexical Semantics

In many text extraction approaches, the essential step in abridging a text is to select a portion of the text which is most representative in that it contains as many of the key concepts defining the text as possible (textual relevance). This selection must also take into consideration the degree of textual connectivity among sentences so as to minimize the danger of producing summaries which contain poorly linked sentences. Good lexical semantic information can help achieve better results in the assessment of textual relevance and connectivity.

For example, computing lexical cohesion for all pair-wise sentence combinations in a text provides an effective way of assessing textual relevance and connectivity in parallel [Hoe91]. A simple way of computing lexical cohesion for a pair of sentences is to count the non-stop (i.e. open class) words which occur in both sentences. Sentences which contain a greater number of shared non-stop words are more likely to provide a better abridgement of the original text for two reasons: they are more likely to address the key concepts of the text (textual relevance), and they are more likely to be well linked to the surrounding sentences (textual connectivity).

The assessment of lexical cohesion between text units can be improved and enriched by going beyond simple orthographic identity, using semantic relations such as synonymy and hyp(er)onymy [Hoe91,Mor91,Hir97,Bar97] as well as semantic annotations such as subject domains [SanFCb].
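One way such semantic information could be folded into the cohesion count is to map words to shared concept identifiers before intersecting them. The sketch below assumes a hypothetical `synsets` dictionary standing in for a WordNet-style lexical resource; it is an illustration of the idea, not the method of any cited system:

```python
def semantic_cohesion(sent_a, sent_b, stop_words, synsets):
    """Cohesion count where two words also match when they map to the
    same concept (e.g. synonyms, or a word and its hypernym)."""
    def concepts(sentence):
        words = {w.lower() for w in sentence.split()} - stop_words
        # Map each word to its concept id if the resource knows it,
        # otherwise fall back to the word form itself.
        return {synsets.get(w, w) for w in words}
    return len(concepts(sent_a) & concepts(sent_b))
```

Under this scheme, "car" and "automobile" contribute to cohesion even though they never co-occur orthographically.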

Related Areas and Techniques

Related areas of research are: Information Retrieval, Information Extraction and Text Classification.

EAGLES Central Secretariat