Next: Implementation Up: Project Description Previous: Project Description

Objectives of the Project

The first goal of SPARKLE is to develop generic software that reliably produces a unique, correct but simple phrasal-level syntactic analysis of naturally occurring free text. This software will be capable of practical use for processing substantial quantities of such (corpus) material. These phrasal parsers will be generic in the sense that they aim to be compatible with a variety of extant approaches to lemmatisation, morphological analysis and lexical syntactic tagging, and to be straightforwardly parameterisable for different (European) languages.

The second goal is to develop a lexical acquisition system capable of learning subcategorisation, argument structure and semantic selection preferences for individual predicates from free text containing instances of such predicates. The lexicon acquisition system will also be developed as a parameterisable multilingual software tool incorporating language-independent and language-dependent linguistic knowledge concerning membership of predicates in broad semantic classes, (diathesis) alternations, and the linking of arguments to thematic relations.

The work devoted to producing generic parsing software and parameterisable lexicon acquisition tools addresses the general question of how to develop advanced methods for language processing (task LE 3.3) with specific reference to robust parsing and machine-aided translation in information retrieval systems. This work can be directed at a variety of commercial applications, especially in the areas of access to information and transaction services (task LE 1.7) and machine-aided translation (task LE 1.11). Its practicality will be demonstrated within SPARKLE by extending the capabilities of existing translation, speech understanding and information retrieval application systems. More specifically, we propose to apply the techniques developed in the project to the domain of selection of, navigation through and translation of multilingual information available through telematic systems and services, e.g. online directories, news, financial transactions, home shopping, travel planning and other services delivered in digital form through electronic mail. An extension to this application will add speech-driven access to the system, supporting remote access by telephone.

In applications such as document classification and retrieval, relatively crude semantic (topic) classifications are required, and recognition and semantic classification of phrases represents a potential alternative to techniques such as keyword indexing. In these applications, robust phrasal parsing is an essential prerequisite to more focussed and linguistically motivated word sense disambiguation techniques, and to manipulation and indexing of key phrases.

To give a practical example, recognizing fixed terminology, e.g. terms realized as phrases and fixed argument structure, is important for comparing documents, since these structures carry more information than individual words, the traditional units compared in Information Retrieval. In multilingual information retrieval, recognizing fixed expressions is even more important than in the monolingual case, as the translation of a technical term is often not equivalent to the translation of its components.

In addition to noun phrase recognition and translation, the techniques developed in this project will be useful for argument-specific translations, in which the translation of a given verb depends on its arguments. Being able to recognize argument structure in a text can help reduce the number of translations possible for a given word, and reduce the noise that arises in multilingual translation because different meanings are realized by different argument assignments.
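The idea of argument-specific translation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the semantic class labels, and the function name `candidate_translations` are hypothetical and are not taken from the project. It only shows how knowing the class of a verb's object can narrow the set of candidate translations.

```python
# Hypothetical toy data: English verb -> all French candidates.
ALL_TRANSLATIONS = {
    "run": ["courir", "diriger"],
}

# Hypothetical semantic classes for a few object nouns.
NOUN_CLASS = {
    "company": "ORGANISATION",
    "race": "EVENT",
}

# Hypothetical argument-specific entries: (verb, object class) -> translation.
ARGUMENT_SPECIFIC = {
    ("run", "ORGANISATION"): "diriger",
    ("run", "EVENT"): "courir",
}


def candidate_translations(verb, object_noun):
    """Return the translations of `verb` that remain once its object is known.

    If the object's class selects a specific translation, only that one is
    returned; otherwise every candidate for the verb is kept.
    """
    noun_class = NOUN_CLASS.get(object_noun)
    specific = ARGUMENT_SPECIFIC.get((verb, noun_class))
    if specific is not None:
        return [specific]
    return list(ALL_TRANSLATIONS.get(verb, []))
```

With these toy entries, an object of a known class narrows "run" to a single candidate, while an object of unknown class leaves both candidates in place.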

Speech dialogue systems will soon be in a position to provide services such as database access via telephone. So far such services have not been offered to the public, since the cost of employing the necessary personnel makes them financially prohibitive. The time scale for such an advance is clearly dependent on how reliable speech technology can be made in the near future. Robust parsing technology has a direct bearing on this issue. It is now largely accepted that the enrichment of statistical language models for speech recognition with linguistic knowledge is instrumental in improving system performance. In general, such language models provide information to cut down the search space during recognition by discarding unlikely word sequences. However, even assuming a huge training corpus and either bigram or trigram models, there will be a large number of possible token patterns for which no probabilities are collected, as they do not occur in the training data. One means of addressing this inadequacy is through semiautomated discovery of syntactic and semantic lexical equivalence classes. The establishment of lexical equivalence classes in a cost-effective manner requires a knowledge acquisition engine which makes it possible to induce regularities about word usage from text corpora and machine-readable dictionaries semiautomatically. This process of induction is only possible if robust tools for partial analysis of free text are available.
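The sparse-data problem and the role of equivalence classes can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's method: the two-sentence corpus, the hand-written class table `WORD_CLASS`, and all function names are hypothetical. It shows that a word bigram absent from the training data can still receive a non-zero count once words are mapped to their classes.

```python
from collections import Counter

# Hypothetical toy training corpus (already tokenised).
CORPUS = [
    ["flights", "to", "berlin"],
    ["trains", "to", "paris"],
]

# Hypothetical, hand-written lexical equivalence classes.
WORD_CLASS = {
    "berlin": "CITY",
    "paris": "CITY",
    "rome": "CITY",
    "flights": "TRANSPORT",
    "trains": "TRANSPORT",
}


def to_classes(tokens):
    """Replace each word by its class; unlisted words stand for themselves."""
    return [WORD_CLASS.get(token, token) for token in tokens]


def count_bigrams(sentences):
    """Count adjacent token pairs over an iterable of token lists."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts


WORD_BIGRAMS = count_bigrams(CORPUS)
CLASS_BIGRAMS = count_bigrams(to_classes(sentence) for sentence in CORPUS)


def class_bigram_count(first, second):
    """Look up a word pair through its classes rather than its surface forms."""
    pair = tuple(to_classes([first, second]))
    return CLASS_BIGRAMS[pair]
```

In this toy setting the word pair ("to", "rome") never occurs in the corpus, so its word-level count is zero, but the class-level pair ("to", "CITY") has been observed twice. A real language model would convert such counts into smoothed probabilities; that step is omitted here.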

Using the tools for robust phrasal parsing developed within the project, a lexical knowledge acquisition engine will be built and integrated with three pilot applications: Speech Dialogue, Machine-aided Translation tools for Information Retrieval Services, and Multilingual Information Retrieval. The Speech Dialogue application will be demonstrated by Daimler-Benz for German. Sharp will be in charge of the Machine-aided Translation tools for IR application, and Xerox will be responsible for the Multilingual Information Retrieval application. Both the Machine-aided Translation tools for IR and Multilingual Information Retrieval applications will be demonstrated for English, French, German and Italian. All industrial partners will use their product prototypes to demonstrate the utility of support tools for lexical acquisition. This will ensure that the pilot applications have real commercial value and may in the near future lead to actual products.

The project is positioned as research on the development of application-dependent lexical resources to be acquired semiautomatically from texts, an area which is crucial to most NLP applications. Economically feasible development of such resources will need to be based on substantially semiautomated techniques for analysing and extracting lexical information from textual corpora; otherwise coverage and/or accuracy will remain inadequate. The project intends to explore how far simple robust phrasal parsing, combined with classification techniques utilising limited and manageable linguistic knowledge and statistical data from substantial corpora, can ameliorate this problem in the area of predicate subcategorisation, argument structure and semantic preference, an area in which most extant conventional dictionaries, lexical databases and realistic lexicons are demonstrably weak. If this approach proves successful, it will have immediate impact on existing major efforts to develop the next generation of lexical resources, including those of the industrial partners for the applications, as well as providing further focus for academic research.



Sparkle Project