Advances in economic integration are increasingly turning the European community into a Multilingual Information Society, in which full participation relies on accurate and immediate access to knowledge in a variety of languages, and on the ability to consume, exchange and disseminate that knowledge. SPARKLE's main objective is to address this requirement by building on ongoing developments in Telematics, Computing and Language Engineering to develop robust and portable tools leading to commercial applications devoted to the management of multilingual information in electronic form.
The development of language models for real-world NLP applications requires flexible tools for semi-automatic induction of linguistic knowledge from text corpora. The SPARKLE consortium plans to satisfy this requirement by achieving the following goals. First, software tools will be developed that produce a phrasal-level syntactic analysis of naturally occurring free text and that can be easily parameterised by language. Technology for shallow parsing of unrestricted text (e.g. newspapers, technical documentation, and text data accessible on the Internet through the World Wide Web) represents an attainable and desirable step forward in the practical development of language engineering technology. For example, shallow parsers are essential in the derivation of NLP lexicons from text corpora and can be used in a variety of applications such as document classification and retrieval as well as machine translation. In SPARKLE, shallow parsing technology will be developed and applied to English, French, German and Italian; extensibility to other languages will also be considered. From the outset, the parsers will be able to deal with an unrestricted vocabulary, thus allowing for robust processing of corpora of tens of millions of words.
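As an illustration of what a phrasal-level analysis parameterised by language might look like, the following is a minimal sketch and not SPARKLE's actual parser. It assumes the input is already part-of-speech tagged, and the tag names, the NOUN_CHUNK_TAGS table and the chunk function are all hypothetical. Because the grouping depends only on tags, unknown words pose no problem, which is one simple way to cope with an unrestricted vocabulary.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]          # (word, part-of-speech tag)
Chunk = Tuple[str, List[str]]    # (chunk label, words in the chunk)

# Hypothetical per-language parameters: tags allowed inside a noun chunk.
# A real system would carry far richer language-specific settings.
NOUN_CHUNK_TAGS: Dict[str, Set[str]] = {
    "en": {"DET", "ADJ", "NOUN"},
    "it": {"DET", "ADJ", "NOUN"},
}


def chunk(tokens: List[Token], lang: str) -> List[Chunk]:
    """Group maximal runs of noun-chunk tags into NP chunks.

    Every other token becomes a single-token chunk labelled with its tag.
    Only the tags are inspected, so unseen words are handled like any other.
    """
    np_tags = NOUN_CHUNK_TAGS[lang]
    chunks: List[Chunk] = []
    current: List[str] = []
    for word, tag in tokens:
        if tag in np_tags:
            current.append(word)
        else:
            if current:                      # close the open noun chunk
                chunks.append(("NP", current))
                current = []
            chunks.append((tag, [word]))
    if current:                              # flush a sentence-final noun chunk
        chunks.append(("NP", current))
    return chunks


sentence = [("the", "DET"), ("parser", "NOUN"), ("reads", "VERB"),
            ("unrestricted", "ADJ"), ("text", "NOUN")]
print(chunk(sentence, "en"))
```

Under these assumptions, moving to another language means supplying a new entry in the parameter table rather than rewriting the procedure.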
The second goal of SPARKLE is to develop a lexical acquisition system capable of learning from free text the aspects of word knowledge that are needed for language engineering applications. The creation of such tools will make it possible to build sufficiently rich NLP lexicons in a cost-effective manner. Language Engineering applications, ranging from superficial text critiquing to machine translation and speech dialogue, need information about words. To attain practicality and habitability, such systems must be furnished with a substantial lexicon covering a realistic vocabulary and providing the kinds of linguistic knowledge appropriate to the application. Unfortunately, the lexical component remains a major bottleneck for current NLP systems. This is due to the limited effectiveness of existing automatic methods for lexicon acquisition, and to the labour-intensive nature of encoding lexical entries manually. If we assume that the task of developing an adequate core lexicon is equivalent to that of developing a conventional learners' dictionary, then the labour required runs into hundreds of person-years. SPARKLE addresses this problem by developing a system for (semi-)automatic lexical acquisition which uses shallow parsing to improve the accuracy and effectiveness of knowledge extraction from text data. More precisely, the SPARKLE acquisition system will work on the output of the shallow parsers built in the first phase of the project to extract lexical knowledge about semantic classes of predicates, subcategorization, argument structure, preferential selectional restriction and diathesis alternations for the languages of focus. The lexicon acquisition system will thus be parameterisable for each choice of language, making it possible to incorporate language-independent as well as language-specific linguistic knowledge.
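To make the idea of acquiring subcategorization information from shallow parser output more concrete, here is a minimal sketch under strong simplifying assumptions. It is not the SPARKLE acquisition system. It assumes each sentence arrives as a list of (label, words) chunks with the hypothetical labels NP, PP and VERB, it keys verbs by surface form rather than by lemma, and it filters candidate frames with a plain relative-frequency threshold chosen purely for illustration. The names extract_frames and filter_frames are invented for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Chunk = Tuple[str, List[str]]    # (chunk label, words in the chunk)
Frame = Tuple[str, ...]          # e.g. ("NP", "PP")


def extract_frames(sentences: List[List[Chunk]]) -> Dict[str, Counter]:
    """Count, for each verb, the sequences of NP/PP chunks that follow it."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for chunks in sentences:
        for i, (label, words) in enumerate(chunks):
            if label != "VERB":
                continue
            frame: List[str] = []
            for next_label, _ in chunks[i + 1:]:
                if next_label in ("NP", "PP"):
                    frame.append(next_label)
                else:
                    break                    # stop at the first non-argument chunk
            counts[words[0]][tuple(frame)] += 1
    return counts


def filter_frames(counts: Dict[str, Counter],
                  min_rel_freq: float = 0.2) -> Dict[str, Set[Frame]]:
    """Keep only frames seen in at least min_rel_freq of a verb's occurrences."""
    lexicon: Dict[str, Set[Frame]] = {}
    for verb, frames in counts.items():
        total = sum(frames.values())
        lexicon[verb] = {f for f, n in frames.items() if n / total >= min_rel_freq}
    return lexicon


# Toy chunked corpus, invented for illustration only.
corpus = [
    [("NP", ["she"]), ("VERB", ["gives"]), ("NP", ["him"]), ("NP", ["a", "book"])],
    [("NP", ["she"]), ("VERB", ["gives"]), ("NP", ["a", "book"]), ("PP", ["to", "him"])],
    [("NP", ["he"]), ("VERB", ["sleeps"])],
    [("NP", ["she"]), ("VERB", ["gives"]), ("NP", ["him"]), ("NP", ["a", "pen"])],
]
counts = extract_frames(corpus)
print(filter_frames(counts, min_rel_freq=0.5))
```

The threshold is the crudest possible filter; it stands in for whatever statistical test a real acquisition system would use to separate genuine frames from parsing noise.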
A further goal of the project will be to compare the relative success of different approaches to the tasks of robust parsing and lexical acquisition, both within and across languages. This goal will be achieved through a process of verification and validation which will identify appropriate techniques capable of yielding practical solutions to both tasks in terms of accuracy and ease of parameterisation. The parsers and lexicons produced in the project will be used by the industrial partners to build pilot applications in the areas of multilingual information retrieval and speech dialogue. The multilingual information retrieval systems, developed by Xerox and Sharp, will translate phrases to allow users to interrogate a foreign-language database in their own language. Performance of the systems will be evaluated by the standard criteria of information retrieval, against a baseline system which does not use the lexicon data and parsing tools developed in the project. The speech dialogue system, developed by Daimler Benz for German, will also make use of SPARKLE's parsers and lexicons to generate a probabilistic language model for speech recognition.
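The comparison against a baseline can be sketched with precision and recall, two of the standard information retrieval measures; the text does not say which criteria the project will actually use, so this choice is an assumption. The document identifiers and both result lists below are invented toy data, and precision_recall is a hypothetical helper.

```python
from typing import Iterable, Set, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents overall
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data, invented for illustration: one query, two systems.
relevant_docs = {"d1", "d2", "d3", "d4"}
baseline_results = ["d1", "d5", "d6", "d2"]    # system without project resources
enhanced_results = ["d1", "d2", "d3", "d7"]    # system using parsers and lexicons

print("baseline:", precision_recall(baseline_results, relevant_docs))
print("enhanced:", precision_recall(enhanced_results, relevant_docs))
```

In a real evaluation the two systems would be run over the same query set and document collection, and the per-query scores averaged before comparison.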