In the EU in particular, it is felt that progress in NLP and speech applications is hampered by a lack of generic technologies and of large-scale language resources, by a proliferation of different information formats, by variable linguistic specificity of existing information and by the high cost of development of resources.
In this section, we look briefly at why large-scale infrastructure resources are necessary to enable further progress in development and delivery of language engineering products.
Examples of such infrastructure resources include dictionaries, corpora, test suites and grammars. With the maturation of language technology, there come concerns about such matters as scaling-up, robustness, coverage, task suitability and accuracy.
It is generally true to say that language engineering technology was until very recently largely developed according to a methodology which involved:
The many systems produced according to such a methodology tend to fall into one of two classes:
Lengthy experimentation with systems of the second type can reveal areas in which they are of more use than others. Much depends on the purpose to which a system is put: poor performance may be acceptable, for example, where one is not overly concerned with high quality.
The development of systems based on linguistic theory has mitigated many of the problems caused by earlier systems, whose ad hoc linguistic descriptions were responsible for much erratic behaviour. However, adopting a theoretically based approach does not in itself guarantee a robust and accurate system if the data being modelled are limited in extent and variety -- or remain undifferentiated even though they reflect different language behaviour.
Today, it is accepted that language technology must be based on adequate linguistic knowledge about the types of messages to be processed. This implies deep and detailed knowledge about language structure, vocabulary, the role of context, collocation, discourse, stress, intonation, pronunciation, interaction in conversational situations and a host of other aspects of language behaviour, not forgetting those aspects concerned with multilingual communication.
In the last few years, therefore, there has been a rapidly increasing interest in building and sharing very large-scale linguistic resources for both speech and natural language processing.
Resources such as corpora reflect actual usage of language in a variety of situations, for a variety of purposes. They provide documented evidence of language behaviour. As such, they are primary resources compared to, for example, dictionaries or grammars which are secondary, derived resources. Those working in speech have been accustomed to using corpora for system training purposes for some time now. In natural language processing, extensive use of corpora is of recent date. In lexicography, there is a long tradition of corpus-based work for a small number of languages. Only recently, however, has the need for very large-scale corpora come to the fore.
It is important to note that there is a pressing need not only for text corpora, but also for spoken language corpora. This latter type comes in many different varieties (as do text corpora). Given the amount of speech data -- with associated derived annotations and analyses -- required to support the development of a speech system even for some limited task, it is clear that the speech field, like the natural language field, is heavily reliant on corpus resources for successful progress.
Thus, no matter whether one is engaged in symbolic or statistically based approaches to Language Engineering, or is engaged in speech or natural language R&D (or both, as is increasingly common), large corpora are seen today as necessary resources for the acquisition of knowledge about language and language behaviour.
Computational lexicons represent another area where a pressing need is felt for the provision of large-scale resources. Almost every language technology system requires a lexicon. Today's linguistic theories place great stress on the lexicon: much detailed information is required, and it is expensive to produce such resources with the level of detail required. Attempts have been made to process existing publishers' machine-readable dictionaries to extract the required information; however, as these are invariably oriented towards human needs and contain relatively unformalised information, success has been variable.
Language technology systems intended to process real messages in everyday working environments require, insofar as is feasible, a fully up-to-date dictionary that offers exhaustive coverage of the target domain in the depth required to support linguistic processing.
Thus, large-scale lexicons are required, from which subsets of data may be selected for particular applications. There is an intimate link between corpus and lexicon: a reasonable lexicon must be based on a corpus, although this does not preclude the augmentation of its information through human intuition. The volumes of data and information involved are such that it is difficult for any one body to identify, collect, process, select, augment, disseminate and maintain lexical information to support language technology. It is quite impossible for the type of small to medium-sized company that, in Europe, is the typical provider of language technology to engage in endeavours related to linguistic resources on this scale.
For several years now, the lexicon has been recognised as the major bottleneck preventing widespread development of language technology. This is as much true for natural language processing as for speech processing.
Furthermore, as part of product development, producers must be able to assess and evaluate their language technology products. End-users have similar needs: to determine how useful and cost-effective some product may be for their purposes. Such requirements lend further motivation to the establishment of corpora and other collections of data for test purposes. A particular kind of testing utility is the test suite: an exhaustive, systematic compilation of instances of language behaviour against which the performance of a system may be tested.
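By way of illustration only (the analyser and its interface below are hypothetical stand-ins, not any particular system), a test suite can be as simple as a table of inputs paired with expected judgements, run mechanically against the system under evaluation:

```python
# A toy test suite: each entry pairs an input string with the
# expected grammaticality judgement (True = grammatical).
TEST_SUITE = [
    ("the cat sleeps", True),
    ("cat the sleeps", False),
    ("the cats sleep", True),
]

def toy_analyser(sentence):
    """Stand-in for the system under test: accepts a sentence
    only if it begins with a determiner (crude, purely for
    demonstration purposes)."""
    return sentence.split()[0] in {"the", "a", "an"}

def run_suite(analyser, suite):
    """Score an analyser against a test suite; return the number
    of entries on which it agrees with the expected judgement."""
    return sum(analyser(s) == expected for s, expected in suite)

print(run_suite(toy_analyser, TEST_SUITE))  # → 3
```

Real test suites are of course far larger and systematically organised by linguistic phenomenon, but the principle is the same: a fixed, documented set of instances against which successive versions of a system can be scored and compared.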
There is thus a widespread need for large scale infrastructure resources, for many reasons. This need is directly related to the work of EAGLES: without standard means of designing, constructing, manipulating, accessing and disseminating such resources, little progress will be made towards meeting this need. At the heart of this whole area lies the notion of reusability, to which we turn next.
Interest in reusability arose due to the significant costs involved, for industry, in building the large-scale infrastructure resources that are necessary for the development, testing and maintenance of successful language technology products.
The cost and effort of developing large-scale language resources and the tools to manipulate them are high. It is uneconomic to build a resource for the development of one application only, or a sophisticated tool that is tied to a particular resource. There is thus great interest in the design of resources and tools such that these can be reused. A large lexicon, for example, richly annotated with linguistic information to support speech and natural language processing, is only worth building if it can be used to support a range of applications both now and in the future. If one has a small resource, one might prefer to acquire extra information from elsewhere, if possible, rather than duplicate existing work at perhaps greater cost. However, such acquisition can only take place if the information desired can be accessed and readily interpreted.
At the core of this problem lies the crucial notion of reusability. Two interpretations of reusability are highly relevant in Language Engineering: the reuse of pre-existing resources, originally created for other purposes, in the construction of new systems; and the design of new resources in such a way that they can serve many different applications, theoretical frameworks and users.
An example of a reusable resource under the first interpretation is a publisher's machine readable dictionary, which is conceived originally for human use but whose contents are processed in order to extract information for the dictionary of a natural language processing system.
An example under the second interpretation is a lexicon designed to support theory-based language engineering applications, whose contents may be mapped into particular monolingual or multilingual application dictionaries expressed according to the principles of some particular theory. Such a dictionary is variously referred to as a neutral dictionary or a polytheoretical dictionary.
This dual nature of reusability was first articulated by J. McNaught at a meeting in Dublin in 1987 with S. Perschke and A. Zampolli, in relation to lexical resources. For further discussion of reusability, see Calzolari (1990) and Heid and McNaught (1991).
It is apparent that, given the volumes of lexical data to be processed for even a modest resource and the time and cost involved, reusability, in the second sense, is effectively a mandatory requirement for any future resource. Furthermore, the following is also apparent:
Language Engineering is a difficult area which is, moreover, in a state of transition, as designers move from largely ad hoc systems to systems based on the results of research into linguistic theory. We are thus faced with the need to represent language data and knowledge about language at various levels of abstraction, as well as with the need to interpret information within some theoretical framework. Substantial work has already been done concerning standardisation in relation to physical representation. Indeed, standardisation efforts promoted over the last decade by the IT industries and by international and European standards bodies, such as ISO, CCITT and CEN-CENELEC, have led to stable means for the encoding of data. Widely used norms exist for the representation of textual data, ranging from multilingual character sets to powerful metalanguages for document structuring and description. The US-European Text Encoding Initiative (TEI) (Sperberg-McQueen and Burnard, 1994) has made considerable progress in tailoring such metalanguages to the many types of electronic text.
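For instance, a fragment of spoken dialogue might be encoded in TEI-style markup along the following lines (a simplified sketch: the `u` and `pause` elements follow TEI conventions for transcribed speech, but the attribute values here are illustrative only):

```xml
<!-- Simplified, TEI-style encoding of a two-turn dialogue.
     Speaker identifiers and the degree of detail recorded
     are illustrative, not prescribed. -->
<u who="A">could you repeat that <pause/> please</u>
<u who="B">certainly</u>
```

The point of such a metalanguage is that the markup records *what* the data are (utterances, speakers, pauses), independently of any particular system that will consume them.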
What is far more problematic, especially when faced with competing theories, is how to agree on a core set of basic notions, and then how to interpret the meaning of some physical representation which ultimately refers to some abstract entity: for example, a linguistic label whose name has been agreed upon, but to whose meaning different theories attach different views. We are thus faced with a dual problem when attempting to set up standards in Language Engineering: we have to deal with physical format as well as with interpretation of that format.

Moreover, we have to deal with various types of information. Linguistic knowledge is traditionally expressed according to linguistic level: phonetic, phonological, morphological, syntactic, semantic, pragmatic and so on. Each level has many theories or approaches associated with it. Our knowledge of each level differs in terms of completeness and adequacy, as does our ability to formalise what we do know. No level can be said to be completely understood and formalisable. In general, more confidence is felt regarding our ability to handle `lower' levels such as phonetics and morphology, whereas for the `higher' levels such as semantics and pragmatics there is a feeling that much research still needs to be done to provide even minimally workable solutions. This is a gross characterisation: the detailed picture is much too complex to discuss in depth here, especially when one thinks of approaches which cut across traditional linguistic level boundaries and of theories which take a holistic view of language.
There is also a need to integrate products into present or future language technology environments, to interface them with various types of information system, to expand them to different domains and languages, and to maintain and, indeed, further develop them.