Concerns about standardisation and reusability came to prominence in the mid 1980s, particularly among those working on lexical issues in centres carrying out research in natural language processing. Contemporary trends in linguistic theory made it clear that the lexicon would be the keystone of linguistic description for the foreseeable future. As confidence grew in being able to build large computational grammatical descriptions of language, concern also grew over their associated, much larger, lexical descriptions: if one were to change, for example, one's grammatical theory, how could one cope with the knock-on effects on the lexicon? Since the lexicon is expressed in the same theoretical framework as the grammar, a reworking of potentially thousands of lexical entries would be, for most projects, prohibitively expensive. This is all the more true in a European context, where we have to deal with numerous languages and the mappings between them. In the grammar-oriented view of the time, a lexicon was formally no more than a convenient place to gather data about the terminal elements of a grammar: formally, there is no separation between a grammar and its terminal elements.
In the early 1980s, work had begun on the extraction of information from publishers' machine-readable dictionaries. Although the bulk of this work was undertaken for natural language processing purposes, there was also much interest in using such extracted information for psycholinguistic studies, for information retrieval purposes and for other tasks where lexical knowledge is of importance. Much of this extraction work was conducted for ad hoc purposes, and little need was felt for cooperation or for common frameworks. Such processing of publishers' machine-readable dictionaries is still very much with us in the 1990s, although it has evolved strong concerns with reusability of the results.
However, it is only since the mid 1980s that standardisation and reusability issues have taken on increasing importance, and not just in the field of the lexicon. Nevertheless, it was the lexicon field which provided the impetus for such issues to be addressed in other fields.
The seminal event which set in motion a number of efforts towards standardisation and reusability took place in 1986: a workshop on Automating the Lexicon: Research and Practice in a Multilingual Environment, otherwise known in the field as The Grosseto Workshop, organised by N. Calzolari, L. Rolling, J.C. Sager, D. Walker and A. Zampolli (Walker et al., 1995). The serious nature of this workshop is evident from the list of its sponsoring bodies: the European Commission (EC), Consiglio Nazionale delle Ricerche (CNR), University of Pisa, Association for Computational Linguistics (ACL), Association for Literary and Linguistic Computing (ALLC), Association Internationale de Linguistique Appliquée (AILA) and European Association for Lexicography (EURALEX). Participation was by invitation and was organised to provide a mix of linguists, lexicographers, lexicologists, computational linguists, artificial intelligence specialists, cognitive scientists, publishers, lexical software producers, translators, funding agency representatives and representatives of professional associations, drawn from several European countries, North America and Japan. The topic of the Computational Lexicon was thus tackled from a broad perspective. Numerous recommendations arose from this workshop (cf. Walker et al. (1995) and Zampolli (1991)). These in turn led to a common statement that high priority should be assigned to the design and development of large, reusable, multifunctional, precompetitive, multilingual linguistic resources.
An explosion of follow-up meetings took place between 1986 and 1988, including:
Similar events have taken place in increasing numbers since then.
In addition, as a direct result of the Grosseto Workshop, several groups came into being:
The first of these is especially important, as it brought together for the first time representatives of the then major contemporary schools of linguistics, to investigate in detail the possibility of representing the linguistic information frequently used in parsers and generators (i.e. the major syntactic categories, subcategorisation and complementation, verb classes, nominal taxonomies, etc.) in such a way that it could be used and reused in various theoretical frameworks. Not only did this group look at different linguistic syntactic theories and their points of convergence and divergence, but it also looked into how such theories could handle phenomena in several different European languages.
The Grosseto recommendation to give priority to the design and development of large, reusable, multifunctional, precompetitive, multilingual linguistic resources has not, as yet, yielded any such resource. This is not surprising, given the degree of cooperation and level of funding needed for such enterprises. To some extent, however, there has been hesitation over the need to develop very large scale reusable resources. There are practical organisational questions to be answered regarding funding, running, access modalities, acquisition policy and so forth. There are fundamental questions to be answered regarding the choice of representation for data, the interpretation of that representation, which levels of linguistic knowledge to recognise and how these relate within and across language descriptions. Associated questions arise regarding the nature of tools for acquisition of and access to data. The strong requirement for reusability has affected, in one way or another, every discussion of all such issues. If the problem is reduced simply to one of money, then no-one will spend a great deal of money on building a very large scale resource if it cannot be guaranteed to be reusable, even if the case for the existence of such large resources is made and accepted. This is especially true in a European context, where little money has been available for language engineering in comparison to what is spent on, for example, certain aspects of agriculture, defence or other areas of information technology. Large-scale infrastructure projects tend to be less favoured, financially, than applications-oriented projects, especially when there is a demand for rapid development of applications. However, most language engineering applications can only be successfully developed and delivered if there is an adequate linguistic resource infrastructure in place. This fact, well known to the community, has also been recognised for many years by the EC and by its advisory bodies.
One of the most recent confirmations of the key role of infrastructural linguistic resources is to be found in the mid-term review carried out for the EC's Telematics programme:
Language Engineering requires an infrastructure of a type that is unusual in other subjects [...] Without a major attack on these infrastructural needs, applications engineering will make little progress in the language engineering field.
(Oakley, 1993, 70)
Bates and Weischedel (1993) also place great emphasis on the role of infrastructure for language processing.
It is to the credit of the EC that it took a strong interest in encouraging developments in the area of reusable linguistic resources, from an early date. Given the nature of EC programmes and the internal organisation of the EC, four projects were initiated in various frameworks at the beginning of the 1990s, looking at aspects of the reusability area. Almost immediately, these projects began to discuss ways and means of cooperating to reach common objectives, under the aegis of the EC.
As it was clear that one could not immediately undertake the development of very large scale reusable linguistic resources, these projects were all concerned with preparatory or enabling aspects:
At roughly the same time, a major project was launched within the EUREKA framework, namely GENELEX. EUREKA projects are collaborative projects involving predominantly industrial consortia, partially funded by European governments. GENELEX was set up to establish a convergence path for various lexical assets owned by the consortium into a common conceptual structure. It has been concerned with many of the issues regarding reusability that exercised EUROTRA-7 and MULTILEX, although it was not initially conceived as a standards-related project.
It can be appreciated that ACQUILEX (Calzolari and Briscoe, 1995) was concerned primarily with reusability in the first sense we have identified, whereas EUROTRA-7 and MULTILEX were concerned with reusability in the second sense, and GENELEX is concerned with both.
EUROTRA-7 (Heid and McNaught, 1991) in particular deserves mention. Its goal was to assess the feasibility of designing large scale, reusable lexical and terminological resources. A broad range of different possible sources of lexical material, and in particular of different applications requiring a lexical component, was investigated, with attention focussing on different theoretical frameworks, different needs with respect to granularity of information, depth and coverage of description, and so on, to provide an account of a variety of diverging and converging needs.
An important observation of EUROTRA-7, which led to a methodological recommendation for future actions towards developing specifications for reusable linguistic resources, was -- stated in very simplistic terms -- that different theories describe essentially the same facts, but make different generalisations and use different descriptive devices. The methodological claim deriving from this was, on the one hand, that one should go back to the most fine-grained, observable differences and phenomena, i.e. to reach an extreme atomisation of linguistic observations, and, on the other hand, that one should aim for complete explicitness of descriptive devices and should provide explicit and reproducible criteria for each observable difference taken into account.
This shared layer of granular observations would then constitute a common data pool, accounting for atomic facts, subclassified according to explicit criteria, and represented in some problem-oriented high-level formalism (a typed feature based language was favoured). The global reuse scenario defined within EUROTRA-7 foresaw such a common data pool being at the heart of a model composed of three major areas: acquisition, representation and application. Within this model, lexical data are represented in terms of the common data pool. There is an interface, through specific acquisition tools, to sources of lexical information, and another interface, through a compiler/interpreter level which takes care of recombining specific descriptive devices and of filtering relevant information, to individual application lexicons which need application specific lexical descriptions.
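The EUROTRA-7 scenario just described can be illustrated with a small sketch. All names and facts below are hypothetical and chosen only for illustration; the project itself favoured a typed feature based language rather than the plain Python structures used here. The point is the architecture: atomic observations, each justified by an explicit, reproducible criterion, sit in a common pool, and an application lexicon is compiled from the pool by filtering and recombining just the facts that application needs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicFact:
    """One minimal observable difference, with the explicit
    criterion that licenses recording it."""
    lemma: str
    attribute: str   # e.g. a subcategorisation property
    value: str
    criterion: str   # reproducible test justifying the fact

# Common data pool: theory-neutral atomic facts (illustrative only).
POOL = [
    AtomicFact("give", "takes_np_object", "yes", "passivisation test"),
    AtomicFact("give", "takes_pp_to", "yes", "'give X to Y' attested"),
    AtomicFact("sleep", "takes_np_object", "no", "no passive available"),
]

def compile_lexicon(pool, wanted_attributes):
    """Stand-in for the compiler/interpreter level: filter the pool
    and recombine facts into per-lemma application entries."""
    lexicon = {}
    for fact in pool:
        if fact.attribute in wanted_attributes:
            lexicon.setdefault(fact.lemma, {})[fact.attribute] = fact.value
    return lexicon

# A parser needing only NP-object subcategorisation gets a projection
# of the pool, not a copy of the whole description:
parser_lexicon = compile_lexicon(POOL, {"takes_np_object"})
```

Because each application lexicon is derived from the shared pool rather than written independently, a change in descriptive theory is absorbed at the compilation step, which is the reuse argument the model was designed to make.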
EUROTRA-7 was thus important in terms of carrying out work towards an initial specification of a model for a reusable lexicon. In addition, this study made numerous recommendations regarding the need for work towards standards, the importance of bringing industry and academia together in collaborative ventures and the importance of setting up a forum to share knowledge and to tackle issues of community-wide relevance together.
Overlapping with EUROTRA-7, the ESPRIT MULTILEX project (MULTILEX, 1993) was concerned with proposing standards for multilingual lexicons for natural language processing. Emphasis was placed on means of exchanging and sharing lexical data. This led to work not only on architectures for lexicons, but also to work on the various possibilities for exchange, together with their various implications, for example:
These can be seen as variations on a theme. The crucial area came to be seen as the development of standard ways of encoding and interpreting lexical data. The difficult aspect has always been obtaining agreement on the interpretation of some encoding: we may agree on a name for some object, but disagree (if indeed we realise there is disagreement) over the interpretation of that named object.
MULTILEX was able to carry forward work started during EUROTRA-7. In particular, it went into greater detail regarding levels of linguistic knowledge. A traditional division of the lexicon was followed, i.e. into levels ranging from orthography and phonology through morphosyntax and syntax to semantics. An additional pragmatic level was incorporated, running orthogonally to the other levels (i.e. pragmatic information was essentially distributed over all other levels). Each level was investigated to determine whether it was ripe for standardisation and, if so, to what extent standardisation could take place, in what expected timeframe, and so on; if not, to determine what other efforts might be envisaged to push towards standardisation.
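The layered organisation described above, with pragmatics running orthogonally to the other levels, can be sketched as a simple data structure. This is a hypothetical illustration, not the MULTILEX representation itself (which used a typed feature formalism); the level names follow the traditional division cited in the text, while the entry layout and the example facts are assumptions.

```python
# Traditional division of the lexicon into levels of description.
LEVELS = ["orthography", "phonology", "morphosyntax", "syntax", "semantics"]

def make_entry(lemma):
    """Build an empty layered entry. Pragmatic information is not a
    level of its own: a pragmatics slot is carried by every level,
    modelling its orthogonal distribution."""
    return {
        "lemma": lemma,
        "levels": {lvl: {"description": {}, "pragmatics": {}} for lvl in LEVELS},
    }

entry = make_entry("bank")
# A morphosyntactic fact lives at its own level...
entry["levels"]["morphosyntax"]["description"]["pos"] = "noun"
# ...while a register note, being pragmatic, attaches orthogonally,
# here to the semantic level:
entry["levels"]["semantics"]["pragmatics"]["register"] = "neutral"
```

Separating the levels in this way is what makes it possible to ask, level by level, whether standardisation is feasible: agreement on the morphosyntax slot does not presuppose agreement on the semantics slot.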
The results of MULTILEX served to confirm what had earlier been suspected (via EUROTRA-7 and the work of the Ad Hoc Working Group on the Neutral Lexicon): there was room for agreement at the levels of morphosyntax and syntax; however, at higher levels (semantics) the field was neither mature enough nor rich enough, in terms of workable, generally acceptable, widely used semantic models, to allow practical standardisation. Despite its short timespan, the standards-related phase of MULTILEX was able to propose a reasonable inventory of standard labels for morphosyntax, together with a preliminary design for an architecture capable of supporting a multilingual lexicon. Although MULTILEX provided definitions for the inventory of labels and examples of their use within a typed feature formalism, this work resulted in only a partial solution to the reusability problem: what was missing was a full lexical specification, in that there was no link from the inventory to any underlying pool of subclassified linguistic phenomena at the atomic level of minimal observable difference.
However, the positive results of MULTILEX, related to standards, were that it helped to advance our knowledge; it enabled motivated decisions to be taken about which areas were worth pursuing in terms of standardisation; it provided a preliminary design for a reusable multilingual lexicon; it allowed collaboration to be established between industry, academia and appropriate funding bodies; and it offered an inventory of labels as a starting point for future work. It should be pointed out that MULTILEX, in a subsequent phase, experimented with the design and implementation of tools and lexical databases appealing to the proposed lexical standards for various purposes, work which has provided valuable insights for later projects.
GENELEX (Antoni-Lay et al., 1994) has taken a slightly different view of the reusability issue, by aiming at developing a generic, application-independent model of a lexicon, thoroughly documenting it and providing explicit guidelines on its use. The GENELEX lexicon is `theory-welcoming' in that it can adapt to competing theories, via its associated formalism. Various tools are being developed to aid in management of GENELEX dictionaries and large-scale dictionaries are being built.
GENELEX, MULTILEX and EUROTRA-7 therefore had much in common. Importantly, there was also a degree of overlap among these projects in terms of participants. Given their common interests and the growing importance of the need for standards, they came together to form a Coordination Group, under the aegis of the EC.
NERC (Network of European Reference Corpora) (Calzolari et al., 1995) was launched in 1992 with the objectives of making recommendations to the EC about the future of language corpus provision in Europe, seeking to establish Europe-wide cooperative actions in relation to large-scale resources, designing a distributed repository of such resources in the European Union, experimenting with means of collecting and disseminating such resources via networks and appropriate media, and specifying an infrastructural network of large-scale resources on the basis of existing EU networks and organisations. Thus, NERC has become a focal point for corpus-related work in the Union.
Although there have been many projects in the area of natural language processing that have been foundational or enabling projects with respect to the present initiative, there has also been highly significant input from the speech community, in the form of the ESPRIT SAM project (Speech Input/Output Assessment Methodology and Standardisation) (SAM, 1992). This significant project covered many aspects of speech and resulted in a wealth of documents on standards-related aspects (housed at UCL, London) and numerous standard tools for speech (notably the SAM Workstation), among other valuable contributions.
On the basis of SAM alone, there is every justification for claiming that the achievement of the spoken language community in the EU in developing solid codes of working practice for multinational, multilingual projects is unparalleled.
The situation sketched above created the appropriate conditions for the launching, firstly, of a project definition study in 1992 and then, in February 1993, of the EAGLES project, aiming specifically at defining standards or preparing the ground for future standard provision.
The areas of concern to EAGLES are text corpora, computational lexicons, grammar formalisms, evaluation and assessment, and spoken language. For each area a core working group (WG) has been established, where leading experts of both the research and the industrial communities are represented, combining their efforts towards the development of a common basic European infrastructure and agreed linguistic specifications.