INTERA - WP5 - Resource Production

Within this workpackage - co-ordinated by ILSP (Institute for Language and Speech Processing, Greece) - ILC is responsible for the production of multilingual terminological lexicons.


The objectives of this workpackage are:

  1. production of new multilingual resources (parallel corpora and terminologies) for the less widely spoken languages, including the Balkan ones;
  2. investigation of user needs, in order to specify the domain(s) of interest to e-Content professionals;
  3. adaptation and extension of existing standards and specifications for multilingual resources;
  4. elaboration of a commercially attractive model for the production of language resources.

Description of Work

This workpackage is organized around three tasks.

First task - Technical specifications for the selection and encoding of multilingual resources
The specifications of the resources to be produced will be based on the previous experience of the Consortium and on the user needs of the e-Content world, which will be identified through contacts with the user forum and from previous surveys.
The user needs will influence the selection of the domain(s) in which the resources will have to be developed, as well as the types of information to be annotated.
The specifications will also take into account the standards developed throughout the project and may be revised to reflect requirements identified during the implementation phase.

Second task - Production of multilingual resources (parallel corpora and terminological lexicons)
In a subsequent stage, the partners will construct multilingual resources, namely parallel corpora (with a target size of 12 million running words) and multilingual terminologies.
Parallel corpora will be based on raw data to be provided by external collaborators identified and contacted during the project.
The partners will enhance the corpora by adding the necessary information to them, namely structural annotation, alignment and any other information considered necessary on the basis of the user requirements.
Multilingual terminological resources will be built by extracting candidate terms from the corpora, submitting them to experts for evaluation and enriching the validated terms with the appropriate information.
Both types of resources will be encoded in formats which are compatible with the project standards.
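
As an illustration of the candidate-term extraction step mentioned above, the following is a minimal sketch only. The workpackage description does not specify the extraction method, so the frequency-based approach, the function name candidate_terms, the stopword list and the sample text are all assumptions made for the example. It lists frequent one- and two-word sequences that could then be passed to experts for evaluation.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be
# larger and specific to each language in the corpus).
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for", "on"}


def candidate_terms(text, max_len=2, min_freq=2):
    """Return (term, frequency) pairs for word n-grams seen at least min_freq times."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip n-grams that begin or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    ranked = [(term, freq) for term, freq in counts.items() if freq >= min_freq]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample text standing in for one side of a parallel corpus.
sample = (
    "The parallel corpus is aligned. The parallel corpus is annotated. "
    "Alignment of the parallel corpus helps term extraction."
)
terms = candidate_terms(sample)
for term, freq in terms:
    print(term, freq)
```

The output is a list of candidates only: expert validation and the enrichment of validated terms remain separate manual steps, as described above.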

Third task - Elaboration of a model for producing language resources
The partners will outline - in the form of theoretical and practical guidelines - a model that is commercially viable for the production of language resources.