Sharing AudioVisual language resources for Automatic Subtitling

Type of project: European  |  Start date: 01/05/2012  |  End date: 30/04/2014

The SAVAS project collected spoken and textual resources in six European languages and built domain-specific Large Vocabulary Continuous Speech Recognizers (LVCSR) to address the automated subtitling needs of the media industry.

More specifically, the main objectives of the project were:

  1. to make the acquisition and annotation of audiovisual language resources produced by broadcasters and subtitling companies more effective for the development of LVCSR systems targeting automated subtitling;
  2. to deploy a platform for sharing audiovisual language resources between the media industry and LVCSR developers, using the legal and business data trading approaches most suitable for the media industry;
  3. to show the impact of feeding LVCSR technology with existing audiovisual language resources for automated subtitling purposes.

In order to achieve these goals, SAVAS:

  • collected spoken and textual resources in the languages addressed from the broadcasters and subtitling companies acting as data providers within the consortium;
  • transcribed and annotated the collected corpora, using a combination of automatic and collaborative approaches, into a form suitable for training the acoustic and language models of LVCSR systems;
  • built a local META-SHARE repository containing the collected and annotated SAVAS language resources to allow their reuse;
  • adapted and trained dictation and transcription LVCSR systems with the SAVAS language resources;
  • integrated the developed systems into several automated subtitling application scenarios and evaluated them there, in order to show the impact of audiovisual data sharing on automated subtitling.


Funding programme:
7th Framework Programme

Funding body:
European Commission

Grant agreement:


CNR-ILC Research Unit Chair:
Monica Monachini

Paola Baroni
Francesca Frontini