Sharing AudioVisual language resources for Automatic Subtitling
Type of Project: 
Funding Body: European Commission
Funding Programme: 7th Framework Programme
Grant Agreement: 
Start Date: 
End Date: 
Project Chair: Carlo Aliprandi (SyNTHEMA S.r.l.)
ILC Research Unit Chair: 

The SAVAS project collected spoken and textual resources in six European languages and built domain-specific Large Vocabulary Continuous Speech Recognition (LVCSR) systems to meet the automated subtitling needs of the media industry.

More specifically, the main objectives of the project were:

  1. to improve the acquisition and annotation of audiovisual language resources produced by broadcasters and subtitling companies, for the development of LVCSR systems targeting automated subtitling;
  2. to deploy a platform for sharing audiovisual language resources between the media industry and LVCSR developers, using the legal and business data-trading approaches best suited to the media industry;
  3. to show the impact of feeding LVCSR technology with existing audiovisual language resources for automated subtitling purposes.

In order to achieve these goals, SAVAS:

  • collected spoken and textual resources in the languages addressed from the broadcasters and subtitling companies acting as data providers within the consortium;
  • transcribed and annotated the collected corpora, using a combination of automatic and collaborative approaches, into a form suitable for training the acoustic and language models of LVCSR systems;
  • built a local META-SHARE repository containing the collected and annotated SAVAS language resources to allow their reuse;
  • adapted and trained dictation and transcription LVCSR systems with the SAVAS language resources;
  • integrated the developed systems into several automated subtitling application scenarios and evaluated them there, in order to show the impact of audiovisual data sharing on automated subtitling.