The SAVAS project collected spoken and textual resources in six European languages and built domain-specific Large Vocabulary Continuous Speech Recognition (LVCSR) systems to address the automated subtitling needs of the media industry.
More specifically, the main objectives of the project were:
- to improve the acquisition and annotation of audiovisual language resources produced by broadcasters and subtitling companies, for the development of LVCSR systems targeting automated subtitling;
- to deploy a platform for sharing audiovisual language resources between the media industry and LVCSR developers, using the legal and business data-trading approaches best suited to that industry;
- to show the impact of feeding LVCSR technology with existing audiovisual language resources for automated subtitling purposes.
To achieve these goals, SAVAS:
- collected spoken and textual resources in the languages addressed from the broadcasters and subtitling companies acting as data providers within the consortium;
- transcribed and annotated the collected corpora, using a combination of automatic and collaborative approaches, into a form suitable for training the acoustic and language models of LVCSR systems;
- built a local META-SHARE repository containing the collected and annotated SAVAS language resources to allow their reuse;
- adapted and trained dictation and transcription LVCSR systems with the SAVAS language resources;
- integrated the developed systems into several automated subtitling application scenarios and evaluated them there, in order to show the impact of audiovisual data sharing for automated subtitling.