Under-resourced languages suffer from a chronic lack of available resources (human, financial, temporal and data-related) and from the fragmentation of efforts in resource development. This often leads to small resources that are usable only for limited purposes or are developed in isolation, with little connection to other resources and initiatives. The benefits of reusability, accessibility and data sustainability are, more often than not, out of reach for such languages.
Yet these are the languages that could profit most from emergent collaborative approaches and technologies for language resource development. Given the high cost of language resource production, and given that in many cases the manual construction of resources cannot be avoided (e.g. if accurate models are required or if evaluation is to be reliable), it is worth considering the power of social and collaborative media to build resources, especially for those languages for which experts have so far built few or no language resources.
Collaborative, Web 2.0 and Web 3.0 / Semantic Web methods and methodologies for data collection, annotation and sharing seem particularly well suited to collecting the data needed for the development of language technology applications for under-resourced languages. Indeed, the collaborative accumulation and creation of data appears to be the best and most practicable way to achieve better and faster language coverage, and in purely economic terms it could well deliver a higher return on investment than expected. Moreover, it is a good way to reach small populations of speakers who live in remote regions or are scattered in diasporas all over the world.
Some specific questions that the workshop will address include the following:
- How can collaborative approaches and technologies be fruitfully applied to the development and sharing of resources for under-resourced languages?
- How can small language resources be re-used efficiently and effectively, reach larger audiences and be integrated into applications?
- How can they be stored, exposed and accessed by end users and applications?
- How can research on such languages benefit from semantic and Semantic Web technologies, and specifically from the Linked Data framework?
We therefore specifically encourage submissions about:
- Experiences in the creation of Linked Open Data and/or Linguistic Linked Open Data for under-resourced languages;
- Using existing Linked Open Data knowledge resources such as DBpedia, Freebase, YAGO, Lexvo, schema.org, etc. in semantics-driven approaches to resource development for under-resourced languages;
- Scaling existing language resource infrastructures to thousands of languages;
- Crowd-sourcing of linguistic data and annotations;
- Collaborative bootstrapping of language resources and language technologies (LRTs) for under-resourced languages from existing LRTs for better-resourced languages;
- Mining the Web and social media for linguistic data;
- Developing and/or using language-independent software frameworks for under-resourced languages and other collaborations across language groups;
- Ethical, sociological and practical issues in collaborative approaches and technologies;
- Usability of existing infrastructures for the development of collaboratively created resources.