Comparable corpora


A comparable corpus is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora. One of the clearest is ICE, the International Corpus of English (Greenbaum 1991). Corpora of around one million words in each of many varieties of English around the world are being assembled following the same model, which prescribes genres and the target quantity of words to be gathered in each. Originally, the corpora were all to be gathered in the same year.

A comparable corpus makes it possible to compare different languages or varieties in similar circumstances of communication, while avoiding the inevitable distortion introduced by the translations of a parallel corpus.

Note: `Multilingual corpora'
At present there are no multilingual corpora apart from parallel and comparable corpora; there are plenty of centres that have collected text material in several languages, and some of these collections are corpora in their own right. But unless the collections share common features of selection, at least at the level of the comparable corpus, then they are just text resources in different languages. It therefore seems unhelpful to use the term `multilingual corpus'.