
Sublanguages

Sublanguage is an important concept in natural language processing. It is assumed that by narrowing the subvariety, usually in a specialised communicative context, the actual structure of the language will simplify, and thus become more amenable to automatic processing. The vocabulary, too, is restricted and often specialised. There are corresponding constraints at semantic, conceptual and pragmatic levels (Harris 1988).

A sublanguage is thus defined simultaneously by internal and external criteria, but the internal criteria are crucial. It remains to be seen whether the external and internal criteria actually match in practice. The study of genre and of LSP (languages for special purposes) shows that writers conform to quite elaborate prescriptions when composing in a technical or professional context, so it is not surprising to find many similarities.

A sublanguage is at the other end of the linguistic spectrum from a reference corpus (see 7). The homogeneity of its structure and its specialised lexicon make it useful for NLP purposes and allow the quantity of data to be kept small; that is, a sublanguage typically demonstrates good closure properties. The concept of sublanguage is to be distinguished from those of artificial language and reduced language. Both of these are deliberately created, whereas sublanguages evolve naturally (although at the level of terminology there may be deliberate acts of creation).
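
One rough way to look at the closure property mentioned above is to track how many distinct word types have been seen as more tokens are read. The following Python sketch is an illustration only and is not taken from the cited work: the function names `vocabulary_growth` and `tail_growth_rate`, the lower-casing of tokens, and the sampling interval `step` are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def vocabulary_growth(tokens: Iterable[str], step: int = 1000) -> List[Tuple[int, int]]:
    """Return (tokens_read, distinct_types_seen) pairs sampled every `step` tokens.

    Hypothetical helper: tokens are lower-cased before counting, which is
    one possible normalisation among many.
    """
    seen = set()
    curve = []
    count = 0
    for count, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        if count % step == 0:
            curve.append((count, len(seen)))
    # Record a final point when the text length is not a multiple of `step`.
    if count and count % step != 0:
        curve.append((count, len(seen)))
    return curve


def tail_growth_rate(curve: List[Tuple[int, int]]) -> float:
    """New types per token between the last two sample points of the curve."""
    if len(curve) < 2:
        raise ValueError("need at least two sample points")
    (n1, t1), (n2, t2) = curve[-2], curve[-1]
    return (t2 - t1) / (n2 - n1)


if __name__ == "__main__":
    # Toy input standing in for a tokenised sublanguage text.
    text = "open the valve close the valve open the valve".split()
    curve = vocabulary_growth(text, step=3)
    print(curve)
    print(tail_growth_rate(curve))
```

Under these assumptions, a curve that flattens early (a tail growth rate near zero) is what one would expect of a sublanguage sample, whereas general-language text would normally keep yielding new types for much longer.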

Increasingly, the natural language processing community is finding that it needs access to corpora containing sublanguage material, in order to build systems capable of handling specialised texts (McNaught 1993). Under our previous definition, corpora consisting of sublanguage material are special corpora.