Reference corpora


A reference corpus is one that is designed to provide comprehensive information about a language. It aims to be large enough to represent all the relevant varieties of the language, and the characteristic vocabulary, so that it can be used as a basis for reliable grammars, dictionaries, thesauri and other language reference materials. The model for selection usually defines a number of parameters that provide for the inclusion of as many sociolinguistic variables as possible and prescribes the proportions of each text type that are selected. A large reference corpus may have a hierarchically ordered structure of components and subcorpora.

Questions of balance and representativeness recur in the discussion of reference corpora. Both notions are extremely difficult to define, and yet fairly easy to work with. While it is not normally claimed that there is a core variety of a language, there appear to be a large number of heavily overlapping varieties, sharing the bulk of their vocabularies and almost all the syntactic rules. They are differentiated by marginal vocabulary items and by slight individuality of phraseology. Some general features, associated with such things as formality, speech, preparedness and broad subject-matter, group them in people's minds, and a rough kind of representativeness is achieved by ensuring that a large quantity of text exemplifies each of these parameters.

Special corpora are made up of texts that do not overlap as much with the large central pool. To be clearly 'in a language' they must show quite a number of the grammatical and lexical features of that language, but the markedness of the patterns unique to them serves to differentiate them clearly from the general varieties of the language. In due course, and with the growing influence of internal criteria, reference corpora will be used to measure the deviance of special corpora.

Reference corpora are at the heart of the future development of corpus-based work in Europe and elsewhere. Reference corpora in several languages, constructed on similar principles, form a group of comparable corpora (see 10 below).

Example: The Bank of English

The Bank of English is a reference corpus. From the total holdings in Birmingham, a corpus is identified from time to time and made available world-wide via the Internet, with appropriate software. At present the corpus contains around 167 million words, soon to top 200 million. This is divided into several subcorpora:

Newspapers 43 million words
Books 37 million words
Magazines 38 million words
Radio 39 million words
Ephemera 1.5 million words
Informal Spoken 8.5 million words

The first four subcorpora are samples from fairly plentiful material (though magazines have only recently become available). They are kept roughly comparable in size, centring on 40 million words. There is a lot more in the vaults, so to speak -- perhaps 150 million words of newspaper text alone -- but only a proportion is on-line.

The Ephemera subcorpus has to be rekeyed from a wide variety of pamphlets and brochures, and is laborious and expensive to acquire. The Informal Spoken corpus, though dwarfed in size by the others, is possibly the largest available of its kind.

In turn these are broken down into further subcorpora (for Newspapers, for example, UK is a subcorpus) and then into components such as The Times (10m), The Guardian (12m), etc. The Radio subcorpus is divided into BBC (18m) and American NPR (21m), and each of these is broken down into components by date. The Magazines subcorpus picks out The Economist (7m) and the New Scientist (3m), and puts the rest together as a general subcorpus.
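The arithmetic behind this hierarchy can be checked with a small sketch. The figures are those quoted above, in millions of words; the names SUBCORPORA, BREAKDOWN, total_size and remainder are illustrative assumptions and not part of the Bank of English software. Only the breakdowns that the text gives in full are modelled, and the size of the general Magazines subcorpus is derived by subtraction rather than quoted.

```python
# Top-level subcorpus sizes in millions of words, as listed in the text.
SUBCORPORA = {
    "Newspapers": 43.0,
    "Books": 37.0,
    "Magazines": 38.0,
    "Radio": 39.0,
    "Ephemera": 1.5,
    "Informal Spoken": 8.5,
}

# Named components quoted in the text (hypothetical structure, real figures).
BREAKDOWN = {
    "Radio": {"BBC": 18.0, "NPR": 21.0},
    "Magazines": {"The Economist": 7.0, "New Scientist": 3.0},
}


def total_size(sizes):
    """Sum the sizes of a group of subcorpora or components."""
    return sum(sizes.values())


def remainder(subcorpus):
    """Words in a subcorpus not covered by its named components."""
    named = total_size(BREAKDOWN.get(subcorpus, {}))
    return SUBCORPORA[subcorpus] - named


if __name__ == "__main__":
    print(total_size(SUBCORPORA))   # whole corpus, millions of words
    print(remainder("Magazines"))   # the general Magazines subcorpus
    print(remainder("Radio"))       # BBC and NPR account for all of Radio
```

On these figures the six subcorpora sum to the 167 million words stated for the corpus, the two Radio components account for the whole of Radio, and the general Magazines subcorpus comes to 28 million words.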

The retrieval software allows the user to consult the whole corpus or any grouping of its components, either temporarily or as the default for that user. There is no restriction on the combinations of characters and codes that can be selected, and queries may include gaps, wildcards and varying sequence. Co-occurrences are particularly easy to examine, either within a putative sublanguage or in the full range of the corpus, for building translation tools or lexicons.
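The idea of examining co-occurrences within a chosen grouping of components can be illustrated with a minimal sketch. This is not the Bank of English retrieval software, whose interface is not described here; the functions cooccurrences and search, the window parameter and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrences(tokens, node, window=4):
    """Count words found within `window` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def search(corpus, components, node, window=4):
    """Pool co-occurrence counts over a user-selected grouping of components."""
    total = Counter()
    for name in components:
        total += cooccurrences(corpus[name], node, window)
    return total


# Toy data: component name -> tokenised text (invented for this example).
TOY = {
    "times": "the bank raised interest rates".split(),
    "bbc": "the river bank was flooded".split(),
}

if __name__ == "__main__":
    # One component only, then the full toy corpus.
    print(search(TOY, ["times"], "bank", window=1))
    print(search(TOY, ["times", "bbc"], "bank", window=1))
```

Restricting the list of component names corresponds to consulting a putative sublanguage, while passing every name corresponds to the full range of the corpus.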

