
Monitor corpora

 

The understanding of large corpora has increased with experience of handling them, and the application of electronic methods to publishing has made data available in very large quantities. It became clear some years ago that the assumption of a finite limit on the size of a corpus, for any length of time, was an unnecessary restriction. For some uses it is essential to have a corpus of steady size and constitution, but such a sample is easy to devise within a large and constantly moving collection. The question that arose was how to manage the large quantities of data that were foreseen for what is known as a monitor corpus.

The first model was of a corpus held at a constant size, so that the software of the day could cope with it. It would be constantly refreshed with new material, while equivalent quantities of old material would be removed to archival storage. The constitution of the corpus would also remain parallel to that of its previous states.
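The constant-size model can be sketched in a few lines of Python. This is only an illustration: the class name, the word limit, and the whitespace tokenisation are assumptions made for the example, not part of any system described here.

```python
from collections import deque


class FixedSizeMonitorCorpus:
    """Hypothetical sketch of a corpus held at or below a word limit."""

    def __init__(self, max_words):
        self.max_words = max_words
        self.live = deque()   # (text_id, tokens) pairs, oldest first
        self.archive = []     # texts retired from the live corpus
        self.live_words = 0

    def add_text(self, text_id, text):
        tokens = text.split()  # naive whitespace tokenisation (assumption)
        self.live.append((text_id, tokens))
        self.live_words += len(tokens)
        # Move the oldest texts to the archive until the corpus fits again,
        # always keeping at least the newest text in the live corpus.
        while self.live_words > self.max_words and len(self.live) > 1:
            old_id, old_tokens = self.live.popleft()
            self.archive.append((old_id, old_tokens))
            self.live_words -= len(old_tokens)


corpus = FixedSizeMonitorCorpus(max_words=6)
corpus.add_text("t1", "the cat sat on the mat")
corpus.add_text("t2", "a new word appears")
print([text_id for text_id, _ in corpus.live])     # texts still in the corpus
print([text_id for text_id, _ in corpus.archive])  # texts moved to the archive
```

Nothing is deleted in this sketch; old material simply moves from the live collection to the archive, as the model requires.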

This model gave rise to the idea of rate of flow as the best way of managing the corpus. Instead of setting, say, 10 million words as the proper proportion for a given genre, the setting could just as easily be 10 million words a year, or a month, or a week. The language would flow through the machine, so that at any one time there would be a good sample available, comparable to its previous and future states.
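As an illustration of rate of flow, the following Python sketch admits texts against a quota of words per genre per period, rather than against a fixed total. The function name, the period labels, and the tiny quota are hypothetical, chosen only to keep the example readable.

```python
from collections import defaultdict


def sample_by_flow(texts, quota_per_period):
    """Admit texts until each genre's word quota for a period is used up.

    texts: iterable of (period, genre, text) tuples in arrival order.
    quota_per_period: dict mapping genre to the words allowed per period.
    Returns a dict mapping (period, genre) to the list of admitted texts.
    """
    admitted = defaultdict(list)
    used = defaultdict(int)
    for period, genre, text in texts:
        n_words = len(text.split())  # naive whitespace word count (assumption)
        key = (period, genre)
        if used[key] + n_words <= quota_per_period.get(genre, 0):
            admitted[key].append(text)
            used[key] += n_words
    return dict(admitted)


incoming = [
    ("period-1", "news", "prices rose again today"),
    ("period-1", "news", "the minister resigned"),
    ("period-2", "news", "a new season begins"),
]
batches = sample_by_flow(incoming, {"news": 5})
for key, admitted_texts in batches.items():
    print(key, admitted_texts)
```

Because the quota is the same in every period, each period's sample has a similar size and constitution, which is what makes successive states of the corpus comparable.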

Such a model opened up new prospects for those interested in natural language processing, and it added another dimension to contemporary corpora: the diachronic. New words could be identified, and movements in usage, which may signal changes in meaning, could be tracked. Long-term norms of frequency distribution could be established, and a wide range of other types of information could be derived from such a corpus.
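A minimal Python sketch of this kind of diachronic comparison is given below. It assumes two batches of text from different periods and uses simple lower-cased whitespace tokens; the function names and sample sentences are invented for illustration, and real work would need proper tokenisation and far larger samples.

```python
from collections import Counter


def word_counts(text):
    """Count lower-cased whitespace-separated tokens (a crude assumption)."""
    return Counter(text.lower().split())


def new_words(earlier, later):
    """Words attested in the later batch but absent from the earlier one."""
    return sorted(set(later) - set(earlier))


def relative_frequency(counts, word):
    """Occurrences of word as a proportion of all tokens in the batch."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


earlier = word_counts("the web is slow and the mail is slow")
later = word_counts("the web is fast and the blog is new")

print(new_words(earlier, later))
print(relative_frequency(earlier, "slow"), relative_frequency(later, "slow"))
```

Comparing relative rather than raw frequencies allows batches of slightly different sizes to be set side by side.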

Some scholars were less than happy about disposing of the older texts as new ones came in. That problem, however, was neutralised by the fast-expanding power and memory of the machines: there is no longer any need to move text out of modern systems. However, to manage a monitor corpus to the best advantage, it is convenient to divide it into batches of a similar size and constitution.

Over time the balance of components of a monitor corpus will change. New sources of data will become available and new procedures will enable scarce material to become plentiful. The rate of flow will be adjusted from time to time.


