Next: Samples Up: Corpus Typology Previous: Documented

Spoken corpus

There is considerable confusion over the use of this term, and it would be helpful to achieve a consensus. First, it is to be distinguished from speech corpus, a term for a special corpus described above (3.5.2). Then there is a choice. In the usage of some authorities (e.g. Chafe 1995), it means a corpus of informal, impromptu conversation, with no media involvement. On the other hand, it is used by some to mean a corpus of any language whose original presentation is in oral form -- that is, the speakers involved behave in oral mode. If such a text is later presented in written form, without change except for the transcription, it should be classified as spoken -- a BBC Reith Lecture, for example. If, in time, a spoken corpus can be stored as a sound wave as well as a transcript, then such a text may exist in two versions and a special kind of parallel corpus can be introduced.

Similarly, any text composed to be presented in written form can be read out, but its expression need change only in ways required by the change of medium. It therefore remains primarily a written text.

Our preference is for the latter interpretation of `spoken corpus'. There are doubtful areas whichever meaning is chosen -- how impromptu is impromptu, how informal is informal, etc. How does one know whether a composer intends a text to be written or spoken, or both? But to reserve the term for only one small class of spoken language texts seems to distort the meaning of the words involved.

The problem is that informal, impromptu speech is regarded by many scholars as the most important variety of all, closest to the core of language, revealing the characteristic patterns of a language in a way that no other variety does. It is also the most difficult and expensive variety to acquire, and it is difficult to classify and manage. The crudities of transcription make spoken corpora, as held in most centres, unsatisfactory, and there is no consensus as yet about the conventions of transcription. The nearest we have is the recommendations of NERC (NERC 1994).


