Preliminary Recommendations


Mode normally divides into spoken and written. We have added a third, electronic, to emphasise that language transmitted in electronic media is not quite the same as the older established modes. There is no universal agreement about exactly how the special status of text originating as electronic files should be related to the two long-established modes, and the point should be followed up in order to achieve a consensus if possible. Certainly electronic text shows internal features which are different from both traditional written and traditional spoken text, and the social role of such text is also new and different; for these reasons a third mode is at least temporarily posited here.

The spoken mode is included in this typology and is dealt with at greater length below (see also Leech et al. (eds), 1995, for an up-to-date account of issues involving spoken corpora).

This typology follows the EAGLES Recommendations on Corpus Typology in order to avoid confusing the modes. As Firth (1957) said, written language has ``the implication of utterance'', and equally, any policeman knows that anything you say can be written down. So the only way of keeping the two apart is to classify according to the actual mode of the material when selected for inclusion in the corpus. Hence descriptions like `written-to-be-spoken' are not to be found below; they are too close to the operations of the Intentional Fallacy.

Spoken language is often thought to be less formal than written, more impromptu. This may be statistically true in a broad general perspective, but there is plenty of formal spoken language around, on the radio and television, in legal and official circles and in diplomacy. Equally there is still a lot of very informal correspondence, despite the invention of e-mail.

Sinclair makes this point with a figure, reproduced here, of audience size related to genres of spoken and written material (Leech et al. (eds), 1995). The formality level of a piece of language probably owes more to such factors as audience size and occasion than it does to the mode.

spoken                              BIG                         written

      radio/tv                  1,000,000s               newspapers
        local radio               100,000s             magazines/books
          rallies                  1,000s            notices
            lectures                100s           local publications
              classrooms             10s         workplace records
                discussions          5s        circulation lists
                  interviews         3s      working groups
                    conversation     2s    private letters


Following the EAGLES recommendations on corpus typology, there is a conscious attempt in this report to avoid mixed categories, because the classification rapidly becomes meaningless and arbitrary if mixing is permitted. It has already been pointed out that anything that is said can be written down, and anything written can be read out. Either speech or writing can be converted to electronic form with ease. More and more texts are available in more than one form, for example correspondence which is composed on a word-processor and then printed out for transmission by ordinary post, or newspaper archives put on CD-ROM.

However, texts are rarely composed for transmission simultaneously in more than one mode, and where a text is found to exist in two or all three modes it is usually easy to identify the primary mode. A conversation that has been recorded, transcribed and keyed into a computer exists in all three modes, but if it is chosen for the corpus because it is a conversation, and if researchers consider that its primary mode is spoken, then it fills a place in the spoken dimension of the corpus. The written transcription can usually be discarded if it contains no more information than the electronic version, but the original sound recording will normally be preserved, because no transcription is an adequate substitute (Leech et al. (eds), 1995).

Computers are already able to hold texts in one or more modes, and in nonlinguistic formats as well (e.g. the sound track of a conversation, or a digitised facsimile of the written page), and to correlate the different versions; in the near future it will become commonplace to hold alternative versions, so the corpus design will have to be ready with methods for multiple classification of texts.

For a meaningful classification we have to go back to the design of the corpus and the place in that design that a given text is chosen to fill. The corpus builder must be clear enough in his or her mind which is the primary mode, and distinguish that from any others.
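
The principle of one primary mode per text, with other versions held alongside it, can be illustrated by a minimal sketch. It is offered only as one possible arrangement, not as a recommended implementation: the names `Mode`, `CorpusText`, `add_version` and `count_by_primary_mode`, and the identifiers and file names in the example, are all hypothetical and do not come from the EAGLES recommendations.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    """The three modes posited in this typology."""
    SPOKEN = "spoken"
    WRITTEN = "written"
    ELECTRONIC = "electronic"


@dataclass
class CorpusText:
    """One corpus entry (hypothetical record layout).

    Each text has exactly one primary mode, fixed by the place it was
    chosen to fill in the corpus design. Other versions of the same text
    are recorded separately and do not affect that classification.
    """
    text_id: str
    primary_mode: Mode
    versions: dict = field(default_factory=dict)  # description -> location

    def add_version(self, description, location):
        """Register an alternative version, e.g. a sound recording."""
        self.versions[description] = location


def count_by_primary_mode(texts):
    """Count texts per mode, using the primary mode only (no mixed categories)."""
    counts = {mode: 0 for mode in Mode}
    for text in texts:
        counts[text.primary_mode] += 1
    return counts


# A conversation chosen because it is a conversation: its primary mode is spoken,
# even though a recording and an electronic transcription are both held.
conversation = CorpusText("conv-001", Mode.SPOKEN)
conversation.add_version("sound recording", "conv-001.wav")
conversation.add_version("electronic transcription", "conv-001.txt")

letter = CorpusText("letter-001", Mode.WRITTEN)

counts = count_by_primary_mode([conversation, letter])
```

Because the primary mode is a single required field and the alternative versions live in a separate mapping, a text can be held in several forms without ever falling into a mixed category.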

All texts in a corpus end up in electronic form, but that is not the reason for establishing the category of electronic mode in this typology. The reason is that a large body of text material is now appearing only in electronic form, or apparently primarily in electronic form; the communicative features of this new means of transmission lead to distinctive choices of form and style that are not found in either spoken or written texts.

There are, and there always have been, controversial categories between the spoken and written forms -- for example a playscript. Although written down, and often printed, no-one disputes that the intention of the playwright is eventually to hear it spoken in performance. However, a successful play will be produced many times, with many alterations to the text; each performance of each production is a distinct artistic experience, and can be recorded and archived for good reason. The original text of the play, however, is not affected by these spoken performances (unless the author alters the text, and thus puts the text through different editions).

Closely related to the playscript is dialogue in the novel -- a representation of spoken language, but one which is not intended in the first instance to be presented in the spoken mode; if, as so often nowadays, the author has his or her eye on a screen or TV adaptation, a new text will be prepared, probably by someone else, and the dialogue may change substantially. Some authors, e.g. Steinbeck, write novels that have most of the characteristics of playscripts, while others, from Faulkner to Runyon, write novels whose language is written to represent the speech of the narrator.

These are complex categories, and capable of great variation; many authors experiment with forms, and there are now electronic novels, where the reader participates in the choice of text he or she reads (there was at least one paper novel written on the `programmed learning' model, where the reader had choices to make at various points). Provision is made in the typology for `special' categories, which are already institutionalised and separately defined, and these can be used to maintain the clarity of classification without requiring the possibility of general mixed categories.

