Preliminary Recommendations


Style is a notorious term, because it is used in so many different ways by researchers from several disciplines, and has popular meanings as well. It is used here to mean the way texts are internally differentiated other than by topic, mainly by the presence or absence of some of a large range of structural and lexical features. Some features are mutually exclusive (e.g. verbs in the active or passive voice), and some are preferential (e.g. politeness markers and mitigators).

As with topic, there are no institutionalised schemata to which we can turn. Although a great deal is said about style, and there are several parameters of organisation proposed in the literature, there are no agreed standards for any one parameter. For example, most students of language believe that a parameter of formality is required, and terms like `formal', `informal', `colloquial', etc., are freely used and not always well defined. In one of the most influential proposals of its time, Joos (1961) set out five levels: `frozen', `formal', `consultative', `casual' and `intimate'. However useful this was, Joos might just as easily have proposed four, six, seven or twenty, because the motivation for five levels was mainly convenience.

Halliday et al. (1964) distinguish between registers according to three dimensions: `field of discourse', `mode of discourse' and `style of discourse' (in later work Halliday changed the label `style' to `tenor'). Style is used here to refer to the relations among the participants. Halliday suggests a primary distinction into `colloquial' and `polite', which is adopted in many of today's dictionaries. He also claims that the styles of discourse must be treated as a cline with categories such as `casual', `intimate' and `deferential'.

In most of today's dictionaries there is a classification of certain words according to style. The most common distinction is between formal and informal language. Even within these broad categories, one dictionary may classify a word as formal where another may not. Since such decisions rest with the lexicographer, it is not surprising that inconsistencies exist between interpretations of the style of a word. `Formal' is defined as a label which

[...] usually means that this is a word which is most likely to be used in highly formal writing or in formal speech read aloud.
(Summers et al., 1993)

Outside lexicography, categories such as `formal' and `informal' are usually defined by the type of context of situation in which the language is found, for example formal language will be found in forms, pronouncements, official documents and the like.

Some dictionaries (e.g. Collins Robert (Atkins et al. (eds), 1987)) have grades of formality or informality so that there are sub-groups of style within the formal and informal labels. For example we find a class for a word which

while not forming a part of standard language, is used by all educated speakers in a relaxed situation but would not be used in a formal essay or letter, or on an occasion when the speaker wishes to impress.

Further down the cline is a word which

indicates that the expression is used by some but not all educated speakers in a very relaxed situation.

What would then qualify as a `relaxed situation', and is it expected to be the same for all these `educated speakers'?

In Webster's Dictionary (Woolf et al., 1976) the style labels are `slang', `nonstandard' and `substandard', each defined by its relation to a standard or norm. Most of the dictionaries provide style labels for words which are not thought to be part of standard language. Some dictionaries go as far as to provide labels for the particular sub-language in which a word is predominantly used. Sansoni (Macchi (ed), 1988) offers a whole range of style labels which situate words in different sub-languages, for example linguaggio burocratico, commerciale, cinematografico, giornalistico, infantile, marinaro, scolastico, universitario and so on; the list is substantial.

`Literary' style is also a popular class of style in dictionaries. Alongside this we also find `poetic' style, but it is not clear how the two relate. Is poetic style a sub-class of literary style? Do we assume both can be considered under Webster's class of `nonstandard'? Do we then class `slang' as a style within `substandard', with `informal' as a kind of bridge between the two? In which case, where would we place the category `offensive', which is also a common style label? Is this on the opposite pole from `polite'? If so, how does it relate to Halliday's term `colloquial', which is the alternative to `polite'?

There seem to be so many different style labels used in dictionaries that it is difficult to see how they relate to one another in the proposed cline between the colloquial and polite, informal and formal. There is a whole mass of categories which accompany the informal/formal distinction, though the structure or hierarchical value of each is unclear. Other style categories used such as slang, colloquial, common, polite, rare, popular are not often defined by the dictionaries that use them, as if it is assumed that they have self-explanatory boundaries.

In practical lexicography, where compilers face decisions about style classification on a daily basis, it is clear that there is absolutely no natural consensus about the formality of an expression. People are all different, for a start, and likely to have different baselines from which they make their judgements. They have different attitudes to propriety in language use, and they have different language experience, where the influence of their ages and the regions of their upbringing play a part. They are also unsure of the way in which the terminology should be applied. Is it good or bad or neutral to be colloquial? Where does slang fit in?

There is also an underlying issue that never resolves itself -- some choices made in writing are considered informal, but the same choices in speech are just neutral. Is there a systematic displacement of the spoken language with respect to the written so that its realisations are always a notch or two down in formality from writing? Definitive answers to such questions require an alignment of internal and external criteria for which we will have to wait for some time, but the questions indicate the lack of clarity that characterises the description of style.

Style is the other criterion for which we recommend an automatic analysis based on linguistic criteria, and we recognise that no suitable package is yet available. There are signs of progress in the statistical analyses of texts such as those already being used in literary and forensic linguistics for questions of authorship. What is important for the purposes of a text typology is that the criteria for categorisation according to style be established objectively.

Biber (1988) offers a methodology for the objective grouping of variations in English texts through statistical analyses. The analysis is based on the identification of different clusters of linguistic features across a range of written and spoken English texts. Using a technique called factor analysis, he successfully identifies the linguistic characteristics of texts by an objective analysis of the language data. Biber's fundamental claim is that the frequent co-occurrence of a group of linguistic features in texts is an indication of an underlying function shared by those features. Therefore we can identify which linguistic features consistently group together to perform a particular communicative function, which features co-occur and which features are mutually exclusive. We can interpret these results to establish a correlation between variations in linguistic groups and function. Not only that, but we can use the analysis to objectively define a set of texts which belong to each variation in the English language, with the intention of using this kind of classification technique to categorise new texts for inclusion in a corpus. We envisage, however, that the categories would be continually refined and not fixed, to cater for the steady flow of language into and out of the corpus.

Biber, who sets out to identify fundamental variations between speech and writing, works with texts taken from the LOB (Lancaster-Oslo-Bergen) corpus of written English and the London-Lund Corpus of spoken English. These texts have previously been categorised into various genres for the corpus, based, we assume, on external criteria. Categories in the LOB corpus include such genres as `press reportage', `editorials', `press reviews', `religion', `skills and hobbies', `popular lore', `biographies' and so on; in the London-Lund corpus categories are `face-to-face communication', `telephone conversation', `planned speeches', `broadcasts', and so on.

Biber differentiates between `genre' which he uses to describe categorisation according to external criteria, and `text type' to refer to groupings of texts that are similar with respect to their linguistic form, irrespective of genre categories.

The first step in Biber's methodology for definition of text type through internal linguistic criteria is to review any previous research on linguistic features which will identify potentially important features. Here the aim is not to establish which are the more important linguistic features (since this should be done objectively through statistical analysis of the data) but to provide as wide a range as possible of potentially significant linguistic features. Biber identifies 67 linguistic features of English text which he includes in the analysis (Biber, 1988: 86-87 offers a full list). Having chosen the texts to work with, the investigator must then obtain frequency counts of all 67 linguistic features in each of the texts used in the analysis. These figures must then all be normalised for a certain text length (here 1,000 words) to ensure comparability. The mean, minimum and maximum frequencies, the range between these frequencies and the standard deviation are all calculated before the factor analysis begins. Features with very low frequency of occurrence are discarded as insignificant.
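The normalisation and descriptive statistics described above can be illustrated with a minimal sketch. The counts and text lengths are invented for illustration, the function names are our own, and the use of the population standard deviation is an assumption (the sample standard deviation would serve equally well).

```python
from statistics import mean, pstdev


def normalise_per_1000(raw_count, text_length):
    """Convert a raw feature count into a rate per 1,000 words of text."""
    return raw_count * 1000 / text_length


def describe(values):
    """Return the descriptive statistics named in the text for one feature."""
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "sd": pstdev(values),  # assumption: population standard deviation
    }


# Hypothetical data for one feature: (raw count, text length in words).
texts = [(30, 2000), (12, 1500), (45, 2500)]

rates = [normalise_per_1000(count, length) for count, length in texts]
summary = describe(rates)

print(rates)
print(summary)
```

Without the normalisation step the longest text would appear to use the feature most simply because it contains more words; after it, the three texts can be compared directly.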

The aim of the factor analysis is to identify groups of linguistic features which co-vary. To say they co-vary does not necessarily mean that they co-occur but that there is a definite correlation (or inverse correlation) between their frequency counts in the texts. This means that it is equally important to learn that two features co-occur as it is to learn that two features are mutually exclusive, or that the presence of one will point to the absence of the other.
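The notion of co-variation can be made concrete with a correlation coefficient. The following sketch computes Pearson's r between normalised frequencies; the three features and their values are hypothetical, chosen only so that one pair rises together and the other pair moves in opposite directions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical rates per 1,000 words for three features across five texts.
feature_a = [10.0, 20.0, 30.0, 40.0, 50.0]
feature_b = [12.0, 19.0, 33.0, 41.0, 48.0]  # tends to rise with feature_a
feature_c = [45.0, 38.0, 31.0, 18.0, 9.0]   # tends to fall as feature_a rises

r_ab = pearson(feature_a, feature_b)  # strongly positive
r_ac = pearson(feature_a, feature_c)  # strongly negative

print(round(r_ab, 3), round(r_ac, 3))
```

Both results are equally informative for the analysis: a coefficient near +1 signals features that tend to appear together, and one near -1 signals features where the presence of one points to the absence of the other.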

In this way Biber establishes factors, which are sets made up of different linguistic criteria. The factor will include both negative and positive `weightings' which indicate linguistic features which co-occur (positive weightings) or whose presence marks the absence of the other (negative weightings). The most significant features are those which have the largest weighting, irrespective of whether this shows attraction or repulsion. Biber ultimately establishes 7 factors which are representative of the texts in the study and are mutually exclusive.
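To show how positive and negative weightings arise from a table of correlations, here is a minimal sketch. It is not the factor analysis that Biber uses: as a simpler stand-in, it extracts only the first principal component of a correlation matrix by power iteration. The correlation values and feature names are hypothetical.

```python
from math import sqrt


def first_component_loadings(corr, iterations=200):
    """Approximate loadings on the first principal component of a
    correlation matrix, using power iteration (a stand-in for full
    factor analysis)."""
    n = len(corr)
    vec = [1.0 / (i + 1) for i in range(n)]  # arbitrary asymmetric start
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(p * p for p in product))
        vec = [p / norm for p in product]
        eigenvalue = norm  # converges to the largest eigenvalue
    if vec[0] < 0:  # the sign is arbitrary; make the first feature positive
        vec = [-v for v in vec]
    return [v * sqrt(eigenvalue) for v in vec]


# Hypothetical correlations among three features.
names = ["feature_a", "feature_b", "feature_c"]
corr = [
    [1.0, 0.8, -0.7],
    [0.8, 1.0, -0.6],
    [-0.7, -0.6, 1.0],
]

loadings = first_component_loadings(corr)

# Rank by absolute size: strength matters regardless of sign.
ranked = sorted(zip(names, loadings), key=lambda pair: abs(pair[1]), reverse=True)
for name, loading in ranked:
    print(f"{name}: {loading:+.2f}")
```

In this toy case feature_a and feature_b receive positive weightings and feature_c a negative one, reflecting that the first two co-occur while the third tends to be absent when they are present.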

We now turn to the interpretation of these factors, which have been identified from the texts and which will, in turn, be used to identify styles of text. Up to this point identification of the features, calculation of their frequencies and the extent of their co-variation has been objective and automatic. The factors which have been established are then interpreted by Biber to show the communicative functions to be associated with each factor. The reader is referred to Biber (1988: 104-114) for a full interpretation of the factors identified. For convenience we show here some examples of the kind of results obtained.

He finds, for example, that in the first factor there is a high weighting of nouns, private verbs, present tense, pronouns, WH-questions and so on. This is what the statistics tell us. Biber then interprets these patterns of linguistic features in terms of the style of the text. A high density of nouns (the primary bearers of referential meaning in a text), for example, indicates a great density of information. Longer word length suggests more specific, specialised meanings, and the type/token ratio also points to a high density of information as well as very precise lexical choices resulting in an exact presentation of informational content. An interactive style is indicated by a high weighting of present tense forms indicating a verbal rather than a nominal style, also of pronouns and WH-questions. The interpretation continues but there is no need to go into such detail here. The important point is the value of such analysis in the identification of style in text.

Biber relates the factor to the text and the text is then given a factor score. This enables us to relate the linguistic groupings of the first factor, for example, to two separate communicative parameters, namely that the primary purpose of the writer/speaker is informational while adopting an interactive, affective and involved style; and secondly that the production circumstances are such as to allow careful editing possibilities, allowing precise lexical choices and an integrated textual structure.
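One way of arriving at a factor score for each text can be sketched as follows. We assume here that the score is formed by standardising each feature's normalised frequency across the texts, then adding the standardised values of the positively weighted features and subtracting those of the negatively weighted ones. The feature names, frequencies and assignment of features to the factor are hypothetical.

```python
from statistics import mean, pstdev


def standardise(values):
    """Convert frequencies to z-scores (mean 0, standard deviation 1)."""
    centre = mean(values)
    spread = pstdev(values)
    return [(v - centre) / spread for v in values]


def factor_scores(rates, positive, negative):
    """Score each text on one factor: sum of z-scores for positively
    weighted features minus sum of z-scores for negatively weighted ones."""
    z = {name: standardise(values) for name, values in rates.items()}
    n_texts = len(next(iter(rates.values())))
    return [
        sum(z[f][t] for f in positive) - sum(z[f][t] for f in negative)
        for t in range(n_texts)
    ]


# Hypothetical rates per 1,000 words for three features in three texts.
rates = {
    "feature_a": [10.0, 20.0, 30.0],
    "feature_b": [5.0, 7.0, 9.0],
    "feature_c": [30.0, 20.0, 10.0],
}

scores = factor_scores(
    rates,
    positive=["feature_a", "feature_b"],  # assumed positive weightings
    negative=["feature_c"],               # assumed negative weighting
)

print([round(s, 2) for s in scores])
```

Standardising first prevents an intrinsically frequent feature from dominating the score; texts can then be placed along the dimension and compared with one another, whatever genre they were originally assigned to.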

The dimensions identified by Biber are then related to the genres which have already been pre-established for classification of texts in the corpora used. It becomes obvious that these genres are not coherent in terms of their linguistic features. Some dimensions are equally applicable to various genres, while within one genre there may be a wide range of variation. As an example, Biber shows that within `academic prose' there is a wide range of variation, as there is within the genre `conversation'. This would indicate that we cannot take these genres as representative of written and spoken English respectively, nor can we take the analysis of any one text to be representative of the genre. The study also shows that there is no one, absolute difference between speech and writing, but that there are, instead, several dimensions of variation which are manifested in both.

The dimensions identified in this study succeed in defining a set of relations among texts which can be used for an overall text typology. As Biber points out, since the texts in the study cover a wide range of discourse types in English as well as the linguistic features of many communicative functions, the dimensions of variation that he establishes provide parameters of linguistic variation which he claims exist among English texts as a whole.

The implications of such research are many. The methodology outlined here has already been applied to various other kinds of research: the comparison of dialects within English, the identification of linguistic variation between British and American English through analysis of lexical and syntactic features, and stylistic comparisons not only between specific authors but also across the historical evolution of English written texts. Particularly relevant in the discussion of style is the stylistic research done by Biber & Finegan (1988), who use the technique outlined here in conjunction with cluster analysis to identify eight styles of stance in English texts, e.g. `Cautious', `Secluded from Dispute', and so on. Within a multilingual environment, the consequences of this kind of analysis are important: the approach can be used for other languages, if similar features for classification can be identified, and can therefore provide comparable text types across languages.

Biber works with a considerable number of linguistic features; by contrast Nakamura has developed techniques for examining the classificatory role of individual features, using Hayashi's Quantificational Method, Type Three. In a series of papers (Nakamura, 1986; Nakamura, 1987; Nakamura, 1992; Nakamura, 1993) he has shown how texts, genres and corpora can be differentiated according to the incidence of chosen features.

The basis of Nakamura's method is the means by which a large number of individual observations or classifications of language can be grouped together to show broad general tendencies. Although the statistical processing allows up to 14 different parameters of grouping (called `axes'), in practice the top three usually account for the vast bulk of the data; also since the results are easiest to interpret when presented in the form of a diagram, three dimensions is the most complex structure that can be shown with ease. Nakamura's diagrams show graphically the relative distances of linguistic items from each other, or the relative distances from each other of texts or corpora containing these items.

Nakamura's techniques can be brought in to classify in a wide range of circumstances: the distribution of pronouns in texts, the distribution of grammatical tags, or the distribution of the vocabulary. There can be a lot of corpus material or a little (Nakamura works on corpora from 1 million words to 200 million), and the analysis can use a single simple criterion or a complex set of them.

From the above discussion it can be concluded that the specification of style in terms of internal criteria is feasible, but that work has only just started. Biber's classifications depend on the relevance of earlier, individual studies, and have not yet incorporated any feedback from large corpus analysis, which promises to offer new dimensions. Nakamura's most recent work (Nakamura & Sinclair, 1995) investigates collocational patterns in multi-million word corpora, showing that his analytical procedure can cope with such a scale. But these are only shafts of light in a vast darkness. As Biber recommends, progress will best be made by frequent cross-checking between internal and external criteria so that each establishes a framework of relevance for the other.

