Transcription systems

The process of prosodic encoding can be defined as the symbolization of the linguistically relevant variations that occur in the domains of time, frequency and intensity in the sound wave corresponding to a speaker's utterance. Encoding implies deciding which variations in the physical parameters of the speech wave carry linguistic information and finding a way to describe them by means of a symbolic system. Since physical parameters such as frequency and intensity vary continuously over time, symbolic coding also implies converting continuous information into a set of discrete units. Thus, symbolic coding of prosody involves at least two different levels of abstraction: a linguistic interpretation of changes in physical properties of the speech wave, and a classification of these changes into discrete categories. Finally, a notational system has to be designed in order to represent these categories. The review of events transcribed in the tradition of pragmatics, discourse and conversation analysis (section 2.3.1) has shown that there is a clear need for such a notational symbolic system in these areas.

For detailed surveys of prosodic transcription and encoding systems the reader is referred to Llisterri (1994b) (available at URL http://www.lpl.univ-aix.fr/projects/multext/CES/CES2.html), from which most of the material in this section is taken; Grønnum Thorsen (1987); Léon & Martin (1970), which contains a chapter devoted to classical approaches to prosodic transcription; and Gibbon (1989), which reviews most of the work in this area carried out within the SAM (Speech Assessment Methodologies) project. A discussion of this topic is also found in the text representation chapter of the EAGLES Handbook on Spoken Language Systems (EAGLES Spoken Language Working Group, 1995).

A great diversity of proposals exists in the field of pragmatics, discourse and conversation analysis, as has been mentioned before. Examples of notational conventions can be found in the literature reviewed in section 2.3.1. All those systems share the fact that the transcription is based on conventional spelling, enriched with some conventions to represent information that is present in the spoken discourse but cannot be conveyed by means of normal spelling conventions. Symbols representing intonation unit boundaries, terminal pitch direction, accent, accent unit boundaries, pitch movements and pauses are used in those systems.

Within the corpus linguistics tradition, Leech (1991) reports that notable exceptions to the lack of prosodic coding in spoken corpora are the London-Lund Corpus (LLC), described in Svartvik (Ed.) (1990), and the Lancaster/IBM Spoken English Corpus (SEC), described, for example, in Knowles & Lawrence (1987). An example of the kind of work carried out in prosodic coding in corpus linguistics is found in the papers published by Knowles (1991) and by Wichmann (1991) using the SEC. As mentioned before, SEC has recently been converted to MARSEC, and part of the project has consisted in the alignment of the prosodic annotation (Knowles, 1995; more information is found at URL http://midwich.reading.ac.uk/research/speechlab/marsec/marsec.html). The annotation is based on a tonetic stress mark system, within which types of accent, tone-unit boundaries and nuclear and non-nuclear syllables are distinguished.

The Text Encoding Initiative (Sperberg-McQueen & Burnard (Eds.), 1994) considers the transcription of prosodic phenomena including pauses, tone units or intonational phrases, and shifts, a shift being defined as the point at which some paralinguistic feature (tempo, pitch range, tension, rhythm and voice quality) of a series of utterances by any one speaker changes. The TEI also provides an example of a set of prosodic features for the representation of stress and pitch patterns that can be defined and documented by the transcriber.

The proposals of French (1992) adopted by the Network of European Reference Corpora (NERC) include prosodic information in Level Three and Level Four (see 2.3.2). In Level Three tone boundaries and tonic syllables are identified, while in Level Four head syllables and tone are transcribed. There are also provisions for an orthographic and a phonemic transcription aligned with a spectrogram and a fundamental frequency (Fo) contour.

The IPA (International Phonetic Alphabet) has a set of symbols for the representation of suprasegmental elements. On the occasion of the Kiel convention in 1989 a working group on Suprasegmental Categories coordinated by Bruce was set up (Bruce, 1988, 1989). It was concluded that additions were needed to represent suprasegmentals within the IPA framework. As far as intonation was concerned, it was noted that the IPA has no specific symbols for the notation of intonation, except for tones. Bruce's conclusions are that:

There exists an apparent need for a direct way of symbolizing intonation in a phonetic transcription. However, the opinions diverge regarding the exact way of transcribing intonation. For a phonological transcription of intonation the symbolization is very much dependent on the language and the analysis. (Bruce, 1989: 36-37)

The full set of symbols used for the transcription of suprasegmental elements in the IPA can be found in IPA (1993) (also available at URL http://www.arts.gla.ac.uk/IPA/ipachart.html).

In the domain of prosodic transcription systems to be used in speech research and in speech technology, ToBI (Tones and Break Indices) was developed to fulfil the need for a prosodic notation system that provides a common core to which different researchers can add detail within the format of the system; it focuses on the structure of American English, but transcribes word grouping and prominences, two aspects which are considered to be rather universal (Price, 1992).

As described by Silverman et al. (1992) the system shows the following features: (1) it captures categories of prosodic phenomena; (2) it allows transcribers to represent some uncertainties in the transcription; (3) it can be adapted to different transcription requirements by using subsets or supersets of the notation system; (4) it has demonstrated high inter-transcriber agreement; (5) it defines ASCII formats for machine-readable representations of the transcription; and (6) it is equipped with software to support transcription using Waves+ and UNIX programmes.

A ToBI transcription for an utterance consists of symbolic labels for events on four parallel tiers: (1) orthographic tier, (2) break-index tier, (3) tone tier and (4) miscellaneous tier. Each tier consists of symbols representing prosodic events, associated with the time at which they occur in the utterance. The conventions for annotation according to ToBI are defined for text-based transcriptions and for computer-based labelling systems such as Waves+.
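The four-tier organisation described above can be pictured as four parallel lists of time-stamped symbols. The following is only a minimal sketch under stated assumptions: the class and field names (`Label`, `ToBITranscription`, `labels_between`), the use of seconds as the time unit, and the example words, times and labels are all invented for illustration and do not reproduce any official ToBI file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Label:
    time: float   # time of the event in seconds (assumed unit)
    symbol: str   # symbol placed on the tier at that time


@dataclass
class ToBITranscription:
    # One list of time-stamped labels per tier named in the text.
    orthographic: List[Label] = field(default_factory=list)
    break_index: List[Label] = field(default_factory=list)
    tone: List[Label] = field(default_factory=list)
    miscellaneous: List[Label] = field(default_factory=list)

    def labels_between(self, start: float, end: float) -> Dict[str, List[str]]:
        """Collect, tier by tier, the symbols whose time lies in [start, end]."""
        tiers = {
            "orthographic": self.orthographic,
            "break_index": self.break_index,
            "tone": self.tone,
            "miscellaneous": self.miscellaneous,
        }
        return {
            name: [lab.symbol for lab in labels if start <= lab.time <= end]
            for name, labels in tiers.items()
        }


# Illustrative utterance; words, times and labels are invented for the example.
utt = ToBITranscription(
    orthographic=[Label(0.35, "good"), Label(0.80, "morning")],
    break_index=[Label(0.35, "1"), Label(0.80, "4")],
    tone=[Label(0.55, "H*"), Label(0.80, "L-L%")],
)

print(utt.labels_between(0.5, 1.0))
```

Keeping each tier as its own list reflects the point that events of different types are stored separately and related only through their time stamps.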

Although the system is primarily intended for English, work using ToBI is being carried out in other languages such as Italian (Grice & Savino, 1995), German (Grice & Benzmueller, 1995) or Hungarian (Grice et al., 1995).

The system is also discussed in the chapter devoted to corpus representation of the EAGLES Handbook on Spoken Language Systems (EAGLES Spoken Language Working Group, 1995), and in Roach & Arnfield (1995).

The Handbook also discusses the analysis of intonation developed at the IPO (Eindhoven, The Netherlands). Rather than a transcription system, the IPO group has developed a full hierarchical theory of intonation based on the modelling of intonation contours (pitch or Fo contours) as stylized representations which are linguistically equivalent to the original contour (see 't Hart et al., 1990 for a complete presentation of the model). A set of basic language-dependent "pitch movements" is proposed, and these are grouped in sequences of "pitch configurations"; "pitch contours" are built up from these configurations, and "intonation patterns" are defined on the basis of grouping similar pitch contours.

As explained in section 5.1.2, SAMPA (SAM Phonetic Alphabet) was developed to cater for the needs of speech technology applications. SAMPA offers symbols for the transcription of prosodic features such as length, word accent, stress, tonal movements, pauses and prosodic boundaries. In a review of the system Gibbon (1989) criticizes the theory-oriented character of the system and its inseparability from the tonetic theory of stress marking. Further work on prosodic transcription within the SAM (Speech Assessment Methodologies) project has led to the development of other systems such as PROSPA, SAMSINT and SAMPROSA, all of them discussed in Gibbon (1989) and briefly summarised below.

PROSPA was developed by Selting and Gibbon (Selting, 1987, 1988) especially to meet the needs of discourse and conversation analysis, but was also discussed within the Prosody Group in the SAM project (Wells et al., 1992). PROSPA is aimed at the high-level broad transcription which is needed for discourse analysis, and therefore the categories used in the transcription are based on auditory criteria.

SAMSINT (SAM System for Intonation Transcription) was proposed by the SAM Prosody Working Group, and was intended to be a computer-readable system for the transcription of intonation contours within defined intonation units. The system is based on INTSINT (see below), incorporating additional facilities and simplifications (Wells et al., 1992).

SAMPROSA (SAM Prosodic Alphabet) was initially proposed by Gibbon (1989), incorporating results from discussions within the SAM Prosody Working Group. The system is intended both for prosodic transcription for linguistic purposes, and for prosodic labelling in speech technology and experimental phonetic research. The system allows the transcription of global, local, terminal and nuclear tones, length, stress, pauses and prosodic boundaries. It is documented in Wells et al. (1992) and the relevant information can be found at URL http://www.phon.ucl.ac.uk/home/sampa/samprosa.htm.

Finally, INTSINT (International Transcription System for Intonation) aims at providing a system for cross-linguistic comparison of prosodic systems. It has been developed by Hirst (1991, 1994; Hirst & Di Cristo, forthcoming), based on a stylization procedure in which the fundamental frequency (Fo), or pitch, contour is built up from interpolation between target points at which significant changes occur. It is thus a system which is closely linked to the phonetic realization of the intonation contour, but at the same time is able to symbolize this contour in terms of a phonological representation. INTSINT therefore aims at the symbolization of pitch levels or prosodic target points, each characterising a point in the fundamental frequency curve.

The Fo modelling is carried out automatically by a program called MOMEL (Hirst & Espesser, 1991) which, after Fo detection, provides a sequence of target points, each with a time value in ms and a frequency value in Hz. Target points can then be automatically coded into INTSINT symbols, once the position of the intonation unit boundaries has been manually introduced.
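To make the idea of a contour rebuilt from target points concrete, the following minimal sketch recovers an Fo value at an arbitrary time from a list of (ms, Hz) pairs. It is not MOMEL: the function name `stylized_f0`, the choice of linear interpolation (the text above does not specify the interpolation function) and the example values are assumptions made purely for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

TargetPoint = Tuple[float, float]  # (time in ms, frequency in Hz)


def stylized_f0(targets: List[TargetPoint], t_ms: float) -> float:
    """Estimate Fo at t_ms by linear interpolation between target points.

    Assumes the targets are sorted by time. Outside the span covered by
    the targets, the value of the nearest target point is returned.
    """
    if not targets:
        raise ValueError("at least one target point is required")
    times = [t for t, _ in targets]
    i = bisect_right(times, t_ms)  # index of the first target after t_ms
    if i == 0:
        return targets[0][1]
    if i == len(targets):
        return targets[-1][1]
    (t0, f0), (t1, f1) = targets[i - 1], targets[i]
    return f0 + (f1 - f0) * (t_ms - t0) / (t1 - t0)


# Invented target points for illustration only.
targets = [(100.0, 120.0), (300.0, 180.0), (600.0, 100.0)]
print(stylized_f0(targets, 200.0))
print(stylized_f0(targets, 450.0))
```

The point of the sketch is that a handful of target points is enough to regenerate a continuous curve, which is what allows the contour to be stored and exchanged as a short symbolic sequence.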

The symbolization of prosodic target points is made by means of arrow symbols corresponding to different pitch levels, either relative or absolute.
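As a rough illustration of relative coding, the sketch below labels each target point by comparing it with the preceding one. It should not be read as the INTSINT coding algorithm: the label names ("higher", "lower", "same"), the one-semitone threshold and the function name `code_relative` are hypothetical choices, and absolute levels are not handled at all.

```python
import math
from typing import List, Tuple

TargetPoint = Tuple[float, float]  # (time in ms, frequency in Hz)


def code_relative(targets: List[TargetPoint], threshold_st: float = 1.0) -> List[str]:
    """Label every target point after the first relative to its predecessor.

    The interval between two points is measured in semitones; intervals
    smaller than threshold_st (a hypothetical value) count as "same".
    """
    labels = []
    for (_, prev_hz), (_, hz) in zip(targets, targets[1:]):
        interval = 12.0 * math.log2(hz / prev_hz)
        if interval > threshold_st:
            labels.append("higher")
        elif interval < -threshold_st:
            labels.append("lower")
        else:
            labels.append("same")
    return labels


# Invented target points for illustration only.
points = [(100.0, 120.0), (300.0, 180.0), (600.0, 100.0), (800.0, 102.0)]
print(code_relative(points))
```

Measuring the interval in semitones rather than Hz is a design choice of this sketch, made so that the same threshold applies across speakers with different pitch ranges.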

The system has already been applied to several languages (see, for example, Hirst et al., 1993) and is being used in the MULTEXT (Multilingual Text Tools and Corpora) project (Hirst et al., 1994; more information on the project is available at URL http://www.lpl.univ-aix.fr/projects/multext/index.html) for the encoding of intonation in the paragraphs contained in the EUROM.1 corpus.

This review shows that the systems developed so far have been designed with different purposes in mind and within different traditions. Nevertheless, it is possible to find some parameters that may help in comparing the external and internal features of each transcription system in order to assess its possible use in corpus linguistics and in speech work. The following dichotomies are suggested (Llisterri, 1994b; available at URL http://www.lpl.univ-aix.fr/projects/multext/CES/CES2.html):

Multi-tiered vs. one-tiered systems
- One-tiered systems include the symbols for the representation of prosodic events within the segmental (orthographic or phonetic/phonemic) transcription, while in multi-tiered systems it is possible to distinguish different layers or levels, separating the segmental transcription from the suprasegmental coding. Examples of one-tiered systems can be found in the domain of discourse and conversation analysis or in the conventions adopted by the TEI and NERC; IPA, SAMPA and its derivations can also be classified within this category. ToBI and INTSINT are very clear examples of multi-tiered systems allowing the separation of different types of events from the segmental transcription. As far as the labelling of speech databases is concerned, the latter systems seem to offer clear advantages.
Machine readable symbols vs. non-machine readable symbols
- Some of the transcription systems reviewed include a mapping between ASCII numbers and transcription symbols (e.g. SAMPA, SAMSINT or SAMPROSA). Other systems, such as those used in discourse analysis and in corpus linguistics, make use of characters which are usually available on computer keyboards; ToBI is another example of this category. It seems that a prosodic coding system aimed at facilitating the exchange of labelled databases should ideally make use of machine-readable symbols.
Systems that can be applied automatically vs. systems that rely on the transcriber's judgment
- The great majority of the systems described depend on the transcriber's judgment, in the sense that the transcriber decides, after an auditory or acoustic analysis of the utterance, which symbol most adequately reflects a given prosodic phenomenon. Only INTSINT can be applied automatically, taking the speech wave as a starting point and producing an abstract representation in a completely automatic way. This is an advantage when large speech databases have to be labelled, since it ensures at least homogeneity of criteria.
Multilingual vs. non-multilingual systems
- Systems such as ToBI or PROSPA have been developed with one language in mind. Others such as SAMPA or SAMSINT address European languages, and IPA and INTSINT have been designed to cover a wider range of languages (both of them contain the term "international" in their names). For the purposes of a multilingual project, it is essential that the coding system should be able to convey prosodic contrasts in a number of languages, and it seems logical to use a system conceived with that purpose.
Theory-driven systems vs. data-driven systems
- Some authors explicitly claim that their system is not model-dependent; this is the case of SAMPA. Other authors, by contrast, provide the theoretical background on which their coding system is based; examples are SAMPROSA, ToBI or INTSINT. In both cases the assumptions behind the system are of a phonological nature, or are based on the author's conception of the phonetics-phonology interface. On the other hand, the theory behind systems used in discourse and conversation analysis is defined by the needs, the practices and the models used in the field, since the events which are coded are those which are known to be relevant in order to explain the discursive or the interactional behaviour of the speakers.
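Some of the classifications stated in the list above can be recorded in a small lookup table and queried when choosing a system for a given project. This is only a sketch: the property names (`multi_tiered`, `coverage`, `automatic`) and the helper `select` are hypothetical, and a property is left out wherever the text does not classify a system explicitly.

```python
from typing import Dict, List

# Properties as stated in the comparison above; missing keys mean the text
# does not classify that system on that dimension.
SYSTEMS: Dict[str, Dict[str, object]] = {
    "TEI": {"multi_tiered": False},
    "NERC": {"multi_tiered": False},
    "IPA": {"multi_tiered": False, "coverage": "wide"},
    "SAMPA": {"multi_tiered": False, "coverage": "European"},
    "SAMSINT": {"coverage": "European"},
    "PROSPA": {"coverage": "single language"},
    "ToBI": {"multi_tiered": True, "coverage": "single language"},
    "INTSINT": {"multi_tiered": True, "coverage": "wide", "automatic": True},
}

# The text says that only INTSINT can be applied automatically.
for props in SYSTEMS.values():
    props.setdefault("automatic", False)


def select(**criteria: object) -> List[str]:
    """Return the systems whose recorded properties match every criterion."""
    return [
        name
        for name, props in SYSTEMS.items()
        if all(props.get(key) == value for key, value in criteria.items())
    ]


print(select(multi_tiered=True, coverage="wide"))
```

A system with no recorded value for a requested property is simply not returned, which keeps the table from asserting more than the comparison itself does.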
