Written data, spoken data and speech data

A language corpus can comprise written data only, spoken data only, or both written and spoken data.

Spoken data can be collected for a variety of R&D tasks. We distinguish here between two different approaches:

  1. The approach of the R&D speech community: the central issue is the speech signal itself and its acoustic and articulatory properties. The symbolic representation is usually made by means of a phonetic alphabet. We refer to data of this type as speech data, and we call deliberately designed collections of speech data speech corpora or speech databases.
  2. The approach of corpus linguists and the NLP community: the central issue is the use of language and the analysis of the data at various linguistic levels of description, e.g. lexical, morphosyntactic and syntactic. The representation is usually made by means of a phonological alphabet or an enriched orthographic system. We use the term spoken corpora to refer to collections of this type.

The document on spoken texts (EAGLES, 1996f) discusses in detail the differences and similarities between the two approaches.

In EAGLES, speech corpora are dealt with by the Spoken Language Systems Working Group (SLWG).

The EAGLES Text Corpora Working Group (TCWG) aims to provide guidelines for encoding both written and spoken data, since large language corpora usually comprise both types of data.

However, since spoken data increasingly represent a point of convergence between the interests of the NLP and speech communities, the provision of guidelines for the encoding of spoken data has been assigned to a subgroup for spoken corpora, formed by members of the two Working Groups (TCWG and SLWG).