
Simplicity

The default value of Simplicity is plain text. This means that the user can expect an unbroken string of ASCII characters, with any mark-up clearly identified and separable from the text. Nowadays it is likely that many texts will be in SGML format, and in the future perhaps TEI. These mark-ups have been carefully designed and do not impose any additional linguistic information on the text. Their main role in text representation is to preserve, in linear coding, some features which would otherwise be lost. Such mark-up is perceived as helpful, but its presence must be recorded, and the original text must be easily retrievable.
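
The requirement that mark-up be separable from the text, with its presence recorded, can be illustrated by a minimal sketch. It assumes that every tag is a simple angle-bracketed string and that the text itself contains no literal "<"; it is not an SGML parser, and the name split_markup and the sample string are hypothetical.

```python
import re

# Assumption: a tag is anything between "<" and the next ">".
# Real SGML allows constructs that this pattern does not handle.
TAG = re.compile(r"<[^>]+>")

def split_markup(marked_up):
    """Return (plain_text, tags); tags is a list of (offset, tag) pairs,
    where offset is the position in the plain text at which the tag stood."""
    plain = []
    tags = []
    pos = 0      # current position in the marked-up input
    offset = 0   # current position in the plain text being built
    for m in TAG.finditer(marked_up):
        chunk = marked_up[pos:m.start()]
        plain.append(chunk)
        offset += len(chunk)
        tags.append((offset, m.group()))
        pos = m.end()
    plain.append(marked_up[pos:])
    return "".join(plain), tags

sample = "<p>Plain <hi>text</hi> first.</p>"
plain_text, tags = split_markup(sample)
print(plain_text)
print(tags)
```

Keeping the offsets alongside the tags is one way of recording the presence of the mark-up while leaving the plain text untouched.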

The same conventions for mark-up can be extended to the various annotations that add information provided by expert linguistic analysis. Such information is the organisation and interpretation of textual features, and it varies from analyst to analyst and from purpose to purpose. Other sections of this report deal with types of analytic annotation. A plain text policy is not opposed to such annotation, nor to the use of the same mark-up conventions. For clarity in the future it would be helpful to distinguish between added codes which encode only surface features of texts that would otherwise be lost in transfer to a machine, and added codes which encode analyses and interpretations. This distinction would have to be made carefully for spoken transcription, because it can be claimed that the orthographic transcriber adds analytic notation of a kind. That notation, however, is so conventional and familiar that people can treat it as a sophisticated mark-up, quite distinct from intonation annotations or grammatical tags (see 4 below).

More difficult is the question of annotated corpora. It is proposed that this term be used for any corpus which includes codes that record extra information (provenance, analytical marks, etc.). Again, the annotations should be separable from the plain text in a simple and agreed fashion. A set of conventions for removing, restoring and manipulating annotations is necessary, especially as the next few years will see a large growth in the provision of annotated corpora. It is naive to expect that big corpora will remain easy to manage if they are full of various annotations; retrieval times are already critical.
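
One possible shape for such conventions is sketched below, purely as an illustration. It assumes that annotations are stored apart from the plain text as (offset, code, kind) records. The record layout, the kind labels "surface" and "analysis", the tag strings and the function name restore are all hypothetical and are not taken from any agreed standard.

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    offset: int  # position in the plain text where the code is inserted
    code: str    # the inserted code itself
    kind: str    # hypothetical label: "surface" or "analysis"

def restore(plain, annotations, kinds=("surface", "analysis")):
    """Rebuild an annotated text, inserting only codes of the given kinds."""
    out = []
    pos = 0
    # sorted() is stable, so codes sharing an offset keep their listed order.
    for ann in sorted(annotations, key=lambda a: a.offset):
        if ann.kind not in kinds:
            continue
        out.append(plain[pos:ann.offset])
        out.append(ann.code)
        pos = ann.offset
    out.append(plain[pos:])
    return "".join(out)

plain = "the cat sat"
annotations = [
    Annotation(0, "<s>", "surface"),
    Annotation(3, "_DET", "analysis"),
    Annotation(7, "_NOUN", "analysis"),
    Annotation(11, "_VERB", "analysis"),
    Annotation(11, "</s>", "surface"),
]

full = restore(plain, annotations)                           # all codes
surface_only = restore(plain, annotations, kinds=("surface",))
bare = restore(plain, annotations, kinds=())                 # plain text
print(full)
print(surface_only)
print(bare)
```

Because the plain text is never altered, removing annotations amounts to not inserting them, and the same stored records can serve users who want only surface codes as well as those who want the analytic ones.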


