TOSCA annotation scheme

Preliminary Recommendations

TOSCA (Tools for Syntactic Corpus Analysis) is an annotation project developed at the Katholieke Universiteit Nijmegen in the Netherlands. Its main aim is to produce resources for linguistic research on syntax and language use.

Introduction

The TOSCA annotation scheme has been used in the analysis of the Nijmegen corpus and of the TOSCA corpus, both of which consist mainly of written language. It is also being used for parts of the ICE corpus, which includes spoken language as well.

Method of annotation

The TOSCA annotation scheme is applied interactively, with the work divided between linguist and computer. The computer produces all possible analyses, and the linguist selects the correct one. Part-of-speech tagging and the addition of syntactic labels take place as part of one process, in which the chosen sequence of word class tags is used as input to the parser. The parser is generated automatically from a formal grammar written in the AGFL formalism (Affix Grammar over Finite Lattices).
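The selection step can be pictured with a minimal sketch. It assumes a hypothetical lexicon of candidate word class tags and a hypothetical `choose` callback standing in for the linguist. The names `CANDIDATES`, `candidate_sequences` and `select_sequence` are illustrative only and do not reflect the interface of the actual TOSCA tools or of the AGFL parser.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

TagSequence = Tuple[str, ...]

# Hypothetical lexicon: each word maps to its candidate word class tags.
# The ambiguities listed here are invented for illustration.
CANDIDATES: Dict[str, List[str]] = {
    "he": ["PN"],
    "walked": ["LV"],
    "in": ["PREP", "ADV"],
    "the": ["ART"],
    "garden": ["N", "LV"],
}


def candidate_sequences(words: Sequence[str],
                        lexicon: Dict[str, List[str]]) -> List[TagSequence]:
    """Enumerate every combination of candidate tags (the computer's part)."""
    options = [lexicon[word.lower()] for word in words]
    return list(product(*options))


def select_sequence(sequences: Sequence[TagSequence],
                    choose: Callable[[Sequence[TagSequence]], int]) -> TagSequence:
    """Return the sequence picked by the chooser (the linguist's part)."""
    index = choose(sequences)
    if not 0 <= index < len(sequences):
        raise IndexError("chooser returned an index outside the candidate list")
    return sequences[index]


words = ["He", "walked", "in", "the", "garden"]
sequences = candidate_sequences(words, CANDIDATES)
# Stand-in for the interactive choice: always take the first candidate.
chosen = select_sequence(sequences, lambda candidates: 0)
```

In this sketch the chosen tag sequence would then be handed to the parser, which is outside the scope of the example.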

Syntactic tags

In the TOSCA annotation scheme, constituents are labelled for both function and category, while additional syntactico-semantic information is carried by various attributes.

The three major units of description are the word, the phrase and the clause/sentence. The non-lexical category labels are shown in table 3.8:

**Table 3.8:** TOSCA tags
AJP	Adjective phrase
AVP	Adverb phrase
CL	Clause
CLOID	Clausoid
CONJ	Conjunct
COORD	Coordination
DAJP	Discontinuous adjective phrase
DTP	Determiner phrase
NP	Noun phrase
NPAP	Appositive noun phrase
PP	Prepositional phrase
QTAG	Question tag
S	Sentence
SUBP	Subordinator phrase
TXTU	Textual unit
VP	Verb phrase

In addition to these category labels, the annotation uses over 90 function labels and over 100 attribute labels. The function labels identify such phenomena as:

Heads and pre/post modifiers within phrases;
Subject/object/indirect object;
Notional subject/notional object;
Raised subject complement/raised direct object;
Transitive complement/focus complement;
Subject/object complements;
Auxiliary verbs, operators, interrogative operators and main verbs;
Adverbials.

The attribute labels encode distinctions such as the following:

Active/passive voice;
Transitive/ditransitive/monotransitive/dimonotransitive;
Ellipsis/enclitic/proclitic;
Attributive/predicative;
Indicative/subjunctive/progressive/present/past;
Comparative/superlative/intensifying;
Declarative/interrogative/exclamatory/intensive/negative;
Raised subject complement/raised direct object.

In addition to the syntactico-semantic labels described above, the annotation scheme contains a number of labels for extra textual material such as speaker changes, headings and pauses.

Examples

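The following example shows how the simple sentence "He walked in the garden." is analysed with the TOSCA scheme. On each line, the first label in capitals is the function label (e.g. SU, subject) and the second is the category label (e.g. NP, noun phrase). The lower-case tags in round brackets are the attribute labels, and the word itself, where there is one, follows in curly brackets. Indentation shows constituent structure: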

   NOFU, TXTU ()     

       UTT, S (act, decl, indic, intr, past, unm)    

         SU, NP ()     

            NPHD, PN (pers) {He}     

         V, VP (act, indic, intr, past)     

           MVB, LV (indic, intr, past) {walked}     

         A, PP ()     

            P, PREP () {in}     

            PC, NP ()     

               DT, DTP ()     

                  DTCE, ART (def) {the}     

               NPHD, N (com, sing) {garden} 

       PUNC, PM (per) {.}
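Each line of the analysis follows the same pattern, so it can be split into its parts mechanically. The sketch below is one possible reading of that layout, not part of the TOSCA tools: the `Node` type, the `parse_node` function and the regular expression are assumptions made for illustration. The first capitalised label is read as the function and the second as the category, as in SU, NP for a subject noun phrase.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Node(NamedTuple):
    function: str                # e.g. "SU"
    category: str                # e.g. "NP"
    attributes: Tuple[str, ...]  # e.g. ("com", "sing")
    word: Optional[str]          # the token in braces, or None for non-terminals


# Assumed line layout: FUNCTION, CATEGORY (attr, attr, ...) {word}
NODE_RE = re.compile(
    r"^\s*(?P<function>[A-Z]+)\s*,\s*(?P<category>[A-Z]+)\s*"
    r"\((?P<attrs>[^)]*)\)\s*"
    r"(?:\{(?P<word>[^}]*)\})?\s*$"
)


def parse_node(line: str) -> Node:
    """Split one annotation line into function, category, attributes and word."""
    match = NODE_RE.match(line)
    if match is None:
        raise ValueError(f"not a recognised node line: {line!r}")
    attributes = tuple(
        attr.strip() for attr in match.group("attrs").split(",") if attr.strip()
    )
    return Node(match.group("function"), match.group("category"),
                attributes, match.group("word"))


verb = parse_node("           MVB, LV (indic, intr, past) {walked}     ")
subject = parse_node("         SU, NP ()     ")
```

Here `verb.word` is `"walked"` and `subject.word` is `None`, since the subject node is a phrase rather than a single token.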

A more complex example is shown below:

   NOFU,TXTU()

    UTT,S(decl,indic,intr,pass,pres,unm)

     SU,NP()

      DT,DTP()

       DTCE,ART(indef) {An}

      NPPR,AJP(attru)

       AJHD,ADJ(attru) {alternative}

      NPHD,N(com,sing) {pathway}

     NOFU,COORD(decl,indic,intr,pass,pres)

      CJ,CONJ(decl,indic,intr,pass,pres)

       V,VP(indic,intr,mod,pass,pres)

        OP,AUX(indic,mod,pres) {may}

        AVB,AUX(indic,infin,pass) {be}

        MVB,LV(indic,motr,pastp) {deprived}

       A,PP()

        P,PREP() {of}

        PC,NP()

         NPHD,PN(pers,sing) {it}

      COOR,CONJN(coord) {and}

      CJ,CONJ(decl,indic,intr,pass,pres)

       V,VP(indic,intr,mod,pass,pres)

        A,CON() {hence}

        AVB,AUX(indic,infin,pass) {be}

        MVB,LV(indic,motr,pastp) {controlled}

       A,AVP(excl)

        AVHD,ADV(excl) {simply}

       A,PP()

        P,PREP() {by}

        PC,NP()

         NPHD,N(com,sing) {limitation}

         NPPO,PP()

          P,PREP() {of}

          PC,NP()

           NPHD,N(com,sing) {substrate}

     A,CL(act,indic,intens,pres,sub,unm)

      SUB,SUBP()

       SBHD,CONJN(subord) {when}

      SU,NP()

       NPHD,N(com,plu) {demands}

       NPPO,PP()

        P,PREP() {on}

        PC,NP()

         DT,DTP()

          DTCE,ART(def) {the}

         NPPR,AJP(attru)

          AJHD,ADJ(attru) {main}

         NPHD,N(com,sing) {pathway}

      V,VP(act,indic,intens,pres)

       MVB,LV(indic,intens,pres) {are}

      CS,AJP(prd)

       AJHD,ADJ(prd) {heavy}

    PUNC,PM(per) {.}
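In both examples the constituent structure is carried by indentation alone: a node is a child of the nearest preceding line that is indented less. The following sketch rebuilds a tree on that assumption, using a compact copy of the first example as input. The `TreeNode` class and the `build_tree` and `leaves` functions are hypothetical helpers written for this illustration, not part of the TOSCA software.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class TreeNode:
    function: str
    category: str
    attributes: List[str]
    word: Optional[str] = None
    children: List["TreeNode"] = field(default_factory=list)


# Assumed line layout: indentation, FUNCTION, CATEGORY (attributes) {word}
LINE_RE = re.compile(
    r"^(\s*)([A-Z]+)\s*,\s*([A-Z]+)\s*\(([^)]*)\)\s*(?:\{([^}]*)\})?\s*$"
)


def build_tree(text: str) -> TreeNode:
    """Rebuild the tree by attaching each node to the nearest shallower line."""
    root: Optional[TreeNode] = None
    stack: List[Tuple[int, TreeNode]] = []  # (indent, node) for open ancestors
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines between nodes carry no structure
        match = LINE_RE.match(raw)
        if match is None:
            raise ValueError(f"not a recognised node line: {raw!r}")
        indent = len(match.group(1))
        attrs = [a.strip() for a in match.group(4).split(",") if a.strip()]
        node = TreeNode(match.group(2), match.group(3), attrs, match.group(5))
        # Close every open node that is not an ancestor of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if stack:
            stack[-1][1].children.append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("more than one top-level node")
        stack.append((indent, node))
    if root is None:
        raise ValueError("no node lines found")
    return root


def leaves(node: TreeNode) -> Iterator[str]:
    """Yield the words of the tree in left-to-right order."""
    if node.word is not None:
        yield node.word
    for child in node.children:
        yield from leaves(child)


SAMPLE = """\
NOFU, TXTU ()
 UTT, S (act, decl, indic, intr, past, unm)
  SU, NP ()
   NPHD, PN (pers) {He}
  V, VP (act, indic, intr, past)
   MVB, LV (indic, intr, past) {walked}
  A, PP ()
   P, PREP () {in}
   PC, NP ()
    DT, DTP ()
     DTCE, ART (def) {the}
    NPHD, N (com, sing) {garden}
 PUNC, PM (per) {.}
"""

tree = build_tree(SAMPLE)
words = list(leaves(tree))
```

Because only relative indentation matters, the same function accepts both the widely spaced layout of the first example and the one-space steps of the second.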
