Next: SUSANNE annotation scheme Up: Syntactically annotated corpora Previous: Lancaster/IBM Treebank and the

Preliminary Recommendations

TOSCA annotation scheme

TOSCA (Tools for Syntactic Corpus Analysis) is an annotation project developed at the Katholieke Universiteit at Nijmegen, the Netherlands. The main aim of the project is the production of resources for linguistic research in the areas of syntax and language use.

Introduction

The TOSCA annotation scheme has been used in the analysis of the Nijmegen corpus and of the TOSCA corpus, both of which consist mainly of written language. It is also being used for parts of the ICE corpus, which includes spoken language as well.

Method of annotation

The TOSCA annotation scheme is applied interactively by linguist and computer: the computer produces all possible analyses, and the linguist selects the correct one. Part-of-speech tagging and the addition of syntactic labels form a single process, in which the chosen sequence of word-class tags serves as input to the parser. The parser is generated automatically from a formal grammar written in the AGFL formalism (Affix Grammars over a Finite Lattice).

Syntactic tags

In the TOSCA annotation scheme, constituents are labelled for both function and category, while additional syntactico-semantic information is carried by various attributes.

The three major units of description are the word, the phrase and the clause/sentence. The non-lexical category labels are shown in Table 3.8:

 

AJP Adjective phrase
AVP Adverb phrase
CL Clause
CLOID Clausoid
CONJ Conjunct
COORD Coordination
DAJP Discontinuous adjective phrase
DTP Determiner phrase
NP Noun phrase
NPAP Appositive noun phrase
PP Prepositional phrase
QTAG Question tag
S Sentence
SUBP Subordinator phrase
TXTU Textual unit
VP Verb phrase
Table 3.8: TOSCA tags 
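For anyone processing TOSCA-style output by program, the category labels of Table 3.8 can be held in a simple lookup table. The following is only a minimal sketch: the labels and glosses are copied from the table, while the names TOSCA_CATEGORIES and describe_category are hypothetical and do not belong to any TOSCA tool.

```python
# Non-lexical category labels and glosses, copied from Table 3.8.
# The names below are illustrative assumptions, not part of TOSCA itself.
TOSCA_CATEGORIES = {
    "AJP": "Adjective phrase",
    "AVP": "Adverb phrase",
    "CL": "Clause",
    "CLOID": "Clausoid",
    "CONJ": "Conjunct",
    "COORD": "Coordination",
    "DAJP": "Discontinuous adjective phrase",
    "DTP": "Determiner phrase",
    "NP": "Noun phrase",
    "NPAP": "Appositive noun phrase",
    "PP": "Prepositional phrase",
    "QTAG": "Question tag",
    "S": "Sentence",
    "SUBP": "Subordinator phrase",
    "TXTU": "Textual unit",
    "VP": "Verb phrase",
}


def describe_category(label: str) -> str:
    """Return the gloss for a non-lexical category label.

    Labels that are not in Table 3.8 (for example word-class labels)
    get a neutral fallback string instead of raising an error.
    """
    return TOSCA_CATEGORIES.get(label, "not a Table 3.8 category")
```

Word-class labels are deliberately left out, since the table covers only the non-lexical categories.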

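In addition to these category labels, the annotation uses over 90 function labels and over 100 attribute labels. The function labels identify the role that a constituent plays within the larger structure; those that occur in the examples below include SU, V, A, NPHD and PC.

The attribute labels record further properties of a constituent; those that occur in the examples below include act, decl, indic, past, com and sing.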

In addition to the syntactico-semantic labels mentioned above, the annotation scheme contains a number of labels for extra-textual material such as speaker changes, headings and pauses.

Examples

The following example shows how the simple sentence "He walked in the garden" is analysed with the TOSCA scheme. On each line, the first label in capitals is the function label and the second is the category label; the lower-case labels in parentheses are the attribute labels, and the word itself, where the line carries one, appears in braces:

   NOFU, TXTU ()
      UTT, S (act, decl, indic, intr, past, unm)
         SU, NP ()
            NPHD, PN (pers) {He}
         V, VP (act, indic, intr, past)
            MVB, LV (indic, intr, past) {walked}
         A, PP ()
            P, PREP () {in}
            PC, NP ()
               DT, DTP ()
                  DTCE, ART (def) {the}
               NPHD, N (com, sing) {garden}
      PUNC, PM (per) {.}
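An analysis laid out in this way can be read back into a tree by program. The sketch below is not part of the TOSCA software; it assumes only the layout seen in the example (one node per line, nesting shown by indentation, attributes in parentheses, the word in braces), and the names Node, NODE_RE and parse_tree are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# One node per line: FUNCTION, CATEGORY (attr, ...) {word}
# This pattern is an assumption based on the printed example only.
NODE_RE = re.compile(
    r"^(?P<indent>[ \t]*)"
    r"(?P<function>[A-Z]+)[ \t]*,[ \t]*(?P<category>[A-Z]+)[ \t]*"
    r"\((?P<attrs>[^)]*)\)[ \t]*"
    r"(?:\{(?P<word>[^}]*)\})?[ \t]*$"
)


@dataclass
class Node:
    function: str
    category: str
    attributes: List[str]
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def parse_tree(text: str) -> Node:
    """Build a tree from an indented analysis; deeper indent means child."""
    stack = []  # (indent, node) pairs for the currently open ancestors
    root = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no information
        m = NODE_RE.match(line)
        if m is None:
            raise ValueError(f"unrecognised line: {line!r}")
        indent = len(m.group("indent"))
        attrs = [a.strip() for a in m.group("attrs").split(",") if a.strip()]
        node = Node(m.group("function"), m.group("category"), attrs,
                    m.group("word"))
        # Close every open node that is not an ancestor of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if stack:
            stack[-1][1].children.append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("more than one top-level node")
        stack.append((indent, node))
    if root is None:
        raise ValueError("no nodes found")
    return root


EXAMPLE = """
   NOFU, TXTU ()
      UTT, S (act, decl, indic, intr, past, unm)
         SU, NP ()
            NPHD, PN (pers) {He}
         V, VP (act, indic, intr, past)
            MVB, LV (indic, intr, past) {walked}
         A, PP ()
            P, PREP () {in}
            PC, NP ()
               DT, DTP ()
                  DTCE, ART (def) {the}
               NPHD, N (com, sing) {garden}
      PUNC, PM (per) {.}
"""
```

Calling parse_tree(EXAMPLE) returns the TXTU node, whose two children are the sentence (UTT, S) and the full stop (PUNC, PM). Only relative indentation matters, so the exact number of spaces per level is not significant.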

A more complex example is shown below:

   NOFU,TXTU()
    UTT,S(decl,indic,intr,pass,pres,unm)
     SU,NP()
      DT,DTP()
       DTCE,ART(indef) {An}
      NPPR,AJP(attru)
       AJHD,ADJ(attru) {alternative}
      NPHD,N(com,sing) {pathway}
     NOFU,COORD(decl,indic,intr,pass,pres)
      CJ,CONJ(decl,indic,intr,pass,pres)
       V,VP(indic,intr,mod,pass,pres)
        OP,AUX(indic,mod,pres) {may}
        AVB,AUX(indic,infin,pass) {be}
        MVB,LV(indic,motr,pastp) {deprived}
       A,PP()
        P,PREP() {of}
        PC,NP()
         NPHD,PN(pers,sing) {it}
      COOR,CONJN(coord) {and}
      CJ,CONJ(decl,indic,intr,pass,pres)
       V,VP(indic,intr,mod,pass,pres)
        A,CON() {hence}
        AVB,AUX(indic,infin,pass) {be}
        MVB,LV(indic,motr,pastp) {controlled}
       A,AVP(excl)
        AVHD,ADV(excl) {simply}
       A,PP()
        P,PREP() {by}
        PC,NP()
         NPHD,N(com,sing) {limitation}
         NPPO,PP()
          P,PREP() {of}
          PC,NP()
           NPHD,N(com,sing) {substrate}
     A,CL(act,indic,intens,pres,sub,unm)
      SUB,SUBP()
       SBHD,CONJN(subord) {when}
      SU,NP()
       NPHD,N(com,plu) {demands}
       NPPO,PP()
        P,PREP() {on}
        PC,NP()
         DT,DTP()
          DTCE,ART(def) {the}
         NPPR,AJP(attru)
          AJHD,ADJ(attru) {main}
         NPHD,N(com,sing) {pathway}
      V,VP(act,indic,intens,pres)
       MVB,LV(indic,intens,pres) {are}
      CS,AJP(prd)
       AJHD,ADJ(prd) {heavy}
    PUNC,PM(per) {.}
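In a longer analysis such as this one it can be useful to recover the running text from the word-bearing lines. The following sketch assumes only that each word appears in braces at the end of its line, as in the examples; the names LEAF_RE, leaves and surface_text are hypothetical, and the fragment used is the subject noun phrase of the analysis above.

```python
import re
from typing import List, Tuple

# A word-bearing line: FUNCTION,CATEGORY(attributes) {word}
# Lines without braces (phrases and clauses) are skipped by this pattern.
LEAF_RE = re.compile(
    r"([A-Z]+)[ \t]*,[ \t]*([A-Z]+)[ \t]*\(([^)\n]*)\)[ \t]*\{([^}\n]*)\}"
)


def leaves(analysis: str) -> List[Tuple[str, str, str]]:
    """Return (function, category, word) for every word-bearing line."""
    return [(m.group(1), m.group(2), m.group(4))
            for m in LEAF_RE.finditer(analysis)]


def surface_text(analysis: str) -> str:
    """Join the words in their order of appearance, separated by spaces."""
    return " ".join(word for _, _, word in leaves(analysis))


# Subject noun phrase taken from the analysis above.
FRAGMENT = """
     SU,NP()
      DT,DTP()
       DTCE,ART(indef) {An}
      NPPR,AJP(attru)
       AJHD,ADJ(attru) {alternative}
      NPHD,N(com,sing) {pathway}
"""
```

Here surface_text(FRAGMENT) gives "An alternative pathway". Punctuation marks are returned as separate tokens, so a full analysis yields a space before the final full stop.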

