
The tagged reference corpora

Italian and German newspaper texts have been morphosyntactically annotated according to the EAGLES specifications for Italian (EAGLES, 1996d) and German (EAGLES, 1996a). Table 7 summarises the distribution of texts in the German corpus; the Italian corpus is structured in a very similar way.

 

Economy         ca. 17,000 wordforms
Politics        ca. 14,000 wordforms
Culture         ca. 18,000 wordforms
Sports          ca.  9,000 wordforms
Local Events    ca.  8,500 wordforms
Total           ca. 66,500 wordforms

Table 7: EAGLES/ELSNET reference corpus for German

The texts were prepared as follows:

The material will be made available in the following forms:

Level (b) -- STTS tags (DE)
Level (b) -- ELM-DE attribute/value pair annotation (DE)
Level (c) -- ELM-DE attribute/value pair annotation (DE)
Level (c) -- ILC-Pisa tagset (IT)
Level (c) -- ELM-IT attribute/value pair annotation (IT)

The following is a sample of the German text, in fully-fledged ELM feature structure annotation (cf. level (c) above):

           Mexiko    [ pos=noun & type=prop ]
                :    [ pos=resid & type=punct & punct-t=c-final ]
              Die    [ pos=art ]
                "    [ pos=resid & type=punct & punct-t=non-c-final ]
Praesidentenmache    [ pos=noun & type=com ]
                "    [ pos=resid & type=punct & punct-t=non-c-final ]
                .    [ pos=resid & type=punct & punct-t=c-final ]
     Mexikanische    [ pos=adj & use=attr ]
          Politik    [ pos=noun & type=com ]
              ist    [ pos=verb & type=aux & fin=fin & ( vm-f=ind | vm-f=konj ) ]
         Almachie    [ pos=noun & type=com ]
              und    [ pos=conj & type=coord ]
        Tradition    [ pos=noun & type=com ]
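
For readers who want to process annotation lines like those in the sample, the following is a minimal Python sketch of a reader for this notation. It is not part of the EAGLES/ELSNET material: the function name `parse_annotation` and the choice to store each attribute as a list of alternative values are assumptions made for illustration. The sketch covers only the constructs visible in the sample, namely attribute=value pairs joined by `&` and a parenthesised group of alternatives joined by `|`.

```python
import re

# One annotated token per line: a wordform followed by a bracketed
# feature specification, as in the sample above.
LINE_RE = re.compile(r"^\s*(\S+)\s*\[(.*)\]\s*$")


def parse_annotation(line):
    """Split one annotation line into (wordform, features).

    `features` maps each attribute to the list of values allowed for it;
    a disjunction such as ( vm-f=ind | vm-f=konj ) yields two values.
    Assumption: disjunctions never contain '&', as in the sample.
    """
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"not an annotation line: {line!r}")
    wordform, body = match.group(1), match.group(2)

    features = {}
    for conjunct in body.split("&"):
        # Drop the parentheses around a disjunctive group, if any.
        conjunct = conjunct.strip().strip("()").strip()
        for alternative in conjunct.split("|"):
            attribute, _, value = alternative.strip().partition("=")
            features.setdefault(attribute, []).append(value)
    return wordform, features
```

Applied to the line for "ist" in the sample, this sketch returns the wordform together with one value each for `pos`, `type` and `fin`, and the two alternatives `ind` and `konj` for `vm-f`. Keeping the alternatives in a list preserves the underspecification in the annotation rather than forcing a choice between them.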


