The SUSANNE Corpus (Surface And Underlying Structural ANalyses of Natural English) is a 130,000-word corpus annotated with grammatical tags and with both surface and logical grammatical structure. The corpus consists of 64 of the 500 texts in the Brown Corpus, and the annotated version is available from the Oxford Text Archive.
The aim of the project was to provide a publicly available standard for grammatical analysis. The scheme is fully explicit and is described in detail in Sampson (1995).
The corpus was manually annotated by a team of linguists and computer scientists.
The syntactic labelling is carried out on three levels:
The surface grammar identifies the constituents in a clause and assigns labels to them. The scheme used at this level is very similar to the Lancaster scheme, although it includes more detail. This detail consists mainly of further subcategorisation of phrases: for example, a noun phrase may be marked as singular or plural, common or proper. Grammatical function is also included at this level: different tags are used for a noun phrase marked as subject or non-subject.
The function tags used in the SUSANNE annotation are shown in Table 3.9.
| Tag | Function |
|---|---|
| :o | logical direct object |
| :i | logical indirect object |
| :e | predicate complement of subject |
| :j | predicate complement of object |
| :a | agent of passive |
| :S | surface but not logical subject |
| :O | surface but not logical object |
| :h | manner or degree |
| :n | particle of phrasal verb |
| :x | propositional relative clause |
| :z | complement of catenative |
Ghost nodes (extra nodes dominating no wording) are added to parse trees to show the logical position of elements that have been moved or deleted in the surface structure. Function tags are added to node labels to mark the logical structure, and indices are added to link nodes that are grammatical counterparts, such as a ghost and the corresponding full surface constituent. Ghost nodes are comparable to the traces used in the UPenn Treebank. Other grammatical constructions marked include the following:
Table 3.10 provides an example from the SUSANNE corpus (Sampson, 1995), which illustrates the various aligned types of information that can be recorded. The columns (i.e. fields) contain the following information:
The following example illustrates the use of a ghost node:
[Nns:s123 John ] wanted [Ti:o s123 to go ]
In this example, `:s' is the Subject function tag and `:o' the Object tag; `Ti' stands for `infinitival clause'. The `:o' tag on `Ti' indicates that the infinitival clause `to go' is the Object of `wanted'. The `s123' ghost node marks the logical position of the surface Subject `John' inside the infinitival clause. The number `123' is an index that links the ghost to its surface realisation.
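The way labels, function tags and indices combine in the example can be made concrete with a short script. The following Python sketch is illustrative only: it handles just the simplified inline bracket notation used above, not the column-based format of the distributed corpus files. The regular expressions and the names `parse_label` and `link_ghosts` are assumptions introduced here, not part of the SUSANNE scheme or of any existing tool.

```python
import re

# Assumed label shape: form tag, optional ":" plus a one-letter function tag,
# optional numeric index, e.g. "Nns:s123" or "Ti:o".
LABEL = re.compile(r"^(?P<form>[A-Za-z]+)(?::(?P<func>[A-Za-z]))?(?P<index>\d+)?$")

# Assumed ghost shape: a bare function letter followed by an index, e.g. "s123".
GHOST = re.compile(r"^(?P<func>[A-Za-z])(?P<index>\d+)$")


def parse_label(label):
    """Split a node label into (form tag, function tag, index)."""
    match = LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognised label: {label!r}")
    return match.group("form"), match.group("func"), match.group("index")


def link_ghosts(bracketed):
    """Pair each ghost node with the wording of the constituent sharing its index."""
    surface = {}  # index -> wording of the indexed surface constituent
    ghosts = []   # (index, function tag, form tag of the enclosing constituent)
    stack = []    # currently open constituents: (form, func, index, words)

    for token in bracketed.split():
        if token.startswith("["):
            form, func, index = parse_label(token[1:])
            stack.append((form, func, index, []))
        elif token == "]":
            form, func, index, words = stack.pop()
            if index is not None:
                surface[index] = " ".join(words)
        elif stack and GHOST.match(token):
            ghost = GHOST.match(token)
            ghosts.append((ghost.group("index"), ghost.group("func"), stack[-1][0]))
        else:
            # An ordinary word belongs to every constituent that is still open.
            for frame in stack:
                frame[3].append(token)

    # Resolve ghosts after the full pass, so a ghost may precede its counterpart.
    return [
        {"index": index, "function": func, "inside": form,
         "antecedent": surface.get(index)}
        for index, func, form in ghosts
    ]


if __name__ == "__main__":
    print(parse_label("Nns:s123"))
    print(link_ghosts("[Nns:s123 John ] wanted [Ti:o s123 to go ]"))
```

Applied to the example, the sketch reports one ghost with function `s` and index `123` inside the `Ti` clause, and resolves it to the surface constituent `John`. Ghosts are resolved only after the whole string has been read, because a ghost could in principle appear before the constituent it refers to.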