
The tagset mapping exercise

 

The tagset mapping exercise was carried out with the following objectives:

Mapping rules cover 1:1, 1:n, n:1 and n:m correspondences between tags. A small-scale exception lexicon was created for idiosyncratic cases (e.g. `that' tagged as `IN' (= preposition) in the UPenn tagset).

The following are a few sample mapping rules for the BNC:

  cqp_name(upenn, 'BNC').

  [pos = 'AJ0']   =>      [adj & po].
  [pos = 'AJC']   =>      [adj & comp].
  [pos = 'AJS']   =>      [adj & sup].
  [pos = 'AT0']   =>      [art].
  [pos = 'AV0']   =>      [adv & (general '|' degree)].
  [pos = 'CJC']   =>      [conj & coord].
  [pos = 'CJS']   =>      [conj & subord].
  [pos = 'CRD']   =>      [numeral & card].
  [pos = 'DT0']   =>      [det & (indf '|' dem) '|' pron & dem].
  [pos = 'DTQ']   =>      [det & wh].
  [pos = 'EX0']   =>      [unique & existential].
  [pos = 'NN0']   =>      [noun & com].
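To make the disjunctive right-hand sides concrete, the following is a minimal Python sketch, not part of the mapping tool itself, that expands a rule body into its plain feature bundles. It assumes that `&` binds more tightly than `'|'`, so that the DT0 rule reads as (det & (indf | dem)) | (pron & dem). The names `RULES`, `expand` and `cardinality` are hypothetical, and labelling a rule 1:1 or 1:n by counting bundles is only an approximation of the mapping cases mentioned above.

```python
from itertools import product

# Hypothetical encoding of a few sample rules. A rule body is a list of
# alternatives (logical OR); each alternative is a list of slots (logical
# AND); each slot lists the feature values it allows.
RULES = {
    "AJ0": [[["adj"], ["po"]]],
    "AT0": [[["art"]]],
    "AV0": [[["adv"], ["general", "degree"]]],
    "DT0": [[["det"], ["indf", "dem"]], [["pron"], ["dem"]]],
}


def expand(tag):
    """Return every plain feature bundle that a source tag maps to."""
    bundles = []
    for alternative in RULES[tag]:
        # product() picks one value per slot, in slot order.
        for combo in product(*alternative):
            bundles.append(" & ".join(combo))
    return bundles


def cardinality(tag):
    """Label a rule 1:1 or 1:n by counting its expanded bundles."""
    return "1:1" if len(expand(tag)) == 1 else "1:n"


if __name__ == "__main__":
    for tag in RULES:
        print(tag, cardinality(tag), expand(tag))
```

Under these assumptions, DT0 expands to three bundles (det & indf, det & dem, pron & dem), which is why a query built from it cannot be a simple 1:1 translation.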

The following are a few sample entries from the exception lexicon:

  [no]      << [pos = 'AT0'] >> [det & indf].
  [that]    << [pos = 'CJT'] >> [conj & subord '|' pron & rel].
  [of]      << [pos = 'PRF'] >> [conj & subord].
  ['bound'] << [pos = 'VVN'] >> [verb & s_aux].
  ['going'] << [pos = 'VVG'] >> [verb & s_aux].
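The interplay between the general rules and the exception lexicon can be sketched as a two-step lookup: a (word form, tag) pair is checked against the exceptions first, and the general rule for the tag applies only if no exception matches. This is an illustrative Python sketch under that assumption. The names `GENERAL`, `EXCEPTIONS` and `map_token` are hypothetical, and lowercasing the word form before lookup is an assumption not stated in the rules.

```python
# General rules: source tag -> target description (copied from the samples).
GENERAL = {
    "AT0": "art",
    "AJ0": "adj & po",
    "NN0": "noun & com",
}

# Exception lexicon: (word form, source tag) -> target description.
EXCEPTIONS = {
    ("no", "AT0"): "det & indf",
    ("that", "CJT"): "conj & subord | pron & rel",
    ("going", "VVG"): "verb & s_aux",
}


def map_token(word, tag):
    """Map one tagged token; lexical exceptions override the general rule."""
    key = (word.lower(), tag)  # assumption: lookup is case-insensitive
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    # Returns None when the tag has no general rule in this sample.
    return GENERAL.get(tag)


if __name__ == "__main__":
    print(map_token("the", "AT0"))
    print(map_token("no", "AT0"))
```

With this ordering, `the` tagged AT0 falls through to the general rule, whereas `no` tagged AT0 is caught by the lexicon entry.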

When these rules are processed with the LIQUY tagset mapping tool, the following output is generated; the resulting queries retrieve BNC evidence, using ELM-EN as the query language:

The latter example shows the treatment of noise and silence in non-1:1 situations.
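Noise and silence can be quantified for a single query in the usual information-retrieval sense, which is assumed here: noise is the share of retrieved tokens that do not belong to the intended category, and silence is the share of intended tokens that the query fails to retrieve. The Python sketch below uses invented token positions purely for illustration; the function name `noise_and_silence` is hypothetical.

```python
def noise_and_silence(retrieved, relevant):
    """Return (noise, silence) for one query.

    retrieved: token positions the mapped query returns
    relevant:  token positions that truly belong to the intended category
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Noise: fraction of what was retrieved that is not wanted.
    noise = (len(retrieved) - len(hits)) / len(retrieved) if retrieved else 0.0
    # Silence: fraction of what was wanted that was not retrieved.
    silence = (len(relevant) - len(hits)) / len(relevant) if relevant else 0.0
    return noise, silence


if __name__ == "__main__":
    # Invented positions: 4 tokens retrieved, 5 truly relevant, 3 in common.
    print(noise_and_silence({1, 2, 3, 4}, {2, 3, 4, 5, 6}))
```

In a 1:n or n:m mapping, a query that is too broad raises noise and one that is too narrow raises silence, so the two figures are best read together.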