
The tagset mapping exercise

 

The tagset mapping exercise was carried out with the following objectives:

Mapping rules cover 1:1, 1:n, n:1 and n:m correspondences between source and target tags. A small-scale exception lexicon was created for idiosyncratic cases (e.g. that tagged as `IN' (= preposition) in the UPenn tagset).

The following are a few sample mapping rules for the BNC:

cqp_name(upenn, 'BNC').
[pos = 'AJ0']   =>      [adj & po].
[pos = 'AJC']   =>      [adj & comp].
[pos = 'AJS']   =>      [adj & sup].
[pos = 'AT0']   =>      [art].
[pos = 'AV0']   =>      [adv & (general '|' degree)].
[pos = 'CJC']   =>      [conj & coord].
[pos = 'CJS']   =>      [conj & subord].
[pos = 'CRD']   =>      [numeral & card].
[pos = 'DT0']   =>      [det & (indf '|' dem) '|' pron & dem].
[pos = 'DTQ']   =>      [det & wh].
[pos = 'EX0']   =>      [unique & existential].
[pos = 'NN0']   =>      [noun & com].
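
To make the difference between 1:1 and 1:n rules concrete, a few of the sample rules can be sketched in Python. This is only an illustration and not the mapping tool's own representation: the names RULES, map_tag and cardinality are hypothetical, and the sketch assumes that `&' binds more tightly than `|', so that a right-hand side such as [adv & (general '|' degree)] expands into two alternative feature sets.

```python
# Hypothetical encoding of some of the sample rules: each BNC tag maps to a
# list of alternative target descriptions, each one a set of features.
# Assumption: '&' binds tighter than '|', so disjunctions are expanded.
RULES = {
    "AJ0": [frozenset({"adj", "po"})],
    "AJC": [frozenset({"adj", "comp"})],
    "AV0": [frozenset({"adv", "general"}), frozenset({"adv", "degree"})],
    "DT0": [
        frozenset({"det", "indf"}),
        frozenset({"det", "dem"}),
        frozenset({"pron", "dem"}),
    ],
    "NN0": [frozenset({"noun", "com"})],
}


def map_tag(tag):
    """Return the alternative target descriptions for a source tag."""
    return RULES.get(tag, [])


def cardinality(tag):
    """Classify a rule by its number of target alternatives."""
    alternatives = map_tag(tag)
    if not alternatives:
        return "unmapped"
    return "1:1" if len(alternatives) == 1 else "1:n"
```

Under these assumptions, cardinality("AJ0") yields "1:1", whereas cardinality("DT0") yields "1:n", because DT0 corresponds to three alternative target descriptions.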

The following are a few sample entries from the exception lexicon:

[no]      << [pos = 'AT0'] >> [det & indf].
[that]    << [pos = 'CJT'] >> [conj & subord '|' pron & rel].
[of]      << [pos = 'PRF'] >> [conj & subord].
['bound'] << [pos = 'VVN'] >> [verb & s_aux].
['going'] << [pos = 'VVG'] >> [verb & s_aux].
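
The interplay between the exception lexicon and the general rules can be sketched as a lookup with fallback. The Python fragment below is a minimal sketch under stated assumptions, not a description of the actual tool: it assumes that an entry of the form [word] << [pos = tag] >> [target] overrides the general rule for that word form and source tag, and that word forms are compared case-insensitively. The names GENERAL, EXCEPTIONS and map_token are hypothetical.

```python
# General rules: source tag -> alternative target descriptions.
GENERAL = {
    "AT0": [("art",)],
    "NN0": [("noun", "com")],
}

# Exception lexicon: (word form, source tag) -> alternative target descriptions.
# Assumption: an exception entry takes precedence over the general rule.
EXCEPTIONS = {
    ("no", "AT0"): [("det", "indf")],
    ("that", "CJT"): [("conj", "subord"), ("pron", "rel")],
    ("bound", "VVN"): [("verb", "s_aux")],
    ("going", "VVG"): [("verb", "s_aux")],
}


def map_token(word, tag):
    """Map one tagged token, consulting the exception lexicon first."""
    key = (word.lower(), tag)  # assumption: case-insensitive word match
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    # Fall back to the general rule; an empty list means no rule applies.
    return GENERAL.get(tag, [])
```

With these tables, map_token("no", "AT0") returns the exceptional description (det, indf), while any other AT0 token such as "the" falls back to the general rule and is mapped to (art).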

When used with the LIQUY tagset mapping tool, these entries generate the following output; the queries retrieve BNC evidence, with ELM-EN serving as the query language:

The latter example shows the treatment of noise and silence in non-1:1 situations.