
Parsing

   
Introduction

In the context of natural language processing, parsing may be defined as the process of assigning structural descriptions to sequences of words in a natural language (or to sequences of symbols derived from word sequences). Just what sort of structural description is assigned, and on what grounds, depends on the grammar (a description language plus a set of structural constraints) according to which the parser attempts to analyse the symbol sequences presented to it. Put another way, a parser accepts as input a sequence of words (or their surrogates) in some language, together with an abstract description of the possible structural relations that may hold between words or sequences of words in that language, and produces as output zero or more structural descriptions of the input as permitted by the structural rule set. There will be zero descriptions if the input sequence cannot be analysed by the grammar (i.e. it is ungrammatical) or if the parser is incomplete (i.e. it fails to find all of the structure the grammar permits). There will be more than one description if the input is ambiguous with respect to the grammar, i.e. if the grammar permits more than one analysis of the input.
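To make the zero/one/many outcomes concrete, here is a minimal sketch (the toy grammar and the example sentences are invented for illustration, not taken from the text): a CYK-style chart recogniser that counts how many analyses a small ambiguous grammar assigns to an input.

```python
from collections import defaultdict

# Toy CNF grammar, invented for illustration (not from the text).
BINARY = {                       # (B, C) -> list of A such that A -> B C
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],        # attach PP to the verb phrase...
    ("Det", "N"): ["NP"],
    ("NP", "PP"): ["NP"],        # ...or to the noun phrase: ambiguity
    ("P", "NP"): ["PP"],
}
LEXICAL = {                      # word -> list of A such that A -> word
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(words, start="S"):
    """CYK chart: chart[i, j][A] = number of ways A derives words[i:j]."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for a in LEXICAL.get(w, []):
            chart[i, i + 1][a] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split words[i:j] at k
                for b, nb in chart[i, k].items():
                    for c, nc in chart[k, j].items():
                        for a in BINARY.get((b, c), []):
                            chart[i, j][a] += nb * nc
    return chart[0, n][start]

print(count_parses("I saw the man with the telescope".split()))  # 2: ambiguous
print(count_parses("saw the I".split()))                         # 0: ungrammatical
```

The first sentence receives two analyses (the classic prepositional phrase attachment ambiguity); the second receives none, i.e. it is ungrammatical with respect to this toy grammar.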

The input symbol sequence to a parser may or may not consist solely of words in a natural language. Leaving aside the parsing of artificial languages (e.g. programming languages or logical languages), of annotated documents (e.g. SGML), and of non-linguistic sequences such as gene sequences, parsing in NLP may be of word sequences, of part-of-speech tag sequences, or of sequences of complex symbols such as feature bundles (e.g. where a word has been replaced by a set of features including its orthographic form, part-of-speech, inflectional class, etc.).
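As a small illustration, the sketch below shows the same sentence as each of these three kinds of parser input (the tags are Penn-style and the feature names are invented for this example):

```python
# The same sentence as the three kinds of parser input mentioned above.
# Tags are Penn-style; feature names are invented, for illustration only.
words = ["the", "dog", "walked"]                 # word sequence
tags = ["DT", "NN", "VBD"]                       # part-of-speech tag sequence
bundles = [                                      # sequence of feature bundles
    {"form": "the",    "pos": "DT",  "lemma": "the"},
    {"form": "dog",    "pos": "NN",  "lemma": "dog",  "num": "sg"},
    {"form": "walked", "pos": "VBD", "lemma": "walk", "infl": "past"},
]
```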

Generally speaking, the motivation for parsing is the belief that grammatical structure contributes to meaning and that discovering the grammatical structure of an NL word sequence is a necessary step in determining the meaning of the sequence. In some parsers the construction of a meaning representation is carried out in parallel with the derivation of a structural analysis according to the grammar.

Applications of parsing include everything from simple phrase finding, e.g. for proper name recognition (§5.4), to full semantic analysis of text, e.g. for information extraction (§4.3) or machine translation (§4.1).

   
Survey of Approaches

Parsing has been of central concern in computational linguistics and related areas of application for well over thirty years, and any attempt to review the field here is bound to be both shallow and incomplete. Approaches to parsing may be classified according to a number of characteristics of the parser.

1. Grammatical approach. Many grammatical theories which purport to describe NL have been advanced, and for most of them some sort of automatic parser has been implemented. No attempt can be made here to review all of these theories, so we limit ourselves to references to some of the grammatical theories which have been most influential in the computational analysis of language. These include:
   (a) phrase structure grammar [Cho57,Gaz85,Pol94]
   (b) tree adjoining grammar [Jos75,XTAG]
   (c) categorial grammar [Bar53,Lam58]
   (d) dependency grammar [Tes59,Mel88]
   (e) transformational grammar [Cho65]
   (f) government and binding/principles and parameters [Cho82,Hae94]
2. Deterministic/non-deterministic. Most grammars are ambiguous, and parsers need to adopt some strategy for handling ambiguity. This may range from calculating and retaining all combinatorially possible structures allowed by the grammar (usually employing an efficient data structure such as a chart, as in the CYK sketch above) to discarding all but one possibility, pruning either on probabilistic evidence (probabilities are associated with grammar rules and the parser manipulates these) or on psycholinguistic evidence.
3. Control strategy. A parser must decide between parsing a sentence left-to-right or right-to-left, between parsing bottom-up in a data-driven mode or top-down in an expectation-driven mode, and between depth-first and breadth-first exploration of its search space. One point in this design space is sketched after this list.
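To make item 3 concrete, here is a minimal sketch of one point in the design space: a left-to-right, top-down (expectation-driven), depth-first backtracking recogniser. The toy grammar and all names are invented for illustration; the CYK sketch above occupies the opposite corner of the space, working bottom-up over spans.

```python
# Toy grammar, invented for illustration: nonterminal -> list of alternatives.
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "NP":   [["Det", "N"], ["Name"]],
    "VP":   [["V", "NP"], ["V"]],
    "Det":  [["the"]],
    "N":    [["dog"]],
    "Name": [["Kim"]],
    "V":    [["barked"], ["saw"]],
}

def parse(symbols, words):
    """Left-to-right, top-down, depth-first: always expand the leftmost
    expected symbol; recursion gives backtracking when a choice dead-ends.
    Yields the unconsumed remainder of `words` for each successful way of
    matching `symbols` against a prefix of `words`."""
    if not symbols:
        yield words
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Nonterminal: try each alternative rule in turn. (A left-recursive
        # rule such as NP -> NP PP would loop forever here, a classic
        # limitation of top-down parsing.)
        for rhs in GRAMMAR[first]:
            yield from parse(rhs + rest, words)
    elif words and words[0] == first:
        # Terminal: must match the next input word.
        yield from parse(rest, words[1:])

def recognise(sentence):
    return any(rest == [] for rest in parse(["S"], sentence.split()))

print(recognise("the dog barked"))   # True
print(recognise("Kim saw the dog"))  # True
print(recognise("dog the barked"))   # False
```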

While, as noted, for most grammatical theories some attempt has been made to implement an automatic parser, most current applied computational work is taking place in the phrase structure framework and can be described under two headings:

1. Finite state/chunking approaches. For identifying simple phrase structure (either no recursion, or only left- or right-recursion) a regular phrase structure grammar is sufficient, and it can be implemented by a finite state automaton, which is extremely efficient. Such devices, or cascades of such devices, have recently attracted interest because they provide robust, high-speed mechanisms for detecting basic phrase structure in large volumes of text. See, e.g., [Kar96,Abn96], and the chunker sketch after this list.
2. Augmented CFG approaches. Phrase structure approaches based on context-free grammars augmented with feature structures have also been pursued at some length in applied systems, as they offer greater expressive capacity without necessarily becoming computationally infeasible. See, e.g., [Als92,Tom87], and the unification sketch below.
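A minimal sketch of the finite-state idea in item 1 (the tagset is Penn-style; the sentence and the pattern are invented for illustration, not a standard chunker): a non-recursive noun-phrase pattern compiled to a regular expression, i.e. a finite state automaton, and run over a part-of-speech tag sequence.

```python
import re

# Part-of-speech tagged input (Penn-style tags); invented example sentence.
tagged = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"),
          ("jumped", "VBD"), ("over", "IN"),
          ("a", "DT"), ("lazy", "JJ"), ("dog", "NN")]

# Encode the tag sequence as a string so that one compiled regular
# expression (equivalently, a finite state automaton) can scan it.
tag_string = " ".join(tag for _, tag in tagged)

# Base (non-recursive) NP chunk: optional determiner, any number of
# adjectives, one or more nouns (NN, NNS, NNP or NNPS).
NP = re.compile(r"(?:DT )?(?:JJ )*(?:NNP?S? ?)+")

for m in NP.finditer(tag_string):
    # Map character offsets back to token indices to recover the words.
    start = tag_string[:m.start()].count(" ")
    end = start + m.group().strip().count(" ") + 1
    print([w for w, _ in tagged[start:end]])
# -> ['the', 'quick', 'fox'] then ['a', 'lazy', 'dog']
```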
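And a minimal sketch of the core operation behind item 2: unification of the feature structures that augment each context-free rule. This uses flat feature bundles with invented names; real unification grammars use recursive feature structures and variables. The idea is that a backbone rule such as S -> NP VP is allowed to apply only if the features of its daughters unify, e.g. on number agreement.

```python
def unify(f, g):
    """Unify two flat feature bundles; None signals failure. (Real
    unification grammars use recursive feature structures and variables.)"""
    merged = dict(f)
    for feature, value in g.items():
        if feature in merged and merged[feature] != value:
            return None                  # feature clash: the rule is blocked
        merged[feature] = value
    return merged

# Context-free backbone rule S -> NP VP, augmented with the constraint
# that subject and verb phrase must unify on "num" (number agreement).
np    = {"cat": "NP", "num": "sg"}   # "the dog"
vp_sg = {"cat": "VP", "num": "sg"}   # "barks"
vp_pl = {"cat": "VP", "num": "pl"}   # "bark"

print(unify({"num": np["num"]}, {"num": vp_sg["num"]}))  # {'num': 'sg'}: applies
print(unify({"num": np["num"]}, {"num": vp_pl["num"]}))  # None: *"the dog bark"
```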

   
Relevant Notions of Lexical Semantics

Depending on the grammar a parser uses to analyse its input, lexical semantics may play no role whatsoever (e.g. in parsing syntactic tag sequences) or an extensive role, through, e.g., the application of semantic constraints or preferences supplied via lexical entries. Lexicalised grammars, such as recent versions of HPSG [Pol94] and LTAG [Jos91], provide natural frameworks in which complex semantic constraints may be expressed. But simpler frameworks, e.g. tag-based finite-state or CFG parsing schemes, may be adapted to take simple lexical semantic information into account by using semantic tag classes as well as part-of-speech tags (e.g. tagging nouns as LOCATION, PERSON or ORGANISATION). Many so-called pattern-matching information extraction systems use this sort of lexical semantic information, employing patterns triggered by the presence of key lexical items together with surrounding fillers of the appropriate semantic type; a sketch of the idea follows. Such approaches tend to be highly domain- and genre-specific.
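A minimal sketch of such a pattern (the semantic tags, the trigger word and the output slots are all invented for illustration): tokens that a prior tagger has marked as PERSON or ORGANISATION fill the slots of a pattern keyed on the lexical trigger "joined".

```python
import re

# Tokens with tags; a prior tagger is assumed to have replaced the plain
# noun tags of some proper nouns with semantic classes. All names invented.
tagged = [("Smith", "PERSON"), ("joined", "VBD"),
          ("Acme", "ORGANISATION"), ("yesterday", "RB")]

text = " ".join(f"{word}/{tag}" for word, tag in tagged)
# -> "Smith/PERSON joined/VBD Acme/ORGANISATION yesterday/RB"

# Pattern triggered by the key lexical item "joined", with surrounding
# fillers of the appropriate semantic types.
HIRE = re.compile(r"(\S+)/PERSON joined/\S+ (\S+)/ORGANISATION")

m = HIRE.search(text)
if m:
    print({"event": "hire", "person": m.group(1), "org": m.group(2)})
# -> {'event': 'hire', 'person': 'Smith', 'org': 'Acme'}
```

As the section notes, a pattern this specific only pays off when the domain and genre of the text are fixed in advance.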

   
NLP Applications Using Parsing Techniques

Parsing techniques are used in machine translation (§4.1), information retrieval (§4.2) and information extraction (§4.3).


