Subject: Chomsky Normal Form and parsing
From: email@example.com (Jonathan Eifrig)
Organization: The Johns Hopkins University CS Department
Date: Wed, 24 Feb 1993 00:29:38 GMT
In article 93-02-127, firstname.lastname@example.org (BO/EBC/KX/ZG) writes:
>I have some more entangled(?) questions on CFG's.
>This time it's Chomsky and Greibach normal forms I'm interested in.
> [definition of CNF grammars deleted]
>The question is: what's the practical use of this?
>Can we parse this Chomsky Normal Form for sure using some parsing method?
>[added later -John]
>I just found a comment on this in a comp.compilers article from May 1992:
Jonathan Eifrig (email@example.com) writes:
> We don't so much want to recognize a
> *language* as to generate a parse tree of the input for a particular
> grammar. The *grammar* itself is important; if one has to contort the
> "natural" grammar (i.e., the one that matches the semantic grouping) of
> language in order to accomodate our parsing scheme, this is a serious dis-
> advantage. This is why, for example, we don't just convert our ambiguous
> grammars into Chomsky normal form and apply the obvious dynamic
> programming parsing scheme. We'll see in a minute that both LL(k) and
> LR(1) parsing schemes suffer from this problem.
Wow! Let me tell you, seeing an article you posted almost a year
ago makes one sit up and take notice! Stop the compliments, before they
fool me into writing a book. :-)
As far as the parsing scheme goes, it's very simple. (Like most
algorithms, it's obvious once you see the trick.) Consider a grammar in
CNF, and an input string a of length n. To answer "Is a in the language
generated by the grammar?", we do the following:
Let a[i,j] be the substring of the input from position i to
position j (1 <= i <= j <= n). We build a table telling us, for every
i and j, which (if any) of the grammar's non-terminals could generate the
substring a[i,j]. Once we have the table, we just check to see if the
start symbol could have generated a[1,n].
We build the table by induction on the length of the substrings
a[i,j]. For substrings of length 1 this is easy: A could generate a[i,i]
if and only if A -> a_i is a production of the grammar. For longer
substrings, we just check exhaustively: for each production A -> B C, we
check to see if there exists a split point k (with i <= k < j) such that
B generates a[i,k] and C generates a[k+1,j]; since these substrings are
shorter than a[i,j], those answers are already in the table.
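The table-filling scheme above (the CYK algorithm) can be sketched in a few
lines of Python. The grammar representation here is an assumption of mine,
not something from the original post: unary rules A -> a as a dict from
terminal to the set of non-terminals, and binary rules A -> B C as a list
of (A, B, C) triples.

```python
def cyk_accepts(tokens, unary, binary, start):
    """Return True iff `tokens` is derivable from `start`.

    unary:  dict mapping a terminal to the set of non-terminals A with A -> a
    binary: iterable of (A, B, C) triples for productions A -> B C
    start:  the start symbol
    (Representation is a hypothetical choice, not fixed by the post.)
    """
    n = len(tokens)
    if n == 0:
        return False  # a CNF grammar never derives the empty string
    # table[i][j] = set of non-terminals generating tokens[i..j] inclusive
    table = [[set() for _ in range(n)] for _ in range(n)]
    # Length-1 substrings: A generates tokens[i] iff A -> tokens[i].
    for i, tok in enumerate(tokens):
        table[i][i] = set(unary.get(tok, ()))
    # Longer substrings, by increasing length; try every split point k.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for (a, b, c) in binary:
                    if b in table[i][k] and c in table[k + 1][j]:
                        table[i][j].add(a)
    return start in table[0][n - 1]
```

For example, with the CNF grammar S -> A T | A B, T -> S B, A -> a, B -> b
(which generates a^n b^n), `cyk_accepts(list('aabb'), ...)` answers yes and
`cyk_accepts(list('aab'), ...)` answers no.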
A relatively straightforward counting argument shows that this
brute-force approach is even "efficient," at least as far as theorists are
concerned: for a fixed grammar it takes worst-case O(n^3) time, since the
table has O(n^2) entries and filling each one means trying O(n) split
points; the table itself takes only O(n^2) space. This is well-known
automata theory stuff.
(In fact, it was one of the questions on my qualifier! :-)) For a more
detailed explanation, see my favorite baby automata book "An Introduction
to Formal Languages and Automata" by Linz.
Of course, these are the wrong sorts of parsing questions to ask,
in practice. As I argued above, there's very little use for a parser that
just answers "Yes" or "No"; what one _really_ needs is a parser that
generates a parse tree of the input.
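The yes/no table extends directly to tree-building: record, for each
non-terminal placed in a cell, which production and split point put it
there, then walk those backpointers from the start symbol. A sketch under
the same assumed grammar representation (unary rules as a dict from
terminal to non-terminals, binary rules as (A, B, C) triples; these names
are mine, not from the post):

```python
def cyk_parse(tokens, unary, binary, start):
    """Return one parse tree as nested tuples (A, ...children), or None."""
    n = len(tokens)
    if n == 0:
        return None
    # table[i][j] maps each derivable non-terminal over tokens[i..j]
    # to a backpointer: either the terminal (leaf) or (k, B, C).
    table = [[{} for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        for a in unary.get(tok, ()):
            table[i][i][a] = tok  # leaf: A -> tok
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for (a, b, c) in binary:
                    # Keep only the first derivation found for each cell;
                    # an ambiguous grammar may admit others.
                    if a not in table[i][j] and b in table[i][k] \
                            and c in table[k + 1][j]:
                        table[i][j][a] = (k, b, c)

    def build(i, j, a):
        entry = table[i][j][a]
        if i == j:
            return (a, entry)          # leaf node
        k, b, c = entry
        return (a, build(i, k, b), build(k + 1, j, c))

    return build(0, n - 1, start) if start in table[0][n - 1] else None
```

With the a^n b^n grammar S -> A T | A B, T -> S B, A -> a, B -> b, parsing
"ab" yields the tree ('S', ('A', 'a'), ('B', 'b')).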
Jack Eifrig (firstname.lastname@example.org) The Johns Hopkins University, C.S. Dept.