Related articles |
---|
Parser Recognition Strength pt93mhe@student.hk-r.se (Martin Hellspong) (1997-09-12) |
Parser Recognition Strength clark@quarry.zk3.dec.com (1997-09-15) |
Re: Parser Recognition Strength chrisd@etcons.com (cathy hynson) (1997-09-23) |
Re: Parser Recognition Strength Bostjan.Slivnik@ijs.si (Bostjan Slivnik) (1997-09-23) |
From: | clark@quarry.zk3.dec.com (Chris Clark USG) |
Newsgroups: | comp.compilers |
Date: | 15 Sep 1997 21:38:16 -0400 |
Organization: | Digital Equipment Corporation - Marlboro, MA |
References: | <34150373.1EA2@student.hk-r.se> <3415746C.34748C3F@magelang.com> <34180010.19E929E6@student.hk-r.se> 97-09-041 |
Keywords: | parse, LL(1) |
Martin wrote a message explaning a new algorithm called "Evolutionary
Parsing" on comp.compilers.tools.pccts and requested on comp.compilers
info on hard to parse grammars (and info about other parsers which
could handle a hard case they had suggested).
> I am involved in writing my software engineering master thesis
> together with a classmate, and we have developed a quite powerful (in
> our own opinion, anyway) LL parser, and we want to compare
> it to other parsers and algorithms available, with your help...
Algorithm description (from pccts posting):
> 1. First we match the "A", and expect to see a "mid:
>
> top ::= A * mid B
> mid ::= alfa | beta
> alfa::= pre B
> beta::= pre
> pre ::= X
>
> 2. We try _both_ alternatives, and get two "current position" *'s (I
> just number them for readability). We will now try to match one
> "alfa", and one "beta".
>
> top ::= A mid B
> mid ::= *1 alfa | *2 beta
> alfa::= pre B
> beta::= pre
> pre ::= X
. . .
Congratulations, if your parser does what you claim it does [and works
as you claim it works in the posting on comp.compilers.tools.pccts],
then you have figured out how to do a compiled version of Early
parsing. The stars in your explanation match the dots normally used
to indicate items in a description of an Early parser. The point when
your *'s "die" matches when an Early parser eliminates the items.
> In order for you to know something about the power of our algorithm,
> we present a grammar that we can parse, but that we have not found
> any other (see below) parser that can:
>
> top ::= A mid suf
> mid ::= pre suf | pre
> pre ::= B
> suf ::= C
>
> For instance, any LL(k) or LR(k) grammar will fail if you make "pre"
> or "suf" contain more than k tokens.
:
> * Question: Is anyone aware of another parser/algorithm that can
> handle grammars like that?
>
> We are also aware that it is possible to rewrite the grammar so that it
> becomes "easier" to parse, but since our parser can handle this
> grammar without rewriting it, we would be very interested in knowing
> if any other parser can handle the grammar *without* rewriting it.
As to other parser generators which can handle your hard cases, I have
a version of Yacc++ which can handle it, but it is not the version
which is commercially distributed, yet. Also do a web search on
"Tomita" parsing (http:://www.de.relator.research.ec.org/resources/tom)
and on "dotted grammars" (no bookmark for that one) and I think you
will find two other parsers which can handle your hard case.
> * Challenge: We think our parser can handle *any* context-free grammar
> correctly, so we challenge everybody to present us with a really
> difficult grammar (in terms of parsing complexity, not size) that
> you think our parser cannot handle.
> We ARE aware that there are algorithms, (we know of CYK and Earley),
> that can handle any context-free grammar, but we have not found any
> examples of grammars parseable with CKY that we cannot parse. Please
> do hit us with a serious CKY-only CFG!
>
Let me offer a few suggestions for how to modify your hard case to
make it "harder".
1) Allow the Kleene (or positive) closure operator in your pre and suf
rules. A closure operator allows the strings for each reduction to be
indefinitely long and thus not LL(k) or LR(k) for any k.
top: A mid suf;
mid: pre suf | pre;
pre: B+; // one or more B's
suf: C* D; // zero or more C's and then a D
2) Use infix recursion (the kind used to represent parenthesization)
to prevent conversion of left or right recursion into simple
iteration.
top: A mid suf;
mid: pre suf | pre;
pre: B pre B | C; // a C nested in pairs of B's
suf: C suf C | B; // a B nested in pairs of C's
3) Use mutual recursion with a common prefix to guarantee that the
parser must resolve the conflicts backwards. I think the next cases
are unambiguous, but we are at the point where my mind just says that
it is too complex to reason about.
top: A mid suf;
mid: pre suf | pre;
pre: B suf B | C; // the next two nestings alternate.
suf: B pre C | C;
or
top: A mid suf;
mid: pre suf | pre;
pre: B suf B | C; // the next two nestings alternate.
suf: B pre C | C C ; // must unwind to determine if middle is C or C C
4) Use alternating prefixes to break LALR merging.
top: A mid suf;
mid: pre suf | pre;
pre: B suf B | C suf C | D;
suf: B pre C | C pre B | D;
or
top : A mid suf;
mid: pre suf | pre;
pre: B suf B | C pre C | D;
suf: B pre C | C suf B | D;
When you come up with your set of torture cases, let me know. I would
like to have a decent testbase for our version also. BTW, if I
haven't klutzed the above grammars (and made them ambiguous), they all
have the nice property of being parseable in linear time (i.e. you can
determine the correct parse by one linear forward pass and one linear
backward pass). I have a hunch that all CFG's satisfy that property.
I would also recommend asking on comp.ai.nat-lang if you want to
contact people interested in languages which are not LR (and thus
include your hard case). Of course, they are also primarily
interested in grammars with are CSG rather than CFG.
-Chris
======================================================================
Chris Clark email: cfc@world.std.com
weekly: clark@zk3.dec.com
Compiler Resources, Inc. compres@world.std.com
3 Proctor St answering machine: (508)435-5016
Hopkinton, MA 01748 USA fax: (508)435-4847
--
Return to the
comp.compilers page.
Search the
comp.compilers archives again.