Re: simple vs. complex parsers



From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 18 May 2003 01:32:13 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 03-02-125 03-02-147 03-03-043 03-03-061 03-03-103 03-04-006 03-04-028 03-04-046 03-04-066 03-04-116 03-05-103
Keywords: design
Posted-Date: 18 May 2003 01:32:12 EDT

Robert A Duff <bobduff@shell01.TheWorld.com> writes:


> I truly hope I have not offended you (Chris Clark) by denigrating parser
> generator technology! My point was really more of an attack on existing
> programming languages.


Not offended, not even close. I think your point about attacking the
sad state of programming languages is not amiss. I just think you
misplaced the root cause.


> Anyway, let me say this: I could write a hand-written recursive-descent
> parser for (some dialect of) Lisp, better and cheaper than you could
> create a tool-generated parser for, say, C++ or Ada.


Yes, probably not for Common Lisp, but probably for Scheme. And yes,
there is a lesson there. No question: simpler languages are better!


Me:
> > And the reason this is the cause is that there is no checking of a
> > hand-written parser to verify that the language it parses has a simple
> > unambiguous grammar. In particular, it is easy in a hand-written
> > parser to introduce a special case in one branch of an if-tree that
> > causes the parses of that branch of the if-tree to accept a slightly
> > different syntax than the parses of other branches. However, when one
> > attempts to rewrite this with an unambiguous grammar, one discovers
> > that one has to double the number of productions in the sub-tree.


Bob:
> I don't understand that. Perhaps you can give an example?
> I'm not even sure whether the 'if's you're talking about are in the
> hand-written parser, or in the language being parsed...


I meant in the parser. Often it is the case that one has one
particular place where one would like to make something easier, and
one just introduces a local patch in the hand-written parser to enable
that feature in the one spot where it is desired. The problem is that
such local hand-written patches often tend to actually make the
"formal" version of the grammar much harder to write.


Just recently, someone was asking how to distinguish expressions
without top-level parentheses from ones that have outermost
parentheses--seemingly a trivial task. It is certainly something one
could easily hack into a recursive-descent parser. The problem is
that, generally, when doing such things one needs to double the number
of expression productions to express the concept formally. Doubling
anything is generally a way of making that thing harder. Moreover, it
also tempts one into short-cuts, so that one hasn't really doubled the
work.


So, to my mind, such a feature in a programming language is
ill-conceived. The person should really re-think what they are
attempting to do. It is unlikely that the absence of top-level
parentheses is a good signal for whatever semantics they want.


However, if one were working on a hand-written recursive-descent
parser for the language, one might simply introduce the hack that
checks for the top-level parentheses in the context where the semantic
wrinkle matters and never think twice. Doing so just leaves a legacy
of more work for everyone who follows and wants to use a formal tool.


And perhaps the problem is in the formalism(s) of our tools. I might
feel differently if there were an easy way of writing a grammar that
captured the lack-of-outer-parentheses idiom easily, clearly, and
economically (and perhaps there is and I'm just not aware of it--one
poster suggested something that looked not-too-bad to me for this one
case). However, in general, I feel that one *SHOULD* write all one's
parsers using tools, because if a construct is hard to write with a
tool (one that protects you formally), then you probably don't have
any business using that construct as part of your syntax.


That's not to say that there aren't a lot of bad grammars one can
write using tools--grammars that make the resulting languages hard for
humans to understand--but a tool does protect us from certain abuses:
the ones that the grammar generators make difficult to write.


> My approach to language design is to write a grammar before writing a
> parser. Not to start hacking on a parser and see what sorts of input it
> happens not to choke on!


Good for you. I'm convinced that far too many people don't do that.
My problem is rarely with designers who work through what they are
doing. It is more with crude-axe hackers who simply kludge something
together without any thought of the implications of their half-baked
ideas. And that's my problem with hand-written recursive descent: it
is simply too easy a tool, one that allows too many ill-conceived
notions to be partially implemented without cluing the clueless in
that they haven't dotted their i's nor crossed their t's.


Me:
> > To me this clearly indicates that the original C++ translator was
> > written with a hand-written recursive-descent parser.


Bob:
> Does anybody here know if it was? I have no idea.
>
> Do you really think Stroustrup was surprised by this ambiguity?
> I have the impression he knew about it, and designed it that way
> anyway. I think he mentioned it in his book.


Well, then so much the worse. I always tried to give Stroustrup the
benefit of the doubt in these cases, since I knew he had some trouble
expressing his ideas in Yacc and switched to recursive-descent, or so
I recall being told. C++ was bound to be a big and problematic
language no matter what he did--it's just too big a cat to skin.
However, if he knowingly allowed some of the problems in, ones that
had simpler and cleaner solutions, then he deserves more cursing than
I will ever be able to muster.


It's bad enough when an inexperienced neophyte gets lured into the
trap of hand-writing a recursive-descent parser because it seems so
easy and no one has taught them better, but a seasoned professional
should know that reliability and robustness are not things we should
trust to an error-prone way of coding.


> LR parsing, however, is something of a mystery to me. I *sort of*
> understand how it works, but slogging through those tables is a
> nightmare.


Then I would recommend a parser generator that outputs recursive-
descent code, e.g., ANTLR, JavaCC, Meta-S, or RDP. (As far as I can
tell, all of the mentioned tools are more than just good; most are
excellent and some are awe-inspiring.)


On the other hand, I rarely look at the output of the LR tool I
use--not quite true: as a maintainer, I get to look at it often when a
user says "I think it might be broken in this case", but essentially
never for grammars I write. Part of that is because I tend to write
similar grammars all the time, and I know a few idioms/patterns that
tend to solve the grammatical problems I face. As a result, I don't
often have to debug my grammars--they parse what I want. And, as
such, were I to design a new language, it would use only those
features I already find it easy to write grammars for--and not all of
them, as some things I can easily make a grammar do are not fit for
human consumption.


Perhaps a hand-written recursive-descent parser writer would say the
same thing: "I only use the tricks that are easy for me to write, and
not all of them." However, there is one key difference. When I get
done, I have a tool that formally verifies that I haven't done
something stupid by accident, just as my strongly-typed programming
language keeps me from making certain errors by accident (and punishes
me unmercifully when I do).


The hand-written parser writer is like the person who uses Lisp to
extend Emacs (and yes, I do that). The writer has no guarantee that
the code won't wander down some unexpected path and give a runtime
error. You can get the code right for simple cases, but eventually
you write something complex enough that some error slips in, and there
is nothing to prevent you from shipping that child with the error
uncorrected.


Parser generators are not a panacea (neither are strong-typing
systems), and sometimes they are a pain, but they do make a difference
in catching certain typographical errors. A difference that helps me
sleep at night. And one that I wish more people took advantage of!


I wrote about hoping for more powerful tools, to which Bob replied:
> Hmm. I'm not sure I agree. Do we really want tools that can parse
> ambiguous grammars, if that encourages language designers to create
> ambiguous grammars? Are ambiguous grammars good for human readers?


No, I don't like ambiguous grammars (generally). In fact, my problem
with Tomita (GLR) parsers is that they don't distinguish between
grammars that they can parse unambiguously and ones that they can't.


That said, I think that if a grammar can be used in a way that makes
what it parses unambiguous, even if the grammar is technically
ambiguous when that interpretation of its meaning is not imposed, then
I would consider it fair game, provided that there is a tool that can
give the proper assurance that the unambiguous interpretation is truly
unambiguous.


As such, I like the precedence/associativity declarations of yacc for
dealing with expressions, as opposed to the more cumbersome way of
writing the expressions correctly that most LL tools require. The
yacc notation is not unproblematic, and it can be misused to write
terrible grammars. However, it makes writing a correct expression
part of a grammar so simple that it has a place in my book. Again,
the tool is letting me write something that I know is correct more
easily than something I don't trust. That is always a good thing. I
just use it cautiously, because it is not perfect.
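For concreteness, here is a minimal sketch of the shorthand in
question (the grammar and token names are mine, not from any
particular manual): one flat, technically ambiguous rule, with %left
declarations supplying the disambiguation.

    /* Yacc precedence/associativity shorthand: the expr rule is
       ambiguous by itself; the %left lines resolve every conflict,
       making '*' bind tighter than '+' and both left-associative. */
    %token NUM
    %left '+' '-'
    %left '*' '/'
    %%
    expr : expr '+' expr
         | expr '-' expr
         | expr '*' expr
         | expr '/' expr
         | '(' expr ')'
         | NUM
         ;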


(Here is something worthwhile that someone should write: a tool that
takes a yacc expression grammar with precedence and associativity and
expands it into its rigorous cousin--and vice versa. That way one
could use simple "precedence tables", i.e., * binds tighter than +,
without worrying about the precedence of + or * being misapplied in
some unrelated part of the grammar.)
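A sketch of what such an expansion might produce for the shorthand
grammar above (again, the names are my own): each precedence level
becomes a grammar layer, and left associativity becomes left
recursion.

    /* The "rigorous cousin": one nonterminal per precedence level,
       left recursion encoding left associativity; no %left needed. */
    expr   : expr '+' term   | expr '-' term   | term ;
    term   : term '*' factor | term '/' factor | factor ;
    factor : '(' expr ')'    | NUM ;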


Enough of my opinions for one day!
-Chris


*****************************************************************************
Chris Clark                    Internet : compres@world.std.com
Compiler Resources, Inc.       Web Site : http://world.std.com/~compres
3 Proctor Street               voice    : (508) 435-5016
Hopkinton, MA 01748 USA        fax      : (508) 435-4847 (24 hours)

