Re: simple vs. complex parsers


From: Robert A Duff <bobduff@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 15 May 2003 12:34:17 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 03-02-125 03-02-147 03-03-043 03-03-061 03-03-103 03-04-006 03-04-028 03-04-046 03-04-066 03-04-116
Keywords: parse, design
Posted-Date: 15 May 2003 12:34:17 EDT

Chris F Clark <cfc@world.std.com> writes:


> Robert Duff wrote:
> > If we didn't design our programming languages to have such doggone
> > complicated syntax, then we would be happy to use hand-written
> > recursive-descent parsers.
>
> Note: I would have posted this reply on April 1, but that date is
> already past this year.


I truly hope I have not offended you (Chris Clark) by denigrating parser
generator technology! My point was really more of an attack on existing
programming languages.


I agree with most of your points. I do admit that parser technology has
helped us understand what constitutes a good syntax for a programming
language.


I actually have mixed feelings. On the one hand, I'm in favor of
automating as much as we reasonably can. On the other hand, when I see
parser generators sprouting complex features just to cope with the
typedef problem in C, I feel like something's wrong with the world.
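

For anyone who hasn't fought it: the trouble is that the same token
sequence must be parsed two different ways depending on a typedef that
may be arbitrarily far away. A minimal illustration (C++ here, and the
names are mine):

    // The classic typedef clash: the same token shape "name * name ;"
    // parses differently depending on what the first name denotes.
    typedef int T;

    void f() {
        T * x;          // a declaration: x is a pointer to T
        (void)x;        // silence the unused-variable warning
    }

    void g(int a, int b) {
        a * b;          // an expression: a times b, result discarded
    }

    int main() {}

A parser can't choose between the two readings without asking the
symbol table, which is exactly the kind of feedback I complain about
below.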


Anyway, let me say this: I could write a hand-written recursive-descent
parser for (some dialect of) Lisp better and more cheaply than you
could create a tool-generated parser for, say, C++ or Ada. This is true
(I claim) despite the fact that you are far more expert in parsing
technology than I am! ("Better" could mean any reasonable combination
of correctness, good error recovery, efficiency, etc.)


(No fair starting from a C++ grammar that already works with your
tools!)
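

To make the Lisp claim concrete, here is roughly the entire job,
sketched in C++. Everything in it is invented for illustration (a real
reader also handles quote characters, strings, and comments), but none
of that changes the shape of the code: one small function per
production, and full error *detection* comes almost for free.

    // expr ::= atom | '(' expr* ')'
    #include <cctype>
    #include <cstdlib>
    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>

    struct Node {
        std::string atom;                         // non-empty for atoms
        std::vector<std::unique_ptr<Node>> list;  // children for lists
    };

    struct Reader {
        std::string src;
        size_t pos = 0;

        void skipSpace() {
            while (pos < src.size() && std::isspace((unsigned char)src[pos]))
                ++pos;
        }

        [[noreturn]] void fail(const char* msg) {
            std::cerr << "syntax error at offset " << pos << ": " << msg << "\n";
            std::exit(1);
        }

        std::unique_ptr<Node> expr() {
            skipSpace();
            if (pos >= src.size()) fail("unexpected end of input");
            auto n = std::make_unique<Node>();
            if (src[pos] == '(') {                // list: '(' expr* ')'
                ++pos;
                skipSpace();
                while (pos < src.size() && src[pos] != ')') {
                    n->list.push_back(expr());
                    skipSpace();
                }
                if (pos >= src.size()) fail("missing ')'");
                ++pos;                            // consume ')'
            } else if (src[pos] == ')') {
                fail("unexpected ')'");
            } else {                              // atom: run of non-delimiters
                while (pos < src.size() && !std::isspace((unsigned char)src[pos])
                       && src[pos] != '(' && src[pos] != ')')
                    n->atom += src[pos++];
            }
            return n;
        }
    };

    int main() {
        Reader r;
        r.src = "(define (sq x) (* x x))";
        std::unique_ptr<Node> tree = r.expr();
        r.skipSpace();
        if (r.pos != r.src.size()) r.fail("trailing input");
        std::cout << "top-level list with " << tree->list.size() << " elements\n";
    }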


My point was that I think programming language syntax should be closer
to the Lisp end of the spectrum than to the C++ end, although I think
Lisp is a bit *too* simple -- I feel like I'm on a ship, staring at an
endless sea of indistinguishable parentheses stretching to the horizon.
;-)


> And the reason this is the case is that there is no checking of a
> hand-written parser to verify that the language it parses has a
> simple, unambiguous grammar. In particular, it is easy in a
> hand-written parser to introduce a special case in one branch of an
> if-tree that causes that branch to accept slightly different syntax
> than the other branches do. However, when one attempts to rewrite
> this as an unambiguous grammar, one discovers that the number of
> productions in that sub-tree has to double.


I don't understand that. Perhaps you can give an example?
I'm not even sure whether the 'if's you're talking about are in the
hand-written parser, or in the language being parsed...


> This problem is compounded by the fact that it is often easy to mingle
> both syntactic and semantic checks in the hand-written parser. As a
> result, some semantic consideration, which may be the result of some
> global property of the program, influences the syntax in some obscure
> way.


"Doctor, it hurts when I ..." "So don't do that." ;-)


Seriously, I agree that parsing should be separated from semantic
analysis (whether or not the parser is hand written). And it is poor
language design to require or encourage feedback from semantic analysis
into parsing (or lexing).


> Of equal importance, the first prototype parsers for a language
> generally do very little error detection.


It seems to me that a minimal requirement for a parser, including the
first prototype for a new language, is to *detect* all syntactically
incorrect programs. Perhaps you meant error recovery, rather than error
detection?


My approach to language design is to write a grammar before writing a
parser. Not to start hacking on a parser and see what sorts of input it
happens not to choke on!


> In particular, it is difficult to disambiguate the following two
> correct lexical phrases: int ".." int and float.


This is hardly a big deal, IMHO.
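

Chris is referring, I assume, to the classic Pascal clash between a
subrange like 1..10 and a real literal like 1.5: after the initial
digits, a '.' could begin either one. One extra character of lookahead
settles it. Here is a sketch of what a hand-written lexer might do --
my own invention, not taken from anybody's actual compiler:

    #include <cassert>
    #include <cctype>
    #include <string>

    enum class Tok { Integer, Real };  // just the two cases that matter here

    // Hypothetical helper: scan a number starting at 'pos', advance
    // 'pos' past it, and report which kind of number it was.
    Tok scanNumber(const std::string& src, size_t& pos) {
        while (pos < src.size() && std::isdigit((unsigned char)src[pos]))
            ++pos;
        // A '.' continues the number only if the *next* character is
        // not also '.'; otherwise the '.' begins the ".." operator.
        if (pos + 1 < src.size() && src[pos] == '.' && src[pos + 1] != '.') {
            ++pos;                                // consume '.'
            while (pos < src.size() && std::isdigit((unsigned char)src[pos]))
                ++pos;
            return Tok::Real;
        }
        return Tok::Integer;
    }

    int main() {
        std::string a = "1..10";
        size_t p = 0;
        assert(scanNumber(a, p) == Tok::Integer && p == 1);  // stops before ".."

        std::string b = "1.5";
        p = 0;
        assert(scanNumber(b, p) == Tok::Real && p == 3);     // consumed all of it
    }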


> And if someone as competent as Niklaus Wirth cannot design a language
> as simple as Pascal without introducing such problems (and his later
> revisions, such as Modula[-2], do not fare much better, having their
> own little syntactic quirks), who truly believes that they can do
> better on their own? K&R C is of similar complexity to Pascal and has
> its own syntactic problems, where specific implementation decisions
> show through in the syntax and variant dialects.


I think K&R C is rather more painful to parse than Pascal.


>... I believe similar
> problems beset Knuth in TeX.


TeX is one of the worst programming languages I have ever seen.
I'm amazed that someone as brilliant as Knuth can be so bad at
language design. For example, you can't tell syntactically how many
arguments are being passed to a macro -- it's determined at run time,
and could be different for different invocations.
Sheesh.


(Yes, I do claim TeX is a *programming* language.)


>... And, the only parser for Perl that I'm
> aware of is the one in Perl.


Another write-only language.


>... Even the highly simple Lisp has
> syntactically incompatible dialects.


Lisp suffers from the many-dialect problem. But that's a somewhat
different issue, I think. Each of those dialects has a *very* simple
syntax, compared to most languages.


> My favorite example of the problem from C++ is the syntax of
> constructors and initializers. Instead of leveraging the C
> initializer syntax, Stroustrup decided to use a function-call-like
> syntax for constructors, where the arguments to the constructor are
> enclosed in parentheses after the declaration they are initializing.
>
> To me this clearly indicates that the original C++ translator was
> written with a hand-written recursive-descent parser.


Does anybody here know if it was? I have no idea.


Do you really think Stroustrup was surprised by this ambiguity?
I have the impression he knew about it, and designed it that way
anyway. I think he mentioned it in his book.
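

For readers who haven't seen it, the ambiguity is that a declaration
with a parenthesized initializer can be indistinguishable from a
function declaration. A sketch ('Widget' is my invented name; the
ambiguity itself is the real one):

    struct Widget {
        Widget() {}
        Widget(int) {}
    };

    int n = 0;

    Widget a(n);        // an object, constructed with argument n
    Widget b();         // NOT an object: declares a function b() that
                        //   returns a Widget
    Widget c(Widget()); // also a function declaration: c takes a
                        //   (pointer to a) function returning Widget

    int main() {}

I believe that last case is what Scott Meyers dubbed the "most vexing
parse".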


>... Recursive-descent is a wonderful parsing technology,
> especially in the ease of understanding and "debugging" the resulting
> parser.


Yes!


To me, recursive descent is intuitively obvious. In fact, in the first
compiler I ever wrote, when I was 19 (a long time ago), I wrote a
recursive descent parser without having ever heard of the idea before.
Since I was assigned to write a compiler in my job, I signed up for a
night-school class in compiler writing, and when they told us about
recursive descent, I recognized it -- "so *that's* what it's called".
And when they explained left recursion, I thought, "Yup, I had that
bug." ;-)


LR parsing, however, is something of a mystery to me. I *sort of*
understand how it works, but slogging through those tables is a
nightmare.


> However, the problem is in the tendency (particularly in hand-written
> ones) to hack the problem close to its point of discovery. Local
> patches to grammars make for significant problems.
>
> The only correct methodology in my book is to modify the grammar, and
> to get it to pass through the parser generator without introducing
> conflicts or ambiguity warnings. There may still be problems with the
> grammar or parser. However, you have passed some mechanical checks
> saying that you haven't introduced any constructs that your tool
> believed it couldn't handle -- and thus it is likely that another
> writer will be able to construct a grammar for their tool too.


Point conceded.


> ----------------------------------------------------------------------
>
> My final thought is about parsing technologies. Some of the advances
> have come from trying to parse problem languages like C++. However,
> other advances have come from trying to make the parsing process
> itself simpler. Much of the extension work has been to narrow the
> gap between the grammars one can write (but which are unusable) and
> the grammars that the tools accept. The ideal goal is a parser
> generator that can take "any" grammar, infer what language it is
> intended to describe, and create a correct parser. That goal is
> unattainable, but we can get closer with tools that allow us to
> parse broader classes of languages and ones that make it easier to
> specify exactly which language is desired.


Hmm. I'm not sure I agree. Do we really want tools that can parse
ambiguous grammars, if that encourages language designers to create
ambiguous grammars? Are ambiguous grammars good for human readers?


- Bob

