Re: simple vs. complex parsers

Chris F Clark <cfc@world.std.com>
27 Apr 2003 17:14:09 -0400

          From comp.compilers

Related articles
[3 earlier articles]
Re: .NET Compiler for Interactive Fiction david.cornelson@iflibrary.com (David A. Cornelson) (2003-03-14)
Re: .NET Compiler for Interactive Fiction tbandrow@unitedsoftworks.com (2003-03-16)
Re: .NET Compiler for Interactive Fiction JeffKenton@attbi.com (Jeff Kenton) (2003-04-05)
Re: .NET Compiler for Interactive Fiction joachim_d@gmx.de (Joachim Durchholz) (2003-04-13)
Re: parsing, was .NET Compiler for Interactive Fiction rpboland@math.uwaterloo.ca (Ralph P. Boland) (2003-04-15)
Re: parsing, was .NET Compiler for Interactive Fiction bobduff@shell01.TheWorld.com (Robert A Duff) (2003-04-20)
Re: simple vs. complex parsers cfc@world.std.com (Chris F Clark) (2003-04-27)
Re: simple vs. complex parsers joachim_d@gmx.de (Joachim Durchholz) (2003-05-05)
Re: simple vs. complex parsers bobduff@shell01.TheWorld.com (Robert A Duff) (2003-05-15)
Re: simple vs. complex parsers cfc@shell01.TheWorld.com (Chris F Clark) (2003-05-18)
Re: simple vs. complex parsers schmitz@essi.fr (Sylvain Schmitz) (2003-05-18)
Re: simple vs. complex parsers mal@wyrd.be (Lieven Marchand) (2003-05-18)
Re: simple vs. complex parsers bs@research.att.com (2003-05-18)
[3 later articles]

From: Chris F Clark <cfc@world.std.com>
Newsgroups: comp.compilers
Date: 27 Apr 2003 17:14:09 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 03-02-125 03-02-147 03-03-043 03-03-061 03-03-103 03-04-006 03-04-028 03-04-046 03-04-066
Keywords: parse, design
Posted-Date: 27 Apr 2003 17:14:09 EDT

Robert Duff wrote:
> If we didn't design our programming languages to have such doggone
> complicated syntax, then we would be happy to use hand-written
> recursive-descent parsers.


Note: I would have posted this reply on April 1 if I could, but that
date is already past this year. Moreover, I'm actually truly serious
about what I say, even though I do manage to insult just about
everything I respect in the process. So, please take the criticisms in
this posting lightly. I'm hardly claiming that I could do better. I
just hope to gain a slightly better vantage point by standing on the
toes of giants, since I can't reach their shoulders....


Unfortunately, it is not "complicated syntax" that is the problem. The
problem is actually that many languages are initially parsed with
hand-written recursive-descent parsers.


And the reason is that nothing checks a hand-written parser to verify
that the language it parses has a simple, unambiguous grammar. In
particular, it is easy in a hand-written parser to introduce a special
case in one branch of an if-tree that causes the parses in that branch
to accept a slightly different syntax than the parses in the other
branches. However, when you attempt to rewrite this as an unambiguous
grammar, you discover that you have to double the number of
productions in that sub-tree.
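
To make that concrete, here is a minimal, hypothetical sketch in C++
(the toy "statement" language, the token kinds, and the Parser class
are all invented for illustration, not taken from any real compiler):
one branch of the if-tree quietly makes a semicolon optional, so the
accepted language no longer matches the single tidy "statement"
production the author thinks he wrote.

    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Token { std::string kind, text; };

    class Parser {
    public:
        explicit Parser(std::vector<Token> t) : toks(std::move(t)) {}

        bool parseStatement() {
            if (peek("print")) {
                advance();
                parseExpression();
                // Special case hacked into this branch only: the trailing
                // semicolon is optional here, because the prototype's test
                // cases happened to omit it...
                if (peek(";")) advance();
                return true;
            } else if (peek("id")) {
                advance();
                expect("=");
                parseExpression();
                expect(";");   // ...while every other branch requires it.
                return true;
            }
            return false;
        }

    private:
        void parseExpression() { expect("num"); }  // toy expression: one number
        bool peek(const std::string& k) const {
            return pos < toks.size() && toks[pos].kind == k;
        }
        void advance() { ++pos; }
        void expect(const std::string& k) {
            if (!peek(k)) { std::cerr << "expected " << k << "\n"; std::exit(1); }
            advance();
        }

        std::vector<Token> toks;
        std::size_t pos = 0;
    };

    int main() {
        // "print 1" with no semicolon parses fine, so writing down the
        // accepted language honestly needs extra productions for this branch.
        Parser ok({{"print", "print"}, {"num", "1"}});
        std::cout << ok.parseStatement() << "\n";
    }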


This problem is compounded by the fact that it is often easy to mingle
both syntactic and semantic checks in the hand-written parser. As a
result, some semantic consideration, which may be the result of some
global property of the program, influences the syntax in some obscure
way.


Of equal importance, the first prototype parsers for a language
generally do very little error detection. Moreover, the users of the
language use its features in stylized ways, often close to the
prototype author's original conception. Thus, they do not see that a
particular "feature" introduced in one part of the grammar interacts
poorly with other parts of the grammar. As a result, the feature gets
incorporated into the grammar in ways where the ambiguity goes
undetected until there is already a sizeable body of code depending on
the ambiguous syntax.


----------------------------------------------------------------------


Now, this is not a symptom of only complex languages. Even Pascal has
an instance of this problem.


(Steve_Lipscombe@amat.com wrote: Can you guess I like Pascal, which
was designed from the outset for a single pass RDP?)


In particular, it is difficult to disambiguate the following two
correct lexical phrases: int ".." int (as in the subrange "1..10") and
a real literal (as in "1.5"). It simply cannot be done with one
character of lookahead. If I recall correctly, the grammar is such
that both alternatives are never legal at the same point, so if the
lexer and the parser are written together, it is not a problem.
However, if one uses a separate lexer, that lexer needs more than one
character of lookahead to determine where tokens end.
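
Here is a minimal, hypothetical lexer sketch in C++ (token names and
structure invented for illustration) showing the two-character peek
that the problem forces on a standalone lexer: after the digits, a '.'
alone is not enough information; the lexer must also look at the
character after the '.' before committing to either token.

    #include <cctype>
    #include <cstddef>
    #include <iostream>
    #include <string>

    enum class Tok { INT, REAL, DOTDOT, END };

    struct Lexer {
        std::string src;
        std::size_t pos = 0;

        Tok next(std::string& text) {
            text.clear();
            if (pos >= src.size()) return Tok::END;
            if (src[pos] == '.' && pos + 1 < src.size() && src[pos + 1] == '.') {
                pos += 2;
                text = "..";
                return Tok::DOTDOT;
            }
            while (pos < src.size() && std::isdigit((unsigned char)src[pos]))
                text += src[pos++];
            // The crucial peek at *two* characters: '.' followed by a digit
            // means we are inside a real literal; '.' followed by another '.'
            // means the integer ends here and a ".." token comes next.
            if (pos + 1 < src.size() && src[pos] == '.' &&
                std::isdigit((unsigned char)src[pos + 1])) {
                text += src[pos++];                        // consume '.'
                while (pos < src.size() && std::isdigit((unsigned char)src[pos]))
                    text += src[pos++];
                return Tok::REAL;
            }
            return Tok::INT;
        }
    };

    int main() {
        for (const char* s : {"1..10", "1.5"}) {
            Lexer lx;
            lx.src = s;
            std::string t;
            std::cout << s << " ->";
            for (Tok k = lx.next(t); k != Tok::END; k = lx.next(t))
                std::cout << " [" << t << "]";
            std::cout << "\n";   // prints: 1..10 -> [1] [..] [10]
        }                        //         1.5 -> [1.5]
    }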


And if someone as competent as Niklaus Wirth cannot design a language
as simple as Pascal without introducing such problems (and his later
revisions such as Modula[-2] do not fare much better, having their own
little syntactic quirks), who truly believes that they can do better
on their own? K&R C is of similar complexity to Pascal and has its
own syntactic problems, where specific implementation decisions show
through in the syntax and in variant dialects. I believe similar
problems beset Knuth in TeX. And the only parser for Perl that I'm
aware of is the one in Perl. Even the highly simple Lisp has
syntactically incompatible dialects.


Of course, complicated languages with large syntaxes increase the
opportunities for these little inconsistencies to creep in. C++ is the
canonical example of that, with more syntactic warts than any other
language I have ever attempted to write a parser for.


----------------------------------------------------------------------


My favorite example of the problem from C++ is the syntax of
constructors and initializers. Instead of leveraging the C
initializer syntax, Stroustrup decided to use a function-call-like
syntax for constructors, where the arguments to the constructor are
enclosed in parentheses after the declaration they are initializing.


To me this clearly indicates that the original C++ translator was
written with a hand-written recursive-descent parser, as any parser
generator would have quickly pointed out that the rules describing
constructor initialization calls were ambiguous with function
declarations. However, that wasn't checked, and two rules for C++
parsing resulted from that ambiguity: 1) if it looks like a
declaration, it is a declaration; 2) the no-argument constructor
cannot be specified by "()"--that is reserved for a C-style function
declaration.
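
A short illustration in standard C++ of how those two rules play out
(the struct and variable names are mine, not anything from the
original translator):

    struct T {
        T() {}
        T(int) {}
    };

    int main() {
        T a(42);   // fine: a T constructed with the argument 42
        T b();     // rule 2: NOT a default-constructed T -- this declares a
                   // function named b that takes no arguments and returns T
        T c;       // the way you actually get a default-constructed T
        T d(T());  // rule 1: it can be read as a declaration, so it IS one --
                   // d is a function taking a (pointer to a) function
                   // returning T, not an object
        (void)a;
        (void)c;
        return 0;
    }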


Jim Roskind, who successfully made a conflict-free grammar for ANSI C
by resolving the typedef problem with additional parsing productions,
attempted to do the same thing for C++. However, the best he could do
was a grammar that, in his (approximate) words, "pushed the problem
far enough away that any ambiguous cases were too difficult for a
human reader to distinguish or understand."
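
For readers who have not run into it, the typedef problem that
Roskind's grammar works around is the classic one sketched below (my
own illustration, not his example): the very same token sequence is a
declaration or an expression depending on symbol-table information.

    typedef int a;      /* comment this line out and the line in f()
                           changes meaning (or stops compiling) */

    void f(void) {
        a * b;          /* with the typedef in scope: declares b as
                           "pointer to a"; without it: a multiplication
                           whose result is discarded (assuming a and b
                           were declared as variables) */
        (void)b;
    }

    int main() { f(); return 0; }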


Those problems in parsing C++ have partially spurred some of the new
parsing technologies, especially in the areas of backtracking parsers
and GLR parsing.


----------------------------------------------------------------------


Now, as I said, I think the root cause of the problem is *HAND-
WRITTEN* recursive-descent parsers, with the emphasis as I placed it
on hand-written. Recursive descent is a wonderful parsing technology,
especially in the ease of understanding and "debugging" the resulting
parser.


However, the problem is the tendency (particularly in hand-written
parsers) to hack a fix in close to the problem's point of discovery.
Local patches to grammars make for significant problems.


The only correct methodology, in my book, is to modify the grammar and
get it to pass through the parser generator without introducing
conflicts or ambiguity warnings. There may still be problems with the
grammar or parser. However, you have passed some mechanical checks
that say you haven't introduced any constructs that your tool believed
it couldn't handle--and thus it is likely that another writer will be
able to construct a grammar for their tool too.


----------------------------------------------------------------------


My final thought is about parsing technologies. Some of the advances
have come from trying to parse problem languages like C++. However,
other advances have come from trying to make the parsing process
itself simpler. Much of the extension work has been to reduce the
distance between the grammars one can write (but which are unusable)
and the grammars that the tools accept. The ideal goal is a parser
generator that can take "any" grammar, infer what language it is
intended to describe, and create a correct parser. That goal is
unattainable, but we can get closer with tools that allow us to parse
broader classes of languages and that make it easier to specify
exactly which language is desired.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark                    Internet : compres@world.std.com
Compiler Resources, Inc.       Web Site : http://world.std.com/~compres
3 Proctor Street               voice    : (508) 435-5016
Hopkinton, MA 01748 USA        fax      : (508) 435-4847 (24 hours)

