Related articles
Compiler Books? vicky7909@rediffmail.com (2003-10-27)
Re: Compiler Books? vbdis@aol.com (2003-10-31)
Re: Compiler Books? Parsers? napi@cs.indiana.edu (2003-11-01)
Re: Compiler Books? Parsers? napi@cs.indiana.edu (2003-11-01)
Re: Compiler Books? Parsers? henry@spsystems.net (2003-11-02)
Re: Compiler Books? Parsers? henry@spsystems.net (2003-11-08)
Re: Compiler Books? Parsers? Jeffrey.Kenton@comcast.net (Jeff Kenton) (2003-11-21)
Re: Compiler Books? Parsers? cfc@shell01.TheWorld.com (Chris F Clark) (2003-12-03)
RE: Compiler Books? Parsers? qjackson@shaw.ca (Quinn Tyler Jackson) (2003-12-03)
Re: Compiler Books? Parsers? rbates@southwind.net (Rodney M. Bates) (2003-12-08)
Re: Compiler Books? Parsers? nick.roberts@acm.org (Nick Roberts) (2003-12-08)
Re: Compiler Books? Parsers? marcov@stack.nl (Marco van de Voort) (2003-12-20)
Re: Compiler Books? Parsers? cfc@world.std.com (Chris F Clark) (2003-12-21)
Re: Compiler Books? Parsers? cdc@maxnet.co.nz (Carl Cerecke) (2003-12-23)
[6 later articles]
From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 3 Dec 2003 19:40:57 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 03-10-113 03-10-145 03-11-010 03-11-083
Keywords: tools
Posted-Date: 03 Dec 2003 19:40:57 EST
Mohd Hanafiah Abdullah asked:
> I have never used parser generators such as yacc/bison or antlr.
> Would be nice to know people's opinion on manual implementation vs
> automation of parsers, especially from those who have done both; in
> terms of efficiency, readability, maintainability,
> ease-of-development, and so forth.
Jeff Kenton replied:
> 1. Efficiency: someone I work with, who's often right, claims that you
> can always write a parser by hand that's faster than what yacc and lex
> produce. Other people claim exactly the opposite.
(Preface: my responses are *very* biased as I am the author of a tool
which is "actively" marketed!)
This is trivially true, if one considers the fact that one can take the
output of any given tool and then profile and modify it so that the
resulting hand-written tool is faster.
However, the reverse also holds: if someone figures out how to make
a specific tool generate faster output, you may have to rearchitect
your hand-tweaked code to take advantage of those improvements.
Note that this doesn't happen very often (in 15 years of working on
a tool, there were approximately four times where we specifically
worked on speeding up the generated parsers).
All that being said, generally hand-written tools can be faster
than machine generated ones (provided you know what you are doing).
The difference in execution speed may or may not be significant.
If you don't know what you are doing, machine generated ones are
likely just as fast. More importantly, this can be a case of
premature optimization. Why hand-craft a fast lexer or parser if
that's not the bottleneck in your application? You would be better
off measuring the real bottleneck and fixing that!
As evidence of the above, in several years of using automated tools
to build a variety of projects for clients, I have had only two
cases where adjusting the lexing/parsing code made a significant
impact on the client application, and these applications (including
the ones where it didn't matter) were essentially parsers. Most of
the time, fixing up malloc/free and other low-level issues was more
important.
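To make the "measure first" point concrete, here is a minimal
sketch of the kind of timing check I mean before deciding the
parser is worth hand-tuning. The phase functions are placeholders
invented for illustration; substitute your real front end and back
end:

    #include <stdio.h>
    #include <time.h>

    /* Placeholder phases -- substitute the real front end and back end. */
    static void parse_file(const char *path) { (void)path; /* lex + parse */ }
    static void process_tree(void)           { /* semantic work, codegen */ }

    int main(int argc, char **argv)
    {
        const char *input = (argc > 1) ? argv[1] : "test.v";

        clock_t t0 = clock();
        parse_file(input);
        clock_t t1 = clock();
        process_tree();
        clock_t t2 = clock();

        fprintf(stderr, "parse: %.3fs  everything else: %.3fs\n",
                (double)(t1 - t0) / CLOCKS_PER_SEC,
                (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }

If the parse column is a small fraction of the total, hand-tuning
the lexer or parser is effort spent in the wrong place.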
> 2. Readability and maintainability: probably goes to the tools,
> although some of that is superficial. There's often a lot of magic
> hidden in the details that isn't as obvious on second read as you
> thought it was the first time through.
>
> 3. Ease of development: here's where I dislike lex and yacc. The
> actual code produced is impossible to read, and it's really painful to
> debug when you get it wrong. Furthermore, you always have to invent
> trickery to handle certain cases. It's often at the boundary between
> lexing and parsing, or between the parser and whatever calls it, where
> you start to think the tools are hurting you more than they're
> helping.
These are the real issues. Your end goal should probably be to
produce a correct lexer/parser for your target language with the
minimum effort, so you can expend that effort on some other part of
the tool.
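As an aside, a classic instance of the boundary trickery Jeff
mentions is lexer feedback: the lexer consults a symbol table that
the parser maintains, so the same spelling can come back as
different token kinds. The sketch below is purely illustrative
(none of it is from our product; all names are invented):

    #include <stdio.h>
    #include <string.h>

    enum token { TOK_IDENT, TOK_TYPE_NAME };

    /* Tiny symbol table the *parser* fills in as declarations are reduced. */
    #define MAX_TYPES 64
    static char type_names[MAX_TYPES][32];
    static int  n_types = 0;

    /* Called from parser actions when a type declaration is recognized. */
    void parser_declare_type(const char *name)
    {
        if (n_types < MAX_TYPES)
            strncpy(type_names[n_types++], name, 31);
    }

    /* The lexer has to peek at parser state to classify an identifier. */
    enum token classify_identifier(const char *spelling)
    {
        for (int i = 0; i < n_types; i++)
            if (strcmp(type_names[i], spelling) == 0)
                return TOK_TYPE_NAME;
        return TOK_IDENT;
    }

    int main(void)
    {
        parser_declare_type("my_bus_t");
        printf("%d %d\n",
               classify_identifier("my_bus_t"),   /* 1: TOK_TYPE_NAME */
               classify_identifier("count"));     /* 0: TOK_IDENT */
        return 0;
    }

This kind of back-channel is exactly the sort of thing that is easy
to scribble into a hand-written parser and awkward to express in
some tools.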
A real-life example is particularly illustrative here. I work with
a group that builds a Verilog compiler for use by chip designers.
Verilog is a typically nasty language, with a C-like (but not C)
preprocessor, places where whitespace is important and places where
it isn't, messy identifier and numeric rules, and what have you.
In our product, we have at least two different Verilog parsers.
One is hand-written and one is generated by a tool.
The hand-written one was borrowed from a previous product and has
slowly gathered more frills, knots, and warts. It parses only a
subset of Verilog (although the subset continually grows) and
silently skips any text it can't parse. That form of error
recovery is important for it, because it is used to parse texts
that may or may not be correct Verilog (e.g. code still being
written, text documents that aren't intended to be Verilog, binary
files, what have you). Its goal is to extract as much valid
Verilog as it can out of said documents, syntax highlight them, and
do some other "trivial" tasks--some of which are not trivial at
all. It has taken a couple of person-years to get it to parse what
it does as correctly as it does, not counting the time to initially
write the precursor version.
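To give a flavor of what "silently skips any text it can't parse"
means, here is a purely illustrative sketch (not our actual code)
of panic-mode recovery in a hand-written parser: on a syntax error
it discards tokens until something that could start a new item
appears, then tries again. The toy grammar and the canned token
stream standing in for a lexer are invented for the example:

    #include <stdio.h>

    /* Illustrative token kinds; a real front end has many more. */
    enum tok { T_MODULE, T_IDENT, T_ENDMODULE, T_JUNK, T_EOF };

    /* A canned token stream standing in for the lexer: one good module,
     * some garbage, then another good module. */
    static enum tok stream[] = {
        T_MODULE, T_IDENT, T_ENDMODULE,
        T_JUNK, T_JUNK, T_IDENT,                 /* unparsable text */
        T_MODULE, T_IDENT, T_ENDMODULE,
        T_EOF
    };
    static int pos;
    static enum tok lookahead;

    static void advance(void) { lookahead = stream[pos++]; }

    /* Try to parse "module IDENT endmodule"; return 0 on failure. */
    static int parse_module(void)
    {
        if (lookahead != T_MODULE) return 0;
        advance();
        if (lookahead != T_IDENT) return 0;
        advance();
        if (lookahead != T_ENDMODULE) return 0;
        advance();
        return 1;
    }

    int main(void)
    {
        int accepted = 0;
        advance();
        while (lookahead != T_EOF) {
            if (parse_module()) {
                accepted++;
            } else {
                /* Panic mode: silently discard tokens until something
                 * that could start a new module (or EOF) comes along. */
                while (lookahead != T_MODULE && lookahead != T_EOF)
                    advance();
            }
        }
        printf("recovered %d well-formed module(s)\n", accepted);
        return 0;
    }

The real parser's recovery is far more elaborate, but the principle
is the same: never give up, just resynchronize and keep extracting.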
The machine generated one was created by transliterating the 1995
IEEE spec into the dialect that the generation tool uses and making
some adjustments to account for the hard-to-parse parts and for
local dialect extensions. In addition, the language of the tool
was used to describe the resulting tree we wanted built. A second
pass was added to translate the tree that the generator
automatically builds into a different representation used by the
"back end" of our product. The initial coding of the grammar took
under a month, and since then we have made only a few changes to
it, on the order of two more person-months. However, the
translation phase to the second representation was a far larger
effort and has consumed on the order of one to two person-years.
Only about six months ago did we feel that the translation was
fully stable.
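For readers who haven't built such a second pass, here is a minimal
sketch of its shape: the generator hands back a generic tree of
tagged nodes, and a recursive walk lowers it into the back end's
own structures. Every type and function name below is invented for
illustration; the real pass is of course enormously larger:

    #include <stdio.h>

    /* Generic AST as a parser generator might build it: a tag plus children. */
    enum ast_tag { AST_MODULE, AST_ASSIGN, AST_IDENT };

    struct ast {
        enum ast_tag tag;
        const char  *text;          /* identifier spelling, if any */
        struct ast  *child[2];      /* at most two children in this toy */
    };

    /* Back-end representation with its own invariants, not tree-shaped. */
    struct ir_assign { const char *lhs; const char *rhs; };

    struct ir_module {
        const char       *name;
        struct ir_assign  assigns[16];
        int               n_assigns;
    };

    /* The "second pass": walk the generic tree, build the back-end form. */
    static void lower_assign(const struct ast *a, struct ir_module *m)
    {
        struct ir_assign *out = &m->assigns[m->n_assigns++];
        out->lhs = a->child[0]->text;
        out->rhs = a->child[1]->text;
    }

    static void lower_module(const struct ast *a, struct ir_module *m)
    {
        m->name = a->text;
        m->n_assigns = 0;
        for (int i = 0; i < 2 && a->child[i]; i++)
            if (a->child[i]->tag == AST_ASSIGN)
                lower_assign(a->child[i], m);
    }

    int main(void)
    {
        /* A tiny tree standing in for what the generator would build for
         *   module top; assign a = b; endmodule                          */
        struct ast b   = { AST_IDENT, "b", { 0, 0 } };
        struct ast a   = { AST_IDENT, "a", { 0, 0 } };
        struct ast asn = { AST_ASSIGN, 0, { &a, &b } };
        struct ast top = { AST_MODULE, "top", { &asn, 0 } };

        struct ir_module m;
        lower_module(&top, &m);
        printf("module %s: %s = %s\n", m.name, m.assigns[0].lhs, m.assigns[0].rhs);
        return 0;
    }

The grammar work was the easy month; the lowering is where the
person-years went, because that is where the semantics live.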
At this point, I'm willing to draw the following conclusions:
1) The machine generated parser is generally more reliable, more
robust, and easier to maintain than the hand-written one.
Moreover, no one is completely sure what dialect the hand-written
one actually parses. In addition, the hand-written one has spawned
numerous spin-offs, where part of the hand-written one was cloned
and reused in other parts of the product (to parse other small
fragments of the input).
However, the hand-written one *DOES* do a far better job of
responding to ill-formed texts. Also, the architect of the
hand-written one is far more comfortable with it and finds its code
simpler to follow. Most parser generators do a good job of
generating parsers which handle correct input texts, or texts with
a small number of errors that the user needs to correct (and even
there, certain lexical errors typically destroy the parser's
ability to recover). They do not do a good job of generating
parsers that can pick fragments of correct code out of a sea of
garbage, nor at any other task they weren't designed for. But note
that some parser generators (e.g. Quinn Tyler Jackson's Meta-S) are
now beginning to support a grep mode, which may solve that problem.
2) The machine generated one was easier to get to a high level of
functioning first. For complicated languages it is easier to find
the right hacks to apply to a machine generated parser than it is
to write the same correct code by hand. If we had been able to use
the parse (AST) tree coming out of the parser as our internal
representation, the parsing part of the project would have been
done in a trivial amount of time.
However, a parse tree is often not the correct internal
representation of the input text. It often takes a lot of work to
generate a data structure that has the correct semantics
(especially when it isn't clear what the correct semantics
are--several of our internal representations have had major shifts
in how they model the semantics, to reflect our evolving
understanding of how to map the input text into a working abstract
model). As far as I can tell, parser generators have not
eliminated that work. That said, we did not use a tool with
attribute grammar support (such as Eli), and perhaps some of those
issues would have been mitigated by that. Still, the hard part is
often interfacing the output of the parser generator with parts of
the project that may have stringent requirements that aren't
"compatible". However, my impression is that these tasks are
intrinsically hard and that hand-writing the parser would not have
mitigated them (and would have made them harder, since a
hand-written parser does not build an AST representation
automatically, and that was at least a stepping stone). I know
that's true for the project I'm working on. In fact, I want more
automation, not less.
I have ideas (drawn from my practical experience) that will be
added to the tool over time. Some of them will improve that
situation. None of them are likely to be magic wands.
3) The output of a tool (and its associated library, if any) is more
opaque than hand-written code. Moreover, the more a tool enables
you to do (and worse, enables you to do efficiently), the worse
that problem gets. Doing a task takes code. Doing a generalized
version of the same task takes even more code. Doing an efficient
generalized version of the same task takes still more code. Doing
that and also including options for tasks that aren't being done
takes even more...
I have not yet seen a tool that generates totally simple code and
also handles all the complicated cases easily. Again, I have some
ideas here (on how to make the output more tailorable, so that
simple problems get simpler generated code), but the problem is
intrinsically difficult. The two goals are polar opposites; you
can't get both at the same time. The best one can hope for is a
balance that is right for one's own project.
That said, having already (mostly?) debugged code in the library
that implements some of the things I will need (e.g. nested
include-file support, AST construction, i18n error messages) lets
me build most things faster and spares me debugging my own typos as
I reimplement them. That weighs heavily in favor of a tool (and
library) in my book. I'll take a little opaqueness to get that.
Some relevant statistics: I cannot recall the last time I debugged
the *code* generated by the tool. I debug approximately two
library problems a year (across the entire tool community's user
base). I debug approximately four "table generation" problems per
year, mostly dealing with features in the tool that aren't in the
underlying theoretical model. Most of the bugs are both obscure
(happening in corners of the feature set) and obvious (when the
failure happens the tool or the generated application fails in a
dramatic way). There has so far been only one bug in the tool that
has affected the project I was describing above. There was also
only one case where the grammar did not describe the language
correctly and the grammar itself needed to be debugged.
Recommendations:
If you are doing something trivial, or need some obscure ability
not supported by any tool you can find, don't use a tool. We will
probably never replace the hand-written parser in our product with
a machine generated one (although we consider it from time to time,
especially when we need to parse in a new context), because we
don't trust the tool we use to gracefully implement grep-style,
maximal-munch, correct-fragments-only parsing. In addition,
sometimes when I have to implement a small state machine, I will do
so by hand, especially in embedded contexts where firing up a whole
object created by the tool would be expensive at runtime and would
provide no explicit benefit.
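The sort of small hand-written state machine I mean is on the order
of the following sketch (illustrative only, not from any product):
a handful of states in a switch, no tables, nothing to instantiate:

    #include <ctype.h>
    #include <stdio.h>

    /* A few-state scanner: count identifiers and numbers in a byte
     * stream.  Small enough that a generated object buys nothing. */
    enum state { S_START, S_IDENT, S_NUMBER };

    static void scan(const char *p, int *idents, int *numbers)
    {
        enum state s = S_START;
        *idents = *numbers = 0;

        for (;; p++) {
            int c = (unsigned char)*p;
            switch (s) {
            case S_START:
                if (isalpha(c) || c == '_')      s = S_IDENT;
                else if (isdigit(c))             s = S_NUMBER;
                break;
            case S_IDENT:
                if (!isalnum(c) && c != '_') { (*idents)++;  s = S_START; }
                break;
            case S_NUMBER:
                if (!isdigit(c))             { (*numbers)++; s = S_START; }
                break;
            }
            if (c == '\0')
                break;
        }
    }

    int main(void)
    {
        int idents, numbers;
        scan("assign a = b + 42;", &idents, &numbers);
        printf("%d identifiers, %d numbers\n", idents, numbers);
        return 0;
    }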
However, if you have a non-trivial language that may grow and/or
change, get the best (most automated) tool you can. It *will* save
you work, especially if that tool goes beyond simple lexing and
parsing to AST construction (or other features relevant to your
application). For example, ask someone about JavaCC with JJTree or
JTB, or ask someone about ANTLR (aka PCCTS ver 2) or about PCCTS
ver 1 and Sorcerer. There are newsgroups for both tools
(comp.compilers.tools.javacc and comp.compilers.tools.pccts), so it
shouldn't be hard to get info and opinions, with the one caveat
that many of the posters are "new" users (e.g. students or
experimenters doing projects, who currently have questions), so it
is hard (but not impossible) to get a response from a seasoned
user. Most seasoned users have simply finished their parsing part
and moved on to other tasks--and that's the point.
Hope this helps,
-Chris
*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
19 Bronte Way #33M voice : (508) 435-5016
Marlboro, MA 01752 USA fax : (508) 251-2347 (24 hours)