From: "Mike Dimmick" <mike@dimmick.demon.co.uk>
Newsgroups: comp.compilers
Date: 10 Apr 2001 01:27:39 -0400
Organization: Compilers Central
References: 01-03-095 01-03-122 01-04-026
Keywords: parse, LALR, LL(1), comment
Posted-Date: 10 Apr 2001 01:27:39 EDT
"Joachim Durchholz" <joachim_d@gmx.de> wrote in message
> Mike Dimmick <mike@dimmick.demon.co.uk> wrote:
> > An LALR(1) grammar tool produces its syntax trees from the
> > bottom up; it therefore processes its input keeping as many
> > possibilities in mind (the set of possible reductions) as it can, then
> > selecting one when the input becomes unambiguous. The introduction of
> > these semantic actions causes problems - the generator cannot know
> > which rule is to be reduced, so should it execute the action or not?
> > YACC treats the rule containing the action differently from those
> > without, I believe. This usually means including the action at the
> > equivalent point in all such rules, then backing out if it turns out
> > to be incorrect.
>
> Are you *sure* this is correct? yacc doesn't need to undo parsing steps,
> and it cannot undo semantic actions (they might delete a file or do
> other undoable things), so I'm having trouble to see how this should
> have entered yacc's design.
Sorry, my text was a little ambiguous. Specifically, with regard to
languages such as C++, which _require_ semantic actions in funny
places, backing out of an already _executed action_ may be necessary
(in the user's code, not in the parser's generated code).
Your assertion that Yacc doesn't need to undo parsing steps is _in
general_ true, for a properly defined LALR(1) grammar. However, in
the _specific_ case of C++ (and possibly some others? I don't know,
I've only tried to write a parser for C++) there are
declaration/expression conflicts - genuine ambiguities - which the
standard resolves as 'if it can be parsed as a declaration, it is a
declaration'. This requires backtracking of the parse, because the
parts of a declaration and the parts of an expression don't line up,
and the problem is discovered only at some deep point in the parse.
Any identifiers inserted into symbol tables will also have to be
removed accordingly, the scope restored, and so on.
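To make that concrete, here is a small self-contained example (the
names T, a, d and e are mine, and the operator+ exists only so the
fragment compiles) of the sort of statement where the 'it is a
declaration' rule bites:

    // Names T, a, d and e are invented; operator+ is only here so
    // that the expression line compiles.
    struct T { T() {} T(int) {} };
    int operator+(const T&, int i) { return i; }
    int a = 0;

    void f()
    {
        T(a);        // declaration of a local 'a' of type T (shadowing ::a),
                     // not a functional-style cast of the global 'a'
        T(a) + 1;    // expression: only the trailing '+ 1' rules out a
                     // declaration, and it arrives after all of 'T(a)'
        T(*d)(int);  // declaration: 'd' is a pointer to function(int) -> T
        T(e)[5];     // declaration: 'e' is an array of five T
    }

    int main() { f(); }

In each case the token that settles the question arrives only after
the whole of 'T(...)' has been read, and possibly much later, so a
bottom-up parser has already had to commit to a good deal of
structure by then.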
Similarly, there is a (moderately well documented; see Roskind or
Willink) ambiguity between an anonymous member bit-field whose type
is a 'class' or 'struct' elaborated-type-specifier, and a nested
class or struct with a base clause. The constructs are not completely
ambiguous, but they can only be distinguished at a depth of at least
six tokens, and in the worst case arbitrarily many.
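A rough sketch of that second case (all identifiers invented); only
the nested-class reading is shown as real code, because a bit-field
of class type is rejected later on, and then only on semantic
grounds:

    struct C { };
    struct S {
        class B : public virtual C { int x; };  // nested class B derived from C
    };

    // The same opening tokens 'class B :' could instead have begun an
    // anonymous bit-field whose declared type is the elaborated type
    // 'class B':
    //
    //     class B : some_constant_expression ;
    //
    // The parser cannot tell which construct it is in until it reaches
    // the '{' or the ';', and with a base clause such as
    // 'public virtual ::N1::N2::C' that is a long way past the colon.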
Yacc does not natively support backtracking, so the user must resort
to trickery with error rules and the token stream in order to perform
it. Other compiler generators do support backtracking directly (see
the notes on Yacc++ and btyacc below).
> Here's a question: does anybody know whether LL parsers are faster than
> LR parsers? I wouldn't expect any differences, but I never had the
> opportunity to run a direct comparison.
It may depend on how the LL(k) parser is implemented. Experimental
results by Terence Parr (sorry, I've lost the reference, but it was
probably either on www.antlr.org or www.polhode.com/pccts.html)
suggested that function-call overhead is actually extremely small now,
so a generated recursive-descent parser was faster than its equivalent
table-driven LL(k) parser. This may also have something to do with
the fact that the tables are either compressed (and therefore require
a certain amount of additional manipulation to find the correct value)
or very large (and therefore don't fit in the processor's cache).
Table-driven parsers' main loop tends to be fairly large too.
Indeed, the table-driven LL parser's prediction stack is almost
certain to be implemented on the heap as a linked list or dynamically
sized array, whereas the recursive-descent stack is the standard
program stack, and stack allocation tends to be a lot faster than
dynamic memory allocation.
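For illustration, here is a minimal recursive-descent parser (and
evaluator) for a toy expression grammar of my own devising - nothing
to do with Parr's tools - in which the 'prediction stack' is simply
the C++ call stack:

    #include <cctype>
    #include <cstdio>

    // Grammar:  expr   -> term (('+'|'-') term)*
    //           term   -> factor (('*'|'/') factor)*
    //           factor -> NUMBER | '(' expr ')'
    static const char *p;                     // input cursor

    static int expr();                        // forward declaration

    static int factor() {
        if (*p == '(') {                      // factor -> '(' expr ')'
            ++p;
            int v = expr();
            if (*p == ')') ++p;               // (no error handling)
            return v;
        }
        int v = 0;                            // factor -> NUMBER
        while (std::isdigit((unsigned char)*p))
            v = v * 10 + (*p++ - '0');
        return v;
    }

    static int term() {                       // term -> factor (('*'|'/') factor)*
        int v = factor();
        while (*p == '*' || *p == '/') {
            char op = *p++;
            int r = factor();
            v = (op == '*') ? v * r : v / r;
        }
        return v;
    }

    static int expr() {                       // expr -> term (('+'|'-') term)*
        int v = term();
        while (*p == '+' || *p == '-') {
            char op = *p++;
            int r = term();
            v = (op == '+') ? v + r : v - r;
        }
        return v;
    }

    int main() {
        p = "2*(3+4)-5";
        std::printf("%d\n", expr());          // prints 9
    }

One function per non-terminal, ordinary conditionals for the choices;
the only per-production cost is the call itself, which is exactly the
overhead Parr's measurements suggest is now very cheap.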
As for LL(k) versus LR(k) parsing, the LR parser must always maintain
more state than the LL parser. The reason for this is that the LL
parser makes its choice early, then proceeds through that single set
of choices (and does not need to maintain any prior tokens) while the
LR parser must always keep a stack of how far it has proceeded with
the current set of possible handles. It will depend, of course, on
whether the LL parser is actually building an AST or whether it is
just steering. The tables have to be bigger too: the LL(1) parser
needs a single table, indexed by the non-terminal on top of the
prediction stack and the lookahead token, which yields the production
to predict, whereas the LR parser needs both a GOTO table and an
ACTION table indexed by current state and look-ahead token (for any
method more complicated than LR(0)). This of course increases the
chances that it won't all fit into the processor's cache - some tens
or hundreds of kilobytes is typical, depending on the number of states
generated.
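As a rough sketch of the difference in shape (the dimensions below
are invented, purely to give an order of magnitude for a medium-sized
grammar):

    #include <cstdio>

    // Invented dimensions, for scale only.
    enum { NONTERMS = 100, TERMS = 80, STATES = 900 };

    // Table-driven LL(1): one table, indexed by the non-terminal on top
    // of the prediction stack and the lookahead terminal; each entry
    // names the production to predict (or marks an error).  ~16 KB here.
    short ll_predict[NONTERMS][TERMS];

    // LALR(1)/LR(1): an ACTION table (shift/reduce/accept/error plus a
    // target state or rule number) indexed by state and lookahead
    // terminal, and a GOTO table indexed by state and non-terminal.
    // Roughly 280 KB and 176 KB respectively here, before compression.
    struct Action { char kind; short arg; };
    Action lr_action[STATES][TERMS];
    short  lr_goto[STATES][NONTERMS];

    int main() {
        std::printf("LL: %zu bytes, LR: %zu bytes\n",
                    sizeof ll_predict, sizeof lr_action + sizeof lr_goto);
    }

Real generators compress these tables heavily, which is precisely the
'additional manipulation' mentioned earlier; either way you pay,
whether in cache misses or in decoding.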
Hunter (1981, p120) refers to a study by Cohen and Roth (1978) in
which they compared maximum, minimum and average times taken to parse
sentences using LL and SLR parsers. In the context of the DEC PDP-10,
LL turned out to be faster by around 50%. It appears from context
that this may have been a table-driven LL parser.
There are also opportunities for ordering the functions in a
recursive-descent parser so that the most commonly touched features
appear on the same page, while less used features are on a different
page, which will (probably) improve performance on virtual memory
systems.
> There are other differences that might be more relevant. For
> example, LL parsers have been consistently reported to produce
> better error reporting. Does anybody know whether that's a historic
> accident, or is there something in LL parsing that makes it
> inherently easier to add useful error actions to LL parsers? (I
> don't mean the PCCTS error handling mechanism - it's useful but it
> didn't look LL-specific to me.)
[Acknowledgement: my source here is _Modern Compiler Design_, by
Grune, Bal, Jacobs and Langendoen, published by Wiley, 2000, ISBN
0-471-97697-0. Pages 138-142 and 175-178 are most relevant to this
section. Thanks also to Ian Darling for lending me this book!]
The error handling mechanism for most LL parser generators is based on
acceptable sets. This is a framework for constructing a safe recovery
method (a sketch in code follows the list) and consists of three steps:
1. Construct an 'acceptable set' (call it 'A') from the state of the
parser, using an algorithm 'C'; 'A' must contain end-of-file;
2. Discard tokens from the input until a token tA from A is found;
3. Resynchronise the parser by advancing until it arrives in a state
in which it consumes tA, using a suitable algorithm R - this
prevents looping.
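A much-simplified sketch of that framework in an LL/recursive-descent
setting - the token names and the contents of A are invented, and a
real generator derives A (algorithm C) from the prediction stack and
FOLLOW sets rather than hard-coding it:

    #include <cstdio>
    #include <set>

    enum Token { TK_EOF, TK_SEMI, TK_RBRACE, TK_IDENT };

    // A stand-in token stream so the sketch is self-contained.
    static Token input[] = { TK_IDENT, TK_IDENT, TK_SEMI, TK_EOF };
    static int   pos = 0;
    static Token lookahead = input[0];
    static Token nextToken() { return input[++pos]; }

    // Step 1: the acceptable set A.  Hard-coded here; end-of-file is
    // always a member, so the skipping loop below must terminate.
    static std::set<Token> acceptableSet() {
        return { TK_EOF, TK_SEMI, TK_RBRACE };
    }

    // Step 2: discard tokens until one in A appears.  Step 3 is then
    // carried out by the parser proper, which predicts or inserts until
    // it can consume that token, so it cannot loop without advancing.
    static void recover() {
        const std::set<Token> A = acceptableSet();
        while (A.count(lookahead) == 0)
            lookahead = nextToken();
    }

    int main() {
        recover();
        std::printf("resynchronised on token %d\n", (int)lookahead);
    }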
The current best form appears to be based on _continuations_.
Consider that the error has occurred due to a premature end of file.
The rest of the parse tree must therefore be constructed from no input
whatsoever. A sequence of terminals that fulfils the predictions on
the stack is called a _continuation_ of that stack. A continuation
can be produced by replacing each non-terminal on the stack by a
terminal production of it. There are infinitely many ways to do this,
but for convenience we choose the shortest one. The shortest
continuation can be computed ahead of time from the grammar, giving
the parser the recovery information it needs at (practically) any
given point.
Step 2 then becomes 'zero or more tokens from the input are skipped,
until a token in A is met.' Step 3 continues with a slightly modified
parser - it first tries the usual predict or match move. If this
succeeds the normal parser can take over again, otherwise the shortest
path is predicted for a non-terminal on top of the prediction stack,
or the appropriate terminal is inserted.
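The 'shortest way out' can be precomputed once from the grammar.
Here is a sketch (the grammar encoding is invented) of the usual
fixed-point computation of the minimal terminal derivation length for
each non-terminal:

    #include <climits>
    #include <cstdio>
    #include <vector>

    // Invented encoding: a grammar symbol is a terminal (>= 0) or a
    // non-terminal i encoded as -(i + 1).
    struct Rule { int lhs; std::vector<int> rhs; };

    // minLen[i] = length of the shortest terminal string derivable from
    // non-terminal i; this is what lets recovery choose the cheapest
    // continuation when it has to fabricate input.
    std::vector<int> shortestDerivations(int nonterms,
                                         const std::vector<Rule>& rules) {
        std::vector<int> minLen(nonterms, INT_MAX);
        for (bool changed = true; changed; ) {          // fixed point
            changed = false;
            for (const Rule& r : rules) {
                long len = 0;
                for (int sym : r.rhs) {
                    if (sym >= 0) len += 1;             // terminal: one token
                    else if (minLen[-sym - 1] == INT_MAX) { len = INT_MAX; break; }
                    else len += minLen[-sym - 1];
                }
                if (len < minLen[r.lhs]) { minLen[r.lhs] = (int)len; changed = true; }
            }
        }
        return minLen;
    }

    int main() {
        // Toy grammar:  S -> '(' S ')'  |  'x'     (S is non-terminal 0)
        std::vector<Rule> g = { { 0, { '(', -1, ')' } }, { 0, { 'x' } } };
        std::printf("%d\n", shortestDerivations(1, g)[0]);   // prints 1
    }

With those lengths to hand, the modified parser can always predict
the cheapest production and so fabricate as few tokens as possible.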
LR recovery is trickier, because there are two things to repair - the
input and the stack. We would like to avoid modifying the stack if possible,
and this can be done using the acceptable set technique of
continuations, but the algorithm for LR parsers is reportedly much
more complicated and therefore error recovery tends to modify the
stack. This means discarding some of the processed input and its
corresponding parse tree.
The typical LR method is illustrated by Yacc. Some of the
non-terminals in the grammar may be marked as error-recovering, by
giving them a production containing the 'error' token. When an error
occurs - the top of the stack holds some state sX, the input starts
with some token tX, and the corresponding ACTION table entry is empty
(neither a shift nor a reduce) - recovery proceeds by popping entries
from the stack until a state that can shift the dummy 'error' token
is reached. The 'error' token is then shifted, along with the state
reached by shifting it. Input is then skipped until a token is found
that has a valid ACTION entry in that state, i.e. one that may
legally follow the rule in which error-recovery was enabled.
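In outline - this is a hand simulation of the mechanism just
described, not Yacc's actual code, and the tables and token codes are
invented stand-ins for what the generator emits - the recovery loop
looks something like this:

    #include <cstdio>
    #include <vector>

    enum ActKind { SHIFT, REDUCE, ACCEPT, ERR };
    struct Act { ActKind kind; int target; };

    const int TK_EOF = 0, TK_ERROR = 1, TK_SEMI = 2;

    // Stand-in ACTION table: state 0 can shift 'error' (into state 1),
    // and state 1 can then shift only ';'.
    Act ACTION(int state, int token) {
        if (state == 0 && token == TK_ERROR) return { SHIFT, 1 };
        if (state == 1 && token == TK_SEMI)  return { SHIFT, 2 };
        return { ERR, 0 };
    }
    bool canShiftError(int state) { return ACTION(state, TK_ERROR).kind == SHIFT; }

    int input[] = { 9, 9, TK_SEMI, TK_EOF };   // junk, junk, ';', end of file
    int pos = 0;
    int nextToken() { return input[++pos]; }

    void recoverLR(std::vector<int>& states, int& lookahead) {
        // 1. Pop states - discarding their partial parses - until one
        //    that can shift the dummy 'error' token is on top.
        while (!states.empty() && !canShiftError(states.back()))
            states.pop_back();
        if (states.empty()) return;            // unrecoverable

        // 2. Shift 'error' as though it were an ordinary token.
        states.push_back(ACTION(states.back(), TK_ERROR).target);

        // 3. Skip input until a token this state has an ACTION for, i.e.
        //    one that may legally follow the rule containing 'error'.
        while (lookahead != TK_EOF &&
               ACTION(states.back(), lookahead).kind == ERR)
            lookahead = nextToken();
    }

    int main() {
        std::vector<int> states = { 0, 3, 4 };  // 3, 4: no error rule in scope
        int lookahead = input[0];
        recoverLR(states, lookahead);
        std::printf("resume in state %d on token %d\n",
                    states.back(), lookahead);  // state 1, token ';' (2)
    }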
The important point to note in all that is that _typical_ LR recovery
tends to discard correctly-parsed information, which can therefore
mislead later stages of compilation and thereby give ambiguous or just
plain incorrect error messages. The placement of error-recovery rules
must be considered carefully too. The more accurate technique that
does _not_ discard information is (or seems to be) considered too
costly to implement.
To report an error that is more helpful than Yacc's default 'parse
error' - gee, thanks a bunch - one must attach an action to the error
rule (and suppress Yacc's default report) which produces a 'better'
message. However,
there is of course a trade-off - you either have to implement
error-handling non-terminals at practically every point at which an
error could occur, or you have to put up with a lesser granularity of
error detection and reporting. This essentially leads to greater rule
duplication as well, which causes maintenance problems.
One technique that may work (see Willink) is to perform a superset
parse, then treat the errors (that were absorbed in the parse) in a
later semantic handling stage.
The corresponding LL technique, however, is fairly painless to
implement. So I suppose the answer has to be that it is _not_
historical accident that LL parsers are reported to produce better
error handling - it is that the appropriate algorithm is far simpler.
> [Yacc and other LALR parsers don't backtrack, the state machine in the
> parser implicitly tracks ambiguous parses, but they all have to be
> resolved to something unambiguous before anything reduces. I have
> seen an extended yacc with explicit backtracking that someone wrote
> to parse C++. -John]
I think you may be thinking of Yacc++, which does perform
backtracking. Unfortunately (for me, I can't afford it!) it's a
commercial product. It also features semantic predicates in much the
same fashion as ANTLR/PCCTS, I'm told. This confines the
classification of identifiers (I'm sure the feature has other uses,
just not ones I've encountered) to the specific places within the
grammar where it is needed, and to only the classifications required
at that point, which reduces duplication.
I will go back to my original statement, though somewhat simplified -
a programming language should probably be developed, insofar as is
possible, such that _any_ practical parsing technique may be used to
parse the language. Hence my suggestions on another newsgroup
(comp.compilers.tools.pccts) that the C and C++ syntaxes should not be
used to form a base for any further languages. It is way too hard to
write a parser for them - either manually or with a parser generator.
If you want to be able to develop simple tools for a language, ones
that are easily intelligible and extensible, keep your language LL(1)
(since every LL(1) grammar is also an LR(1) grammar, LR-based tools
can handle it too).
Going back to the original question posed - how can one write a
compiler-compiler-compiler? I now see that the poster was effectively
asking, how can I port my compiler generator to a new language? The
answer would have to be to get the compiler generator written in
itself as far as possible, and to have a strong, immutable interface
between an intermediate representation and a back end that translates
that representation into the target language. Porting the compiler
generator to the new language would then mean rewriting the parts the
generator cannot reach so that they produce the new language (a step
obviously necessary to generate the new language at all), running the
compiler generator through itself, rewriting the back end in the new
language, and finally compiling the lot together with an existing
compiler for the new language.
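In code terms the 'strong immutable interface' need not be anything
grander than this (an invented sketch; a real generator's IR would of
course carry its actual tables and actions):

    #include <string>

    // Invented sketch.  The IR is whatever the generator's front end
    // produces from a grammar; the contents here are placeholders.
    struct IR {
        std::string tables;
        std::string actions;
    };

    // The one boundary that must not move between ports: a back end
    // turns the IR into source text in some target language.
    class BackEnd {
    public:
        virtual ~BackEnd() = default;
        virtual std::string emit(const IR& ir) = 0;
    };

    // Porting starts with a new derived class, written in the old host
    // language, that emits the new language.
    class NewLanguageBackEnd : public BackEnd {
    public:
        std::string emit(const IR& ir) override {
            return "// generated parser in the new language\n" + ir.tables;
        }
    };

    int main() {
        NewLanguageBackEnd be;
        IR ir{ "/* tables */", "/* actions */" };
        return be.emit(ir).empty();   // just exercises the interface
    }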
I would recommend _this_ process only after one has worked with the
new language for a moderately significant amount of time. Doing it
simultaneously with the language development is guaranteed to cause
you problems trying to work out whether the problem is in the parser
specification or the back-end implementation (of either the compiler
generator or the compiler itself!).
--
Mike Dimmick
References:
Roskind, 1991: this grammar is still available on the comp.compilers
archive, I think, including Roskind's own comments on it.
Willink, 1999: http://www.computing.surrey.ac.uk/research/dsrg/fog/ ;
if you need to backtrack in Yacc, pp. 270-288 of the (draft) thesis
give an indication of how to perform it.
Hunter, 1981: _The Design and Construction of Compilers_. Chichester:
Wiley, ISBN 0-471-28054-2.
[The backtracking yacc I was thinking of is btyacc. See message 99-07-072
in the archives. -John]