Re: The case of directly-executable parsers

"" <>
2 Feb 2006 11:36:01 -0500

          From comp.compilers

Related articles
The case of directly-executable parsers (2006-01-28)
Re: The case of directly-executable parsers (Derek M. Jones) (2006-01-28)
Re: The case of directly-executable parsers ( (2006-01-30)
Re: The case of directly-executable parsers (Clint Olsen) (2006-01-31)
Re: The case of directly-executable parsers (Ranjit Mathew) (2006-01-31)
Re: The case of directly-executable parsers (Karsten Nyblad) (2006-01-31)
Re: The case of directly-executable parsers ( (2006-02-02)
| List of all articles for this month |

From: "" <>
Newsgroups: comp.compilers
Date: 2 Feb 2006 11:36:01 -0500
References: 06-01-10406-01-141
Keywords: parse, performance
Posted-Date: 02 Feb 2006 11:36:01 EST

Kdarsten Nyblad wrote:
> wrote:
> > Directly-executable LR parsers are well-known and a number of
> > researchers have published amazing speedups over table-driver LR
> > parsers, in the 2x to 6x or more range, cf.:
> >
> > Achyutram Bhamidipaty and Todd A. Proebsting
> > "Very Fast YACC-Compatible Parsers (For Very Little Effort)
> >
> > R.N.Horspool and M. Whitney
> > "Even faster LR parsing"
> >
> > Thomas J. Pennello
> > "Very fast LR parsing"
> >
> > I have developed such a LALR(1) generator, which generates a
> > directly-executable parser in ISO C. Admittedly, the parser performs
> > only a few low-level optimizations - inverted table for non-terminal
> > transitions and separating the most frequent transition as a default.
> > However, the speedup I observe is nowhere near the previously reported
> > ones. Compared to a parser, generated by Bison 2.1 for the ISO C 99
> > grammar (taken straight from the standard, with one minor
> > modification), the directly-executable parser is only 5-9% faster on
> > P4 and about 12% faster on P3 Xeon, at the same time being about 3
> > times bigger.
> >
> > I assume I'm not actually doing anything stupid with the generated
> > parser. A sample of the generated code (preprocessed and indent(1)ed)
> > is here:
> >
> > What could be the reason for this parser performing so poorly? Or for
> > the Bison parser performing so well? Has anyone got some recent
> > observations with directly-executable vs. table driven recognizers -
> > scanning (re2c?), parsing, tree-matching?
> There are several possibilities. You have not written, which compiler
> you are using, so we can only guess. Your code is full of examples of
> obvious inefficiencies, e.g.,
> do
> {
> }
> while (0);

[You have probably missed it, I have given the compilers and
  the command line options used in another message in this
  very thread.]

This is an artefact from discarding (via -DNDEBUG) parser trace
code ("Entering state X. Next token is ..."). The compiler does
discard it.

> Such construct will most likely be removed by the optimizer, but they
> may confuse it such that the code does not get as good as it should be.
> It is a dangerous assumption to assume that a compiler generates good
> code for constructs human programmers would never write. It might be
> that your compiler can handle only so many loops or basic blocks before
> it gives up optimizing.

Not the case.

> You are manipulating the stack through a set of inlined procedures. You
> are passing the stack as a pointer to a record. First, are you sure the
> procedures are inlined?


> Secondly, the compiler might not be smart
> enough to realize that the stack as a variable is not truly aliased.
> There could be a severe performance hit because the compiler may think
> the stack is aliased and therefore put the top of the stack in main
> memory in stead of having it in registers. I would drop that library of
> stack operations and implement the stack without the use of a record
> structure and as local variables of the parse routine.

This indeed had an insignificant improvement (along with a fixed
size stack), but I think the improvement was from the lack of stack
limit checks.

> Then you are jumping around much more than is necessary, e.g.,
> goto symbol_387;
> // lots of code
> symbol_387:
> switch (state)
> {
> default:
> goto push_317;
> }
> First you are again assuming that the compiler will generate good code
> from weird constructs that human programmers will never write.

Indeed, I fully expect from the compiler to perform dead code
and simple jump optimizations. From research or toy compilers I do not
expect good performance anyway. Unconditional jump to uncoinditional
jump is trivial to optimize. Yes, the compiler does it.

> You are
> doing that with the switch statement with only a default case label.

The compiler emits a single jump, in the case it's not subsumed by
other optimization. I see no reason for it to be compiled to different

> Secondly, the compiler will have to be very smart in order to generate
> code such that you jump directly from the first goto to push_317. Why
> not substitute "goto push_317;" for "goto symbol_387;" and delete the
> code following "symbol_387:".

In the general case there can be many jumps to "symbol_N:", one for
each production having N as its left-hand side.

And, no, the compiler does not need to be very smart, this is a trivial
optimization and the compiler in fact performs it.

> Along the same line: From each block of code starting with a label like
> reduce_* you jump to some code starting with a label like symbol_*. Why
> not move the code of symbol_* to after the first piece of code jumping
> to it?

See above. And in the cases there are multiple jumps, it is up to
the compiler to decide whether to duplicate the code.

> Then there is your variable state. There is no need to assign to it if
> the non kernel of the state does not include items of productions very
> the right hand side is not the empty string. You do not need to push
> the state, because it will never be read. All you need to do is to
> increase the stack top pointer with 1 and let the stack top have an
> undefined value. It will never be read anyway.

I'm not sure if I understand this. Do you mean there's no need to
push the state number for states with no non-terminal transitions?
Indeed, I can do this (when discarding trace code anyway).

> Also, you are passing the get_token routine in a record. The optimizer
> cannot find out that it is the same routine that is called each time.
> It will have to generate code to dereference the ctx variable each time
> the routine is called. In most cases there is one input source, one
> output, and the parser is called once. I would call the lexer directly
> and only make support for multiple concurrent calls to the parser optional.

This is not an acceptible design.

> You call xg__stack_ensure in lots of places, but it is possible to
> optimize most of these away. Consider a DFA were you store all the read
> tokens in a stack, i.e., the highs of the stack is equal to the number
> of read symbols. If there is no loops in the DFA, then it is a simple
> search to find the maximum number of symbols that can be read. Thus you
> can always make sure the stack is big enough before running the DFA. If
> there are loops, then you can do with one check for each time the DFA
> run through all the states of the loop to ensure enough free space in
> the stack, e.g., assume that the numbers are states and 1->2 is a
> transition from state 1 to state 2. We do not care about which symbols
> are read:
> 1->2
> 2->3
> 3->4
> 4->5
> 5->2
> 5->6
> state 1 is the start state and state 6 is the accept state. You can do
> with checking the among of free space on the stack at the start and in
> one of the states 2, 3, 4, or 5. The same trick can be used in a LR(1)
> automaton. In your case it will most likely save you most of the
> procedure calls to xg__stack_ensure.

Thanks, I'll consider this one. However, I've tried a fixed size
stack, with no checks whatsoever and it improved the code
insignificantly. Thus, this also is cannot be a reason for 2x or
4x slowdown.

> A few more comments: First what about error recovery? A parser
> generator that generates parsers without error recovery has only so much
> use, and the error recovery can have many implications on how the
> directly executed parser is implemented.

I'm not interested in optimizing the case of syntax errors present
in the input. Also, I have no reasons to believe the current way of
generating the parser routine is an obstacle to error recovery,
see Bhamidipaty and Proebsting's paper.

> Secondly, you do not have any
> semantic actions on your reductions. All modern computers have cache
> between main storage and the processor. That cache will speed up code
> as long as the code can be in the cache. That has two implications.
> The first implication is that you current benchmarking really is not
> that interesting. It should wait until you have your semantic actions
> in place. The second implication is that if the table driven parser can
> be in the cache, but you directly executed cannot, then it is no
> surprise that the directly executed parser is not that fast.

This is a likely reason. Both the Bison generated and this parser fit
in L2 cache, but the Bison parser probably fits in L1 too (it's 9K text
+ rodata, ~300B data and bss).


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.