Re: The case of directly-executable parsers

"" <>
30 Jan 2006 02:01:00 -0500

          From comp.compilers

Related articles
The case of directly-executable parsers (2006-01-28)
Re: The case of directly-executable parsers (Derek M. Jones) (2006-01-28)
Re: The case of directly-executable parsers (2006-01-30)
Re: The case of directly-executable parsers (Clint Olsen) (2006-01-31)
Re: The case of directly-executable parsers (Ranjit Mathew) (2006-01-31)
Re: The case of directly-executable parsers (Karsten Nyblad) (2006-01-31)
Re: The case of directly-executable parsers (2006-02-02)
| List of all articles for this month |

From: "" <>
Newsgroups: comp.compilers
Date: 30 Jan 2006 02:01:00 -0500
References: 06-01-104 06-01-109
Keywords: parse, performance
Posted-Date: 30 Jan 2006 02:01:00 EST

Derek M. Jones wrote:
> Velco,
> > What could be the reason for this parser performing so poorly? Or for
> > the Bison parser performing so well? Has anyone got some recent
> > observations with directly-executable vs. table driven recognizers -
> > scanning (re2c?), parsing, tree-matching?
> The generated code contains lots of jumps. I am guessing that the
> instruction pipeline never gets to crank up before it is flushed; also,
> all those jumps must be very bad for lookahead prediction, and I dread
> to think about the cache swapping that must be going on.
> In the "good old days" jumps did not have such a big impact on
> performance (well, at least on many processors). So I am guessing
> that the excessive number of jumps is skewing the performance
> characteristics, which did not happen on the processors used for the
> previous timing comparisons.

Yeah, that sounds reasonable. It probably also explains why the
speedup is a bit better on the PIII Xeon.

> Also, gcc does not seem to be consistent in optimizing those switch ()
> { case L: goto Q; ... } constructs. Sometimes it turns the switch into
> a nested if, other times it populates the jump table with the
> destination addresses of the gotos.
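For readers who haven't seen it, the construct in question can be
sketched like this (a hypothetical example -- state names and grammar
invented, not taken from either parser generator). Each state's code
ends by setting the next state and re-entering the switch, which gcc
may lower to a jump table or to a nested if:

```c
#include <assert.h>

/* Hypothetical sketch: a directly-executable recognizer for
   "one or more digits", using the switch () { case L: goto Q; }
   dispatch discussed above. */
enum state { S_START, S_INT, S_DONE };

static int match_int(const char *s)  /* 1 if s is only digits, else 0 */
{
    enum state st = S_START;
    for (;;) {
        switch (st) {              /* dispatch to the current state's code */
        case S_START: goto start;
        case S_INT:   goto in_int;
        case S_DONE:  goto done;
        }
    start:
        if (*s < '0' || *s > '9') return 0;
        s++; st = S_INT;
        continue;
    in_int:
        if (*s >= '0' && *s <= '9') s++;
        else st = S_DONE;
        continue;
    done:
        return *s == '\0';
    }
}
```

A hand-written direct parser would simply goto the next label; the
switch indirection appears when the next state is computed at run time.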

I have also implemented a kind of "switch optimizer". After removing
the most frequent transition, the keys are separated into "dense"
ranges, where the density threshold is a parameter (say, 0.33 or
0.5). Then a linear or a binary search is performed over the ranges,
depending on their number (say, binary search for >= 8 ranges,
linear otherwise).

This did not result in any performance gain, although it did reduce the
code size by a decent amount.
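The dispatch such an optimizer might emit can be sketched as follows
(hypothetical code, not from my generator; the ranges here are written
out by hand, where a generator would compute them from the transition
table using the density threshold):

```c
#include <assert.h>
#include <stddef.h>

/* Keys lo..hi map to targets[base + key - lo]. */
struct range { int lo, hi, base; };

/* Hypothetical transition data: three dense key ranges. */
static const struct range ranges[] = {
    { ' ', ' ', 0 },   /* space */
    { '0', '9', 1 },   /* digits */
    { 'a', 'f', 11 },  /* a small run of letters */
};
static const int targets[] = {
    100,                                     /* ' ' -> state 100 */
    200,201,202,203,204,205,206,207,208,209, /* '0'-'9' */
    300,301,302,303,304,305                  /* 'a'-'f' */
};

static int next_state(int key)
{
    size_t lo = 0, hi = sizeof ranges / sizeof ranges[0];
    while (lo < hi) {                  /* binary search over the ranges */
        size_t mid = (lo + hi) / 2;
        if (key < ranges[mid].lo)      hi = mid;
        else if (key > ranges[mid].hi) lo = mid + 1;
        else return targets[ranges[mid].base + key - ranges[mid].lo];
    }
    return -1;                         /* default transition */
}
```

Below the range-count threshold, the generator would emit a short
if-else chain over the same ranges instead of the loop.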

> Have you tried using dynamic profiling information to get gcc to
> tune the generated code? That might result in some of the less
> frequently executed code moved some place where it won't keep
> filling the cache.

Yes, the results were obtained with gcc 4.0.3 or 4.2 as follows:
1) compilation with -march=<whatever> -O3 -fomit-frame-pointer
2) one run on sample data
3) compilation with -march=<whatever> -O3 -fomit-frame-pointer
4) timing on the same data
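Spelled out as commands, a gcc profile-guided build of that vintage
would look roughly like this (the profile flags are my assumption --
the post above lists only the common flags -- and the file names are
invented):

```shell
# Sketch of a gcc 4.x profile-guided build; flags and names assumed.
gcc -march=<whatever> -O3 -fomit-frame-pointer -fprofile-generate \
    parser.c -o parser             # 1) instrumented build
./parser sample.input              # 2) one run on sample data (writes profile)
gcc -march=<whatever> -O3 -fomit-frame-pointer -fprofile-use \
    parser.c -o parser             # 3) rebuild using the recorded profile
time ./parser sample.input         # 4) timing on the same data
```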

