From: Hans Aberg <haberg@matematik.su.se>
Newsgroups: comp.compilers
Date: 30 Jun 2004 23:13:59 -0400
Organization: Compilers Central
References: 04-06-122
Keywords: parse
Posted-Date: 30 Jun 2004 23:13:59 EDT
Clint Olsen <clint@0lsen.net> wrote:
>... your comment about matching long input doesn't jive with the
>flex documentation. In fact, you usually *want* to match the longest
>input possible to spend as little time as possible doing work per
>character of input. The flex documentation goes to great lengths to
>show how refactoring can affect performance in a positive way at the
>expense of some more verbose and seemingly redundant regular
>expressions. The LOPLAS article on re2c also suggests this technique.
>It can be likened to loop unrolling in the regular-expression domain.
...
>I do seem to recall that the flex developers have dealt with the
>historical problem of keeping track of yylineno affecting lexer
>performance. The long input match problem you mentioned seems to
>indicate a problem with their input buffering mechanism.
These matters have (of course) been thoroughly discussed in the Help-flex
mailing list:
help-flex@gnu.org
http://mail.gnu.org/mailman/listinfo/help-flex
And, indeed, I also think that the long-rule matching problem suggests
a buffering problem in flex. The problem has been noted; if it is not
fixed by now, it will be in the future. Until then, the case where
tokenization makes a significant difference is comments: there it is
far faster to scan a multi-line comment line by line than to do it in
one whole chunk. See the thread "Nested comments" in the Flex mailing
list, especially Akim Demaille's suggestion of 2003-07-03. I gather
the same applies if one admits multi-line strings: it is probably
faster to scan them line by line as well, and join the segments
together in the lexer actions. As for yylineno tracking, turning it
on or off made no significant difference in lexing performance in my
example.
Here is the Flex code I use for nestable [* ... *] style comments:
"[*"        { BEGIN(comment); comment_level = 1; }

<comment>{ /* Comments. */
  "*]"      { /* End of the comment. */
              if (--comment_level == 0) {
                BEGIN(INITIAL);
              }
            }
  "[*"      { comment_level++; }
  [^*[\n]+  {}
  \n+       {}
  .         { /* Stray `*' and `['. */ }
  <<EOF>>   {
              std::cerr << "Error: Nested comments not properly closed"
                           " at end of file.\n";
              BEGIN(INITIAL); return YYERRCODE;
              /* exit_set (exit_scan); */
            }
}

"*]"        { std::cerr << "Error: Too many comment closings *].\n";
              BEGIN(INITIAL); return YYERRCODE; }
As you can see, the rule

[^*[\n]+

matches only within a single line (it excludes \n), so the comment is
scanned line by line.
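The same line-by-line idea carries over to the multi-line strings
mentioned above. Here is a sketch of how one might do it; this is not
from the original post, and the start condition `strlit`, the buffer
`string_buf`, and the token name `STRING` are illustrative names I
have made up (the token would normally come from the parser's token
definitions):

```lex
%{
#include <string>
std::string string_buf;  /* accumulates the string, segment by segment */
%}
%x strlit

%%
\"            { BEGIN(strlit); string_buf.clear(); }

<strlit>{
  [^"\\\n]+   { string_buf.append(yytext, yyleng); }  /* one line segment */
  \\n         { string_buf += '\n'; }  /* a sample escape sequence */
  \n          { string_buf += '\n'; }  /* join segments across lines */
  \"          { BEGIN(INITIAL);        /* whole string is in string_buf */
                return STRING; }
}
%%
```

As with the comment rules, the segment pattern excludes \n, so each
match stays within one line and the pieces are joined in the actions.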
Hans Aberg