Re: lexer speed, was Bison

"BartC" <bc@freeuk.com>
Tue, 21 Aug 2012 17:39:38 +0100

          From comp.compilers

Related articles
Bison deterministic LALR(1) parser for Java/C hsad005@gmail.com (2012-08-17)
Re: Bison deterministic LALR(1) parser f... DrDiettrich1@aol.com (Hans-Peter Diettrich) (2012-08-18)
Re: lexer speed, was Bison DrDiettrich1@aol.com (Hans-Peter Diettrich) (2012-08-20)
Re: lexer speed, was Bison DrDiettrich1@aol.com (Hans-Peter Diettrich) (2012-08-20)
Re: lexer speed, was Bison cr88192@hotmail.com (BGB) (2012-08-20)
Re: lexer speed, was Bison DrDiettrich1@aol.com (Hans-Peter Diettrich) (2012-08-21)
Re: lexer speed, was Bison bc@freeuk.com (BartC) (2012-08-21)

From: "BartC" <bc@freeuk.com>
Newsgroups: comp.compilers
Date: Tue, 21 Aug 2012 17:39:38 +0100
Organization: A noiseless patient Spider
References: 12-08-005 12-08-006 12-08-008
Keywords: parse, lex, performance
Posted-Date: 21 Aug 2012 20:58:29 EDT

"Hans-Peter Diettrich" <DrDiettrich1@aol.com> wrote in message
>> [Compilers spend a lot of time in the lexer, because that's the only
>> phase that has to look at the input one character at a time. -John]
>
> When the source code resides in a memory buffer, the time for reading
> e.g. the characters of an identifier (in the lexer) is neglectable vs.
> the time spent in lookup and entering the identifier into a symbol table
> (in the parser).
>
> Even if a lexer reads single characters from a file, most OSs maintain
> their own file buffer, so that little overhead is added over the
> program-buffered solution.
>
> I really would like to see some current benchmarks about the behaviour
> of current compilers and systems.


I was recently testing a bare lexer for a new language with a near-identical
syntax.


This test, which excluded file input and any identifier lookups (just
tokenising), ran at 17M chars/second, 3M tokens/second, and some 800K
lines/second, on a low-end desktop PC. At that rate, tokenising the entire
compiler might take 25msec.


If that represented the bulk of the compilation time, then I'd be quite
happy! But I suspect it will be just a fraction.


While lexers do have to deal with a character at a time, the processing
might be quite simple (e.g. merely six instructions for each character of an
identifier in my version).


Oh, I should mention that this lexer is written in an interpreted,
dynamically typed language. That means a native code version might process
some 20M tokens per second (3 or 4 msec to scan the entire compiler). That
doesn't really suggest it's going to be a bottleneck!


> [The benchmarks I did were a while ago, but they showed a large
> fraction of time in the lexer.
...
  -John]


Testing an older compiler (which doesn't lend itself to isolating just the
bare tokeniser) showed it spent about 50% of compilation time on
tokenising, lookups and parsing. That compiler generates simple byte-code;
for a more sophisticated one with more passes, optimisation and so on, the
fraction would likely be even smaller.


(But it's also possible my compilers are highly inefficient apart from the
tokenising parts...)


--
Bartc

