Re: What should be check in Lexical Analyzer along with generating tokens?

"Chris F Clark" <cfc@shell01.TheWorld.com>
20 Sep 2002 02:52:16 -0400

          From comp.compilers

Related articles
What should be check in Lexical Analyzer along with generating token vikramasanjeeva@hotmail.com (Vikrama Sanjeeva) (2002-09-14)
Re: What should be check in Lexical Analyzer along with generating t joachim_d@gmx.de (Joachim Durchholz) (2002-09-19)
Re: What should be check in Lexical Analyzer along with generating cfc@shell01.TheWorld.com (Chris F Clark) (2002-09-20)
RE: What should be check in Lexical Analyzer along with qjackson@shaw.ca (Quinn Tyler Jackson) (2002-09-22)
Re: What should be check in Lexical Analyzer along with generating clint@0lsen.net (Clint Olsen) (2002-09-25)
Re: What should be check in Lexical Analyzer along with generating lars@bearnip.com (Lars Duening) (2002-10-25)
Re: What should be check in Lexical Analyzer along with generating lex@cc.gatech.edu (Lex Spoon) (2002-10-25)

From: "Chris F Clark" <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 20 Sep 2002 02:52:16 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 02-09-087 02-09-110
Keywords: lex, design
Posted-Date: 20 Sep 2002 02:52:15 EDT

Joachim Durchholz wrote:
> A lexer should avoid checks as far as possible.
...
> A lossless lexer (i.e. a lexer that emits enough information to allow a
> character-by-character reconstruction of the source) should be the most
> flexible approach.


These are both very good principles! You should probably take these
as your basic rules of thumb.


I have a third rule of thumb you can add to that list.


- Pick the division point between your lexer and parser where it makes
    the most semantic sense.


Let me now expand on the rules.


> A lexer should avoid checks as far as possible.


Keeping the lexer as simple as possible is perhaps one of the most
important factors in speeding up the overall performance of the
resulting compiler, since in most compilers more time is spent in the
lexer than in any other single part.


There are a few small exceptions to this rule. One is relevant to
what Joachim wrote.


If there are tokens which the parser *never* cares about, it can save
overall time if the lexer just drops those tokens on the floor. That,
of course, must be balanced against the flexibility of passing the
tokens on and letting a later phase throw them away, which is the 2nd
rule of Joachim's I quoted.


> A lossless lexer [is] the most flexible approach.


As parser generators have evolved, many have devised ways of allowing
their lexers to be lossless. This generally simplifies both the
lexing and the parsing phases (once a *good* mechanism to do so is
developed).
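
To make the lossless idea concrete, here is a rough sketch in Python
(my own names, not any particular generator's API): if every character
of the input, including whitespace and comments, lands in some token's
text, the source can be reconstructed by concatenating the token texts.

import re

# A minimal sketch of a lossless lexer: whitespace and comments become
# tokens too, so joining the token texts reproduces the input exactly.
TOKEN_SPEC = [
    ("COMMENT",    r"//[^\n]*"),
    ("WHITESPACE", r"\s+"),
    ("NUMBER",     r"\d+"),
    ("IDENT",      r"[A-Za-z_]\w*"),
    ("PUNCT",      r"[^\s0-9A-Za-z_]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(source)]

source = "x = 42  // the answer\n"
tokens = lex(source)
assert "".join(text for _kind, text in tokens) == source   # lossless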


One of the key issues is finding a way to preserve the tokens that
would normally be thrown away but not force the grammar writer to
clutter the grammar indicating each place the token is allowed. The
reason why most lexers throw certain tokens away is to simplify the
job of the parser, where listing all those tokens in all the places
where they *might* appear can make the grammar too large, complex, and
unreadable.


Here are some solutions that different parser generators have tried,
at least the ones I am aware of.


For example, JavaCC allows "special" tokens. These are tokens the
lexer recognizes but does not send along to the parser in the normal
token stream. Instead, the lexer attaches these tokens in a list to a
nearby non-special token. That way the grammar doesn't normally see
the special tokens, but by looking at the lists of special tokens
attached to the tokens the parser does see, the parser can recover
them.
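
A rough sketch of that idea (my own names in Python, not JavaCC's
actual API): the lexer buffers the special tokens and hangs them off
the next normal token it emits.

SPECIAL = {"WHITESPACE", "COMMENT"}

class Token:
    def __init__(self, kind, text):
        self.kind = kind
        self.text = text
        self.special_before = []   # special tokens preceding this one

def attach_specials(raw_tokens):
    """Sketch of JavaCC-style special tokens: the parser only ever sees
    the normal tokens, but each one carries the specials that preceded it."""
    pending, result = [], []
    for kind, text in raw_tokens:
        if kind in SPECIAL:
            pending.append((kind, text))
        else:
            tok = Token(kind, text)
            tok.special_before, pending = pending, []
            result.append(tok)
    eof = Token("EOF", "")            # trailing specials ride on an EOF token
    eof.special_before = pending
    result.append(eof)
    return result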


ANTLR modified that idea with the concept of "channels": each token
can be associated with one or more channels. The parser then listens
to a specific channel, and only the tokens on that channel are
reported to it. That way the parser gets only those tokens which are
relevant to it.
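
Very roughly (again my own names, not ANTLR's actual API), the channel
idea looks like this:

DEFAULT_CHANNEL, HIDDEN_CHANNEL = 0, 1
CHANNEL_OF = {"WHITESPACE": HIDDEN_CHANNEL, "COMMENT": HIDDEN_CHANNEL}

def tag_channels(raw_tokens):
    # every token is kept, but tagged with the channel it belongs to
    return [(CHANNEL_OF.get(kind, DEFAULT_CHANNEL), kind, text)
            for kind, text in raw_tokens]

def tokens_on(tagged, channel):
    return [(kind, text) for ch, kind, text in tagged if ch == channel]

raw = [("IDENT", "x"), ("WHITESPACE", " "), ("PUNCT", "="),
       ("WHITESPACE", " "), ("NUMBER", "42"), ("WHITESPACE", "  "),
       ("COMMENT", "// the answer"), ("WHITESPACE", "\n")]
tagged = tag_channels(raw)
parser_stream = tokens_on(tagged, DEFAULT_CHANNEL)  # what the parser consumes
hidden_stream = tokens_on(tagged, HIDDEN_CHANNEL)   # still there if needed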


Yacc++ has several related concepts for tokens. "Discard" tokens are
ones the lexer throws away (this can be selectively overridden).
"Ignore" tokens are ones the lexer doesn't throw away, but which the
parser will throw away when not referenced in the grammar. "Remove"
tokens are ones which affect the parse, but are thrown away before the
semantic actions.
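
A crude way to picture the three classes (this is just an
illustration, not Yacc++'s notation or implementation):

DISCARD = {"WHITESPACE"}   # never passed on by the lexer
IGNORE  = {"COMMENT"}      # passed on, dropped by the parser unless the
                           # grammar references that kind at that point
REMOVE  = {"MACRO_USE"}    # affect parsing decisions, stripped before
                           # the semantic actions run

def lexer_output(raw_tokens):
    return [(k, t) for k, t in raw_tokens if k not in DISCARD]

def parser_input(tokens, kinds_referenced_here):
    # an ignore token survives only where the grammar mentions its kind
    return [(k, t) for k, t in tokens
            if k not in IGNORE or k in kinds_referenced_here]

def stream_for_actions(parsed_tokens):
    return [(k, t) for k, t in parsed_tokens if k not in REMOVE]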


In a recent Verilog grammar, I combined a variety of the techniques.


For example, comments were discard tokens, but I used semantic code in
the lexer to thread them onto lists of "normal" tokens so as to mimic
the special token feature of JavaCC. (When we get the chance, we will
integrate those ideas into Yacc++ so that it is a standard feature
other developers can use easily.) The reason I wanted the comments to
be special tokens is that the Verilog tool the grammar was written for
preserves comments from input to output, and some of the comments can
have semantic effects.


Whitespace tokens are ignore tokens, but in the rule which handles
`define statements, I made them relevant so as to make `defines work
the way #defines do in C.
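
Sketched in the same style (hypothetical token kinds, not the actual
grammar), that selective relevance looks like this:

def filter_whitespace(tokens):
    """Whitespace is normally ignored, but kept between `define and the
    end of its line, so the macro name and macro body separate the way
    they do for #define in C.  (Hypothetical token kinds.)"""
    out, in_define = [], False
    for kind, text in tokens:
        if kind == "TICK_DEFINE":
            in_define = True
        elif kind == "NEWLINE":
            in_define = False
        if kind == "WHITESPACE" and not in_define:
            continue                       # ignored everywhere else
        out.append((kind, text))
    return out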


We made macro tokens remove tokens, since we don't want the macro
invocations appearing in the resulting semantic stream, only their
expansions.


If you read a recent posting by Ira Baxter on C preprocessing, you
will see that they go to certain lengths to handle preprocessing
tokens without throwing them all away.


- Pick the division point between your lexer and parser where it makes
    the most semantic sense.


This is the point where the resulting tokens represent complete
semantic units, but where the lexer is not actually doing any
parsing.


For example, numbers generally represent complete semantic units, so
it is best to make numbers into tokens.


The same thing is true of names (identifiers). This is especially
true if your language has keywords (specific identifiers that have
syntactic meanings).


However, lexing "identifier()" as a single token generally doesn't
make sense, as that precludes the user from having whitespace or
comments within the token. (Of course, if your language disallows
whitespace and comments there, it could make sense as a single token.)


That goes back to the previous point. If there are things which are
semantically and syntactically irrelevant, you want to split your
tokens so that the irrelevant things can be skipped *exactly* where
they are irrelevant. Thus, in most languages, since an identifier
cannot have embedded whitespace, both whitespace and identifiers
should be tokens.


Going back to the Verilog grammar I wrote, one sees this in the syntax
of "sized" numbers. The language spec specifically calls out that
whitespace is allowed between certain parts of sized numbers and so is
macro expansion on those parts. Therefore, the correct implementation
was to make each of the parts a unique token, even though they are
normally parts of a semantic unit.
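
As a sketch (simplified regular expressions, nowhere near the full
Verilog number syntax), the split looks like this; the point is that
"8'hFF" and "8 'h FF" must yield the same three tokens:

import re

SIZED_NUMBER_PARTS = [
    ("SIZE",       r"\d+"),            # e.g. 8
    ("BASE",       r"'[bodhBODH]"),    # e.g. 'h
    ("VALUE",      r"[a-fA-F0-9xzXZ_]+"),
    ("WHITESPACE", r"\s+"),
]
PART = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SIZED_NUMBER_PARTS))

def lex_sized_number(text):
    # a real lexer needs context to tell SIZE from VALUE; this sketch
    # only handles values that don't begin with a decimal digit
    return [(m.lastgroup, m.group())
            for m in PART.finditer(text) if m.lastgroup != "WHITESPACE"]

assert lex_sized_number("8'hFF") == lex_sized_number("8 'h FF") \
       == [("SIZE", "8"), ("BASE", "'h"), ("VALUE", "FF")]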


Hope this helps,
-Chris


*****************************************************************************
Chris Clark                    Internet : compres@world.std.com
Compiler Resources, Inc.       Web Site : http://world.std.com/~compres
3 Proctor Street               voice    : (508) 435-5016
Hopkinton, MA 01748 USA        fax      : (508) 435-4847 (24 hours)

