Re: lexing backwards

"Ron Pinkas" <Ron@Profit-Master.com>
13 Apr 2003 12:18:45 -0400

          From comp.compilers

From: "Ron Pinkas" <Ron@Profit-Master.com>
Newsgroups: comp.compilers
Date: 13 Apr 2003 12:18:45 -0400
Organization: PTS
References: 03-04-015 03-04-026
Keywords: lex
Posted-Date: 13 Apr 2003 12:18:44 EDT

> One can understand this by looking at the three general classes of
> tokens that exist in most programming languages.
> ...


I'm happy you brought up this point, because I was "forced" to develop a lexing
engine after recognizing that there are a few very specific classes of tokens,
but no lexing engine I was familiar with tried to offer a solution based on
this approach. After carefully reviewing a few programming languages I found
the following classes of tokens:


Delimiters
----------


These are tokens like the C language:


    )(][.,><!=


They are single characters that, outside of the context of Streams, are
unconditional delimiters of prior input and are also tokens on their own.


White Space
-------------


These may also be considered a sub class of Delimiters, i.e. Disposable
Delimiters. Once found outside the context of a Stream, they function as a
terminator of the prior input, but they themselves are no longer valuable, and
may be disposed of.
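
A minimal Python sketch of the two classes above. The delimiter set is the
one listed; the white space set and the function name are assumptions made
for illustration, and Streams are ignored here.

```python
DELIMITERS = set(")([].,><!=")   # single-character tokens from the list above
WHITE_SPACE = set(" \t")         # disposable delimiters (assumed set)


def split_on_delimiters(text):
    """Split text into tokens; delimiters are kept, white space is dropped."""
    tokens = []
    pending = ""                  # input accumulated since the last terminator
    for ch in text:
        if ch in DELIMITERS or ch in WHITE_SPACE:
            if pending:           # any terminator ends the prior input
                tokens.append(pending)
                pending = ""
            if ch in DELIMITERS:  # a delimiter is also a token on its own
                tokens.append(ch)
        else:
            pending += ch
    if pending:
        tokens.append(pending)
    return tokens
```

For example, split_on_delimiters("a[i] = f(x, y)") keeps every bracket,
comma and = as a token and discards the blanks.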




Self Contained
---------------


These are tokens like the C language:


      -> ++ -- := ==


That's to say no Delimiter is required to terminate such a token. One may also
think of this class of tokens as Multi-Character delimiters. Once found in
the input outside the context of a Stream, they serve as unconditional
terminators of the prior input, and are also tokens on their own.
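
A small Python sketch of how such tokens can cut the input without any
surrounding delimiter. The token list is the one above; the function names
are hypothetical, and trying longer spellings first is an assumption of this
sketch, not something stated above.

```python
SELF_CONTAINED = ["->", "++", "--", ":=", "=="]


def match_self_contained(text, pos):
    """Return the longest self-contained token starting at pos, or None."""
    for token in sorted(SELF_CONTAINED, key=len, reverse=True):
        if text.startswith(token, pos):
            return token
    return None


def split_on_self_contained(text):
    """Split text wherever a self-contained token occurs, keeping the token."""
    tokens, pending, pos = [], "", 0
    while pos < len(text):
        token = match_self_contained(text, pos)
        if token:
            if pending:               # the token terminates the prior input
                tokens.append(pending)
                pending = ""
            tokens.append(token)      # and is a token on its own
            pos += len(token)
        else:
            pending += text[pos]
            pos += 1
    if pending:
        tokens.append(pending)
    return tokens
```

With this, "p->count++" splits into p, ->, count and ++ even though it
contains no white space.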


Streams
--------


These are tokens like the C language:


      "This is a string"


Streams are made of a Stream Prefix like " in C (' and [ or even [[ in some
other languages), followed by a stream of any number of characters, terminated
by the given *matching* Stream Terminator, like " in C (which may also be a
multi-character token).


[Comments may also be considered streams, though they are commonly
handled at a pre-lexing stage.]
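
A Python sketch of reading one Stream. The prefix/terminator table and the
function name are hypothetical, chosen only to show single- and
multi-character pairs; escape sequences inside the stream are deliberately
not handled.

```python
# Stream Prefix -> matching Stream Terminator (illustrative table)
STREAM_PAIRS = {'"': '"', "'": "'", "[[": "]]"}


def read_stream(text, pos):
    """If a stream starts at pos, return (stream_text, next_pos); else None."""
    for prefix in sorted(STREAM_PAIRS, key=len, reverse=True):
        if text.startswith(prefix, pos):
            terminator = STREAM_PAIRS[prefix]
            end = text.find(terminator, pos + len(prefix))
            if end == -1:
                raise ValueError(f"unterminated stream starting at offset {pos}")
            end += len(terminator)
            # Everything between prefix and terminator is taken verbatim,
            # so delimiters and white space inside it are not tokens.
            return text[pos:end], end
    return None
```

Called at the opening quote of x = "a, b" + y, it returns the whole quoted
text as one token, comma and blank included.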


End of Line
------------


These tokens are like the C language:


      ; (and the OS-dependent New Line character)


Once found outside the context of a Stream they serve as unconditional
terminators of the prior input, and are typically used as flags, indicating
the context of a New Line.


Words
-------


These are tokens like the C language:


      int void signed volatile function


This class of tokens *must* be *delimited* by a pre and post Delimiter (or
disposable delimiters). While this class of tokens is considered reserved
words in the C language they may just as well be non reserved in other
languages, where context allows them to be non reserved tokens.


Key Words
------------


These are tokens like the C language:


      static switch case while


This class of tokens *has* to be the *first* non disposable token in a
given line (signified by BOF or EOL).
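
The difference between Words and Key Words can be sketched in Python as
follows. The two sets are the examples given above; the function names and
the kind labels are hypothetical, and the line is split on white space only
to keep the sketch short.

```python
WORDS = {"int", "void", "signed", "volatile", "function"}
KEY_WORDS = {"static", "switch", "case", "while"}


def classify(token, first_in_line):
    """Classify one already delimited token."""
    if first_in_line and token in KEY_WORDS:
        return "KEY_WORD"      # only valid as the first token of a line
    if token in WORDS:
        return "WORD"          # valid anywhere, as long as it is delimited
    return "ELEMENT"           # everything else is residual input


def classify_line(line):
    """Classify every white space separated token of a single line."""
    tokens = line.split()
    return [(tok, classify(tok, index == 0)) for index, tok in enumerate(tokens)]
```

So "while x" yields a Key Word followed by an Element, whereas in "x while"
the same spelling is just another Element.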


Elements (residuals)
---------


Any and all input found *between* the 5 kinds of unconditional terminators
( Delimiters, Self Contained, White Space, Streams, and End of Line ) that
is *not* a Word or Key Word is an Element of the given language. Elements
are usually divided into:


      Identifiers
      Literal Numbers


Since I regard data driven solutions to be generally superior
solutions, I developed SimpLex
(http://sourceforge.net/projects/simplex), a Lexing engine that accepts
simple definitions of the above classes of tokens for a given language
and thus serves as a full featured scanner for that language.
Such a scanner is typically about 1/4 the size of an equivalent [F]Lex
generated Scanner [marginally faster too], and does not require a
"compilation" step as is needed by [F]Lex.
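
To illustrate the data driven idea in general terms, here is a Python sketch
of a single scanning loop that is configured only by a table of the classes
above. This is not SimpLex's actual definition format or API; the dictionary
keys, the kind labels and the name scan are all hypothetical, and the
longest-first matching order is an assumption.

```python
C_LIKE = {                      # the examples used in this post, as data
    "delimiters": list(")(][.,><!="),
    "white_space": " \t",
    "self_contained": ["->", "++", "--", ":=", "=="],
    "streams": {'"': '"'},      # Stream Prefix -> Stream Terminator
    "end_of_line": [";", "\n"],
    "words": {"int", "void", "signed", "volatile", "function"},
    "key_words": {"static", "switch", "case", "while"},
}


def scan(definition, text):
    """Tokenize text using only the data in `definition`."""
    # Longer spellings are tried first so that "==" wins over "=".
    fixed = sorted(
        [(s, "SELF_CONTAINED") for s in definition["self_contained"]]
        + [(s, "DELIMITER") for s in definition["delimiters"]]
        + [(s, "END_OF_LINE") for s in definition["end_of_line"]],
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    tokens, pending, pos, line_start = [], "", 0, True

    def flush():
        """Emit the input collected since the last terminator, if any."""
        nonlocal pending, line_start
        if not pending:
            return
        if line_start and pending in definition["key_words"]:
            kind = "KEY_WORD"
        elif pending in definition["words"]:
            kind = "WORD"
        else:
            kind = "ELEMENT"
        tokens.append((kind, pending))
        pending, line_start = "", False

    while pos < len(text):
        prefix = next((p for p in definition["streams"] if text.startswith(p, pos)), None)
        if prefix is not None:                    # Stream: copy up to terminator
            flush()
            terminator = definition["streams"][prefix]
            end = text.find(terminator, pos + len(prefix))
            if end == -1:
                raise ValueError(f"unterminated stream at offset {pos}")
            end += len(terminator)
            tokens.append(("STREAM", text[pos:end]))
            pos, line_start = end, False
            continue
        if text[pos] in definition["white_space"]:  # disposable delimiter
            flush()
            pos += 1
            continue
        match = next(((s, k) for s, k in fixed if text.startswith(s, pos)), None)
        if match is not None:                     # unconditional terminator
            flush()
            tokens.append((match[1], match[0]))
            line_start = match[1] == "END_OF_LINE"
            pos += len(match[0])
            continue
        pending += text[pos]                      # residual input
        pos += 1
    flush()
    return tokens
```

Supporting another language in this sketch means passing a different
dictionary; the loop itself does not change, and nothing has to be generated
or compiled beforehand.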


Ron

