Re: lexing backwards

"Ron Pinkas" <Ron@Profit-Master.com>
13 Apr 2003 12:18:45 -0400

From comp.compilers



From:	"Ron Pinkas" <Ron@Profit-Master.com>
Newsgroups:	comp.compilers
Date:	13 Apr 2003 12:18:45 -0400
Organization:	PTS
References:	03-04-015 03-04-026
Keywords:	lex
Posted-Date:	13 Apr 2003 12:18:44 EDT

> One can understand this by looking at the three general classes of
> tokens that exist in most programming languages.
> ...

I'm glad you brought up this point, because I was "forced" to develop a lexing
engine after recognizing that there are a few very specific classes of tokens,
but no lexing engine I was familiar with tried to offer a solution based on
this approach. After carefully reviewing a few programming languages I found
the following classes of tokens:

Delimiters
----------

These are tokens like the following in the C language:

    )(][.,><!=

They are single characters that, outside the context of a Stream, are
unconditional delimiters of the prior input and are also tokens in their own
right.

White Space
-------------

These may also be considered a subclass of Delimiters, i.e. Disposable
Delimiters. Once found outside the context of a Stream, they function as
terminators of the prior input, but they themselves are no longer valuable and
may be disposed of.

Self Contained
---------------

These are tokens like the following (mostly from the C language; := comes
from Pascal-family languages):

      -> ++ -- := ==

That is to say, no Delimiter is required to terminate such a token. One may
also think of this class of tokens as Multi-Character delimiters. Once found in
the input outside the context of a Stream, they serve as unconditional
terminators of the prior input, and are also tokens in their own right.
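
As a rough illustration of the three classes so far (Delimiters, White
Space, Self Contained), here is a minimal Python sketch. It is not taken from
any existing engine: the table names, the split_tokens function and the
"RESIDUAL" label are all hypothetical, the token sets are simply the examples
given above, and Streams are ignored entirely.

```python
# Hypothetical tables, filled with the example tokens from the text above.
DELIMITERS = set(")([].,><!=")
WHITE_SPACE = set(" \t")
SELF_CONTAINED = ["->", "++", "--", ":=", "=="]


def split_tokens(text):
    """Split text on unconditional terminators (Streams are not handled)."""
    # Longest Self Contained tokens are tried first.
    multi = sorted(SELF_CONTAINED, key=len, reverse=True)
    tokens = []
    pending = ""  # input collected since the last terminator
    i = 0
    while i < len(text):
        # Self Contained tokens are tried before single-character
        # Delimiters, so "==" is not read as two "=" tokens.
        match = next((t for t in multi if text.startswith(t, i)), None)
        if match is not None:
            kind, lexeme = "SELF_CONTAINED", match
        elif text[i] in DELIMITERS:
            kind, lexeme = "DELIMITER", text[i]
        elif text[i] in WHITE_SPACE:
            kind, lexeme = "WHITE_SPACE", text[i]
        else:
            pending += text[i]
            i += 1
            continue
        # Any terminator ends the prior input.
        if pending:
            tokens.append(("RESIDUAL", pending))
            pending = ""
        # White Space terminates but is itself disposed of.
        if kind != "WHITE_SPACE":
            tokens.append((kind, lexeme))
        i += len(lexeme)
    if pending:
        tokens.append(("RESIDUAL", pending))
    return tokens
```

Calling split_tokens("a->b == c(1)") yields the residuals a, b, c and 1
interleaved with the tokens ->, ==, ( and ), with no White Space tokens in the
result.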

Streams
--------

These are tokens like the following in the C language:

      "This is a string"

Streams are made of a Stream Prefix, like " in C (or ' and [ or even [[ in
some other languages), followed by a stream of any number of characters,
terminated by the given *matching* Stream Terminator, like " in C (which may
also be a multi-character token).

[Comments may also be considered Streams, though they are often handled at a
pre-lexing stage.]
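
The prefix/terminator pairing can be sketched as a small lookup table. The
following Python fragment is only an illustration under stated assumptions:
the STREAMS table and the read_stream function are hypothetical names, the
pairs are example values, and escape sequences inside a Stream (such as \" in
C) are deliberately not handled.

```python
# Hypothetical Stream Prefix -> Stream Terminator table (example pairs only).
STREAMS = {'"': '"', "'": "'", "[[": "]]"}


def read_stream(text, start):
    """Return (prefix, body, end_index) if a Stream begins at start, else None.

    end_index is the position just past the matching terminator.
    Escape sequences inside the Stream are not handled in this sketch.
    """
    # Longer prefixes are tried first, so "[[" would win over a "[" entry.
    for prefix in sorted(STREAMS, key=len, reverse=True):
        if text.startswith(prefix, start):
            terminator = STREAMS[prefix]
            body_start = start + len(prefix)
            end = text.find(terminator, body_start)
            if end == -1:
                raise ValueError(f"unterminated stream opened by {prefix!r}")
            # Everything between prefix and terminator is taken verbatim,
            # so delimiters inside the Stream do not terminate anything.
            return prefix, text[body_start:end], end + len(terminator)
    return None
```

For example, read_stream('x = "a == b";', 4) returns the body a == b without
treating the == inside it as a token, and read_stream("abc", 0) returns None
because no Stream Prefix is found at that position.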

End of Line
------------

These are tokens like the following in the C language:

      ; (and the OS-dependent New Line character)

Once found outside the context of a Stream they serve as unconditional
terminators of the prior input, and are typically used as flags, indicating
the context of a New Line.

Words
-------

These are tokens like the following (mostly from the C language; function is
a word in some other languages, not in C):

      int void signed volatile function

This class of tokens *must* be *delimited* by a preceding and a following
Delimiter (or disposable delimiter). While tokens of this class are reserved
words in the C language, they may just as well be non-reserved in other
languages, where context allows it.

Key Words
------------

These are tokens like the following in the C language:

      static switch case while

Tokens of this class *have* to be the *first* non-disposable token in a
given line (the start of a line being signified by BOF or a preceding EOL).

Elements (residuals)
---------

Any and all input found *between* the 5 kinds of unconditional terminators
(Delimiters, Self Contained, White Space, Streams, and End of Line) that
is *not* a Word or Key Word is an Element of the given language. Elements are
usually divided into:

      Identifiers
      Literal Numbers
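
One possible way to classify the input left between terminators is sketched
below in Python. Everything here is an assumption made for illustration: the
classify function, the WORDS and KEY_WORDS tables, the two regular
expressions, and in particular the choice to treat a Key Word that is not
first in its line as an ordinary identifier.

```python
import re

# Hypothetical tables, filled with the example tokens from the text above.
WORDS = {"int", "void", "signed", "volatile"}
KEY_WORDS = {"static", "switch", "case", "while"}

# Assumed shapes for the two kinds of Elements; real languages differ.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
NUMBER = re.compile(r"[0-9]+")


def classify(text, first_in_line):
    """Classify one piece of input found between two terminators.

    first_in_line is True when no non-disposable token has been seen
    since BOF or the last End of Line token.
    """
    if first_in_line and text in KEY_WORDS:
        return "KEY_WORD"
    if text in WORDS:
        return "WORD"
    # Whatever is neither a Word nor a Key Word is an Element.
    if IDENTIFIER.fullmatch(text):
        return "IDENTIFIER"
    if NUMBER.fullmatch(text):
        return "NUMBER"
    return "UNKNOWN"
```

With these tables, classify("while", True) gives "KEY_WORD", while
classify("while", False) falls through to "IDENTIFIER"; a language that
reserves its Key Words everywhere would instead report an error at that point.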

Since I regard data-driven solutions as generally superior, I developed
SimpLex (http://sourceforge.net/projects/simplex), a lexing engine that
accepts simple definitions of the above classes of tokens for a given
language and thus serves as a full-featured scanner for that language.
Such a scanner is typically about 1/4 the size of an equivalent [F]Lex
generated scanner [marginally faster too], and does not require a
"compilation" step as is needed by [F]Lex.

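To make the data-driven idea concrete, here is a small Python sketch in which
one fixed scanning loop is driven purely by per-language tables. This is not
SimpLex's definition format or its code: the scan function, the dictionary
keys and the two example language tables are hypothetical, and Key Words and
escape sequences are left out to keep it short.

```python
def scan(text, lang):
    """Tokenize text using only the tables in lang (a plain dict)."""
    # All terminators in one list, longest first, so "==" beats "=".
    terminators = sorted(
        [(t, "SELF_CONTAINED") for t in lang["self_contained"]]
        + [(t, "DELIMITER") for t in lang["delimiters"]]
        + [(t, "WHITE_SPACE") for t in lang["white_space"]]
        + [(t, "END_OF_LINE") for t in lang["end_of_line"]],
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    tokens, pending, i = [], "", 0

    def flush():
        # Emit the input collected since the last terminator, if any.
        nonlocal pending
        if pending:
            kind = "WORD" if pending in lang["words"] else "ELEMENT"
            tokens.append((kind, pending))
            pending = ""

    while i < len(text):
        # Streams are checked first; their bodies are taken verbatim.
        prefix = next((p for p in lang["streams"] if text.startswith(p, i)), None)
        if prefix is not None:
            close = lang["streams"][prefix]
            end = text.find(close, i + len(prefix))
            if end == -1:
                raise ValueError("unterminated stream")
            flush()
            tokens.append(("STREAM", text[i + len(prefix):end]))
            i = end + len(close)
            continue
        hit = next(((t, k) for t, k in terminators if text.startswith(t, i)), None)
        if hit is None:
            pending += text[i]
            i += 1
            continue
        flush()
        if hit[1] != "WHITE_SPACE":  # disposable delimiters are dropped
            tokens.append((hit[1], hit[0]))
        i += len(hit[0])
    flush()
    return tokens


# Two hypothetical language definitions sharing the same engine.
C_LIKE = {
    "delimiters": list(")([].,><!="),
    "white_space": [" ", "\t"],
    "self_contained": ["->", "++", "--", "=="],
    "end_of_line": [";", "\n"],
    "streams": {'"': '"'},
    "words": {"int", "void", "signed", "volatile"},
}
PASCAL_LIKE = dict(C_LIKE, self_contained=[":="], streams={"'": "'"},
                   words={"begin", "end"})
```

Switching from scan('int x == "a b";', C_LIKE) to scan("x := 'hi'", PASCAL_LIKE)
changes only the tables passed in, not the scanning loop, which is the sense
in which no separate generation or "compilation" step is needed in this sketch.
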
Ron
