From: | "Ron Pinkas" <Ron@Profit-Master.com> |
Newsgroups: | comp.compilers |
Date: | 13 Apr 2003 12:18:45 -0400 |
Organization: | PTS |
References: | 03-04-015 03-04-026 |
Keywords: | lex |
Posted-Date: | 13 Apr 2003 12:18:44 EDT |
> One can understand this by looking at the three general classes of
> tokens that exist in most programming languages.
> ...
I'm happy you brought up this point, because I was "forced" to develop a lexing
engine after recognizing that there are a few very specific classes of tokens,
but no lexing engine I was familiar with tried to offer a solution based on
this approach. After carefully reviewing a few programming languages I found
the following classes of tokens:
Delimiters
----------
These are tokens like the following in the C language:
)(][.,><!=
They are single characters that, outside the context of Streams, are
unconditional delimiters of prior input and are also tokens in their own right.
White Space
-------------
These may also be considered a subclass of Delimiters, i.e. Disposable
Delimiters. Once found outside the context of a Stream, they function as a
terminator of the prior input, but they themselves are no longer valuable and
may be disposed of.
Self Contained
---------------
These are tokens like the following in the C language:
-> ++ -- := ==
That is to say, no Delimiter is required to terminate such a token. One may
also think of this class of tokens as multi-character delimiters. Once found in
the input outside the context of a Stream, they serve as unconditional
terminators of the prior input, and are also tokens in their own right.
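As a rough illustration of how the three terminator classes described so far could drive a scanner, here is a minimal Python sketch. It is not SimpLex code; the table names (DELIMITERS, WHITE_SPACE, SELF_CONTAINED) and the function split_tokens are hypothetical, and Streams are ignored at this point.

```python
# Hypothetical token tables; names and contents are assumptions for illustration.
DELIMITERS = set(")(][.,><!=")
WHITE_SPACE = set(" \t")
# Longest entries first, so a longer Self Contained token wins over a shorter one.
SELF_CONTAINED = sorted(["->", "++", "--", ":=", "=="], key=len, reverse=True)


def split_tokens(text):
    """Split text on the terminator classes, dropping only White Space."""
    tokens = []
    pending = ""  # input accumulated since the last terminator
    i = 0

    def flush():
        nonlocal pending
        if pending:
            tokens.append(pending)
            pending = ""

    while i < len(text):
        # Try Self Contained tokens first so that "==" is not read as
        # two "=" Delimiters.
        match = next((s for s in SELF_CONTAINED if text.startswith(s, i)), None)
        if match:
            flush()
            tokens.append(match)
            i += len(match)
        elif text[i] in WHITE_SPACE:
            flush()  # terminates the prior input, then is disposed of
            i += 1
        elif text[i] in DELIMITERS:
            flush()
            tokens.append(text[i])
            i += 1
        else:
            pending += text[i]
            i += 1
    flush()
    return tokens
```

With these tables, split_tokens("a->b == c(d)") yields the pieces a, ->, b, ==, c, (, d and ). Note that a lone - is not in any table here, so it would stay part of the pending input.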
Streams
--------
These are tokens like the following in the C language:
"This is a string"
Streams are made of a Stream Prefix like " in C (' and [ or even [[ in some
other languages), followed by a stream of any number of characters, terminated
by the given *matching* Stream Terminator, like " in C (which may also be a
multi-character token).
[Comments may also be considered Streams, though they may commonly be
handled at a pre-lexing stage.]
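A Stream can be described by nothing more than a prefix and its matching terminator. The Python sketch below assumes a hypothetical STREAMS table and a hypothetical helper read_stream; it does not handle escape sequences inside a Stream, and it is not taken from SimpLex.

```python
# Hypothetical prefix -> terminator table; the entries are illustrative assumptions.
STREAMS = {'"': '"', "'": "'", "[[": "]]"}


def read_stream(text, start):
    """Return (stream_text, next_index) if a Stream begins at start, else None."""
    # Check longer prefixes first so a longer prefix wins over a shorter one
    # that begins with the same characters.
    for prefix in sorted(STREAMS, key=len, reverse=True):
        if text.startswith(prefix, start):
            terminator = STREAMS[prefix]
            end = text.find(terminator, start + len(prefix))
            if end == -1:
                raise ValueError("unterminated stream starting at %d" % start)
            end += len(terminator)
            return text[start:end], end
    return None
```

Everything between the prefix and the terminator is taken verbatim, so Delimiters and White Space inside a Stream do not terminate anything.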
End of Line
------------
These are tokens like the following in the C language:
; (and the OS-dependent New Line character)
Once found outside the context of a Stream they serve as unconditional
terminators of the prior input, and are typically used as flags, indicating
the context of a New Line.
Words
-------
These are tokens like the following in the C language:
int void signed volatile function
This class of tokens *must* be *delimited* by a pre and post Delimiter (or
disposable delimiters). While tokens of this class are considered reserved
words in the C language, they may just as well be non-reserved in other
languages, where context allows them to be non-reserved tokens.
Key Words
------------
These are tokens like the following in the C language:
static switch case while
A token of this class *must* be the *first* non-disposable token in a
given line (signified by BOF or EOL).
Elements (residuals)
---------
Any and all input found *between* the 5 kinds of unconditional terminators
(Delimiters, Self Contained, White Space, Streams, and End of Line) that
are *not* Words or Key Words are Elements of the given language, and are
usually divided into:
Identifiers
Literal Numbers
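Once the unconditional terminators have cut the input into pieces, each remaining piece only needs a table lookup. The Python sketch below is one possible reading of the Words, Key Words and Elements rules; the names WORDS, KEY_WORDS and classify are hypothetical, and Literal Numbers are simplified to plain decimal digits.

```python
# Hypothetical tables; the contents follow the examples given above.
WORDS = {"int", "void", "signed", "volatile"}
KEY_WORDS = {"static", "switch", "case", "while"}


def classify(piece, first_on_line):
    """Classify a piece of input found between unconditional terminators."""
    # Key Words only count when they are the first non-disposable token
    # on the line.
    if first_on_line and piece in KEY_WORDS:
        return "KEY_WORD"
    if piece in WORDS:
        return "WORD"
    # Simplification: only unsigned decimal integers count as numbers here.
    if piece.isdigit():
        return "NUMBER"
    return "IDENTIFIER"
```

Under these assumptions, while at the start of a line is a Key Word, but the same text later in the line falls through to the Elements and is treated as an Identifier.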
Since I regard data-driven solutions as generally superior, I developed
SimpLex (http://sourceforge.net/projects/simplex), a lexing engine that
accepts simple definitions of the above classes of tokens for a given
language and thus serves as a full-featured scanner for that language.
Such a scanner is typically about 1/4 the size of an equivalent
[F]Lex-generated scanner [marginally faster too], and does not require a
"compilation" step as is needed by [F]Lex.
Ron