Related articles |
---|
lexing backwards monnier+comp.compilers/news/@rum.cs.yale.edu (Stefan Monnier) (2003-04-05) |
Re: lexing backwards haberg@math.su.se (2003-04-07) |
Re: lexing backwards cfc@world.std.com (Chris F Clark) (2003-04-07) |
Re: lexing backwards maratb@cs.berkeley.edu (Marat Boshernitsan) (2003-04-07) |
Re: lexing backwards stan@zaborowski.org (Stan Zaborowski) (2003-04-13) |
Re: lexing backwards Ron@Profit-Master.com (Ron Pinkas) (2003-04-13) |
Re: lexing backwards monnier+comp.compilers/news/@rum.cs.yale.edu (Stefan Monnier) (2003-04-15) |
Re: lexing backwards cfc@TheWorld.com (Chris F Clark) (2003-04-15) |
Re: lexing backwards genew@mail.ocis.net (2003-05-06) |
[5 later articles] |
From: | Chris F Clark <cfc@world.std.com> |
Newsgroups: | comp.compilers |
Date: | 7 Apr 2003 00:25:35 -0400 |
Organization: | The World Public Access UNIX, Brookline, MA |
References: | 03-04-015 |
Keywords: | lex |
Posted-Date: | 07 Apr 2003 00:25:35 EDT |
There is no trick to lexing backwards. Lexing locally, however, is in
general impossible. The problem is that there are generally a class
of lexer tokens that have non-local extent.
One can understand this by looking at the three general classes of
tokens that exist in most programming languages.
1) the trivial "fixed spelling" tokens. These are tokens like "+", "++",
"+=", "=", "==", ":=", ":", "::", "::=", "<", "<<", "<<=", etc.
Now, as you can see from the above list, many languages have tokens
in this class that have overlapping spellings. However, in
general, the list of such tokens is fixed and the exact overlapping
problems are fixed and resolvable.
2) the "character class" tokens. These are tokens like identifiers
and numbers, where there is a set of characters which belong to the
token and the token expands to accept any length of string
containing those characters. Thus, "identifier" is not two
character class tokens "id" and "entifier", but simply one. (I am
intentionally ignoring the case of Fortran 66, where even this
class was difficult, since it of little interest to most Emacs
major mode writers.) These tokens are also not a problem.
3) the "delimiter" tokens. These are tokens like strings and
comments, where the token begins at one delimiter and ends at
another (possibly the first repeated). These are the problem
tokens. Each delimiter potentially divides the space into two
states: have seen the delimiter and have not. That state can only
be calculated from a fixed reference point (say the beginning of
the file) where the state is known. Here is an example from C that
illustrates some text one might find on a line and not know the
"state" except by knowing the context. You can find equivalent
examples in almost any language that has both multi-line comments
and multi-line strings.
state /*" ? string : comment = unknown ;
There are space/time-tradeoffs one could make to improve the
situation, like bookmarking all the places where delimiters occur in
the file and pre-computing the "state". However, one cannot lex
puerly locally unless one is willing to mis-lex the text. If I am
correctly informed, VC++ does just that (allows itself to mis-lex),
assuming one is not in a multi-line comment unless it detects a
multi-line comment delimiter nearby.
Hope this helps,
-Chris
*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
Return to the
comp.compilers page.
Search the
comp.compilers archives again.