Re: lexing backwards

Chris F Clark <>
7 Apr 2003 00:25:35 -0400

          From comp.compilers

Related articles
lexing backwards monnier+comp.compilers/news/ (Stefan Monnier) (2003-04-05)
Re: lexing backwards (2003-04-07)
Re: lexing backwards (Chris F Clark) (2003-04-07)
Re: lexing backwards (Marat Boshernitsan) (2003-04-07)
Re: lexing backwards (Stan Zaborowski) (2003-04-13)
Re: lexing backwards (Ron Pinkas) (2003-04-13)
Re: lexing backwards monnier+comp.compilers/news/ (Stefan Monnier) (2003-04-15)
Re: lexing backwards (Chris F Clark) (2003-04-15)
Re: lexing backwards (2003-05-06)
[5 later articles]
| List of all articles for this month |

From: Chris F Clark <>
Newsgroups: comp.compilers
Date: 7 Apr 2003 00:25:35 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 03-04-015
Keywords: lex
Posted-Date: 07 Apr 2003 00:25:35 EDT

There is no trick to lexing backwards. Lexing locally, however, is in
general impossible. The problem is that there are generally a class
of lexer tokens that have non-local extent.

One can understand this by looking at the three general classes of
tokens that exist in most programming languages.

1) the trivial "fixed spelling" tokens. These are tokens like "+", "++",
      "+=", "=", "==", ":=", ":", "::", "::=", "<", "<<", "<<=", etc.
      Now, as you can see from the above list, many languages have tokens
      in this class that have overlapping spellings. However, in
      general, the list of such tokens is fixed and the exact overlapping
      problems are fixed and resolvable.

2) the "character class" tokens. These are tokens like identifiers
      and numbers, where there is a set of characters which belong to the
      token and the token expands to accept any length of string
      containing those characters. Thus, "identifier" is not two
      character class tokens "id" and "entifier", but simply one. (I am
      intentionally ignoring the case of Fortran 66, where even this
      class was difficult, since it of little interest to most Emacs
      major mode writers.) These tokens are also not a problem.

3) the "delimiter" tokens. These are tokens like strings and
      comments, where the token begins at one delimiter and ends at
      another (possibly the first repeated). These are the problem
      tokens. Each delimiter potentially divides the space into two
      states: have seen the delimiter and have not. That state can only
      be calculated from a fixed reference point (say the beginning of
      the file) where the state is known. Here is an example from C that
      illustrates some text one might find on a line and not know the
      "state" except by knowing the context. You can find equivalent
      examples in almost any language that has both multi-line comments
      and multi-line strings.

      state /*" ? string : comment = unknown ;

There are space/time-tradeoffs one could make to improve the
situation, like bookmarking all the places where delimiters occur in
the file and pre-computing the "state". However, one cannot lex
puerly locally unless one is willing to mis-lex the text. If I am
correctly informed, VC++ does just that (allows itself to mis-lex),
assuming one is not in a multi-line comment unless it detects a
multi-line comment delimiter nearby.

Hope this helps,

Chris Clark Internet :
Compiler Resources, Inc. Web Site :
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.