Re: Buffered input for a lexer?

"Randall Hyde" <rhyde@cs.ucr.edu>
25 Mar 2002 01:14:46 -0500

          From comp.compilers

Related articles
Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? zackw@panix.com (Zack Weinberg) (2002-03-24)
Buffered input for a lexer? cfc@world.std.com (Chris F Clark) (2002-03-24)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-03-25)
Re: Buffered input for a lexer? cfc@world.std.com (Chris F Clark) (2002-03-25)
Re: Buffered input for a lexer? clint@0lsen.net (2002-03-31)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-31)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-31)
Re: Buffered input for a lexer? joachim_d@gmx.de (Joachim Durchholz) (2002-03-31)
Re: Buffered input for a lexer? cgweav@aol.com (2002-03-31)
[12 later articles]

From: "Randall Hyde" <rhyde@cs.ucr.edu>
Newsgroups: comp.compilers
Date: 25 Mar 2002 01:14:46 -0500
Organization: Prodigy Internet http://www.prodigy.com
References: 02-03-162
Keywords: lex
Posted-Date: 25 Mar 2002 01:14:46 EST

In HLA v2.0 I'm employing a combination of these techniques.


First, I use memory-mapped files (on Windows and Linux), so I don't
really have buffering problems (not to mention that memory-mapped
files are faster, based on my experiments, and they tremendously
simplify my lexer, which is written in assembly). Therefore, the only
place the lexer can run off the end of its buffer is at the end of a
given source file.
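

In C terms, the Linux side of this looks roughly like the sketch
below (my actual code is assembly, and the Windows side would use
CreateFileMapping/MapViewOfFile instead), so take it only as an
illustration of the idea:

/* Sketch only: map a source file read-only so the lexer can scan it
   as one flat buffer.  Linux/POSIX calls; error handling is minimal. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    const char *base;   /* first byte of the source text */
    size_t      size;   /* exact file size in bytes */
} SourceBuf;

static SourceBuf map_source(const char *path)
{
    SourceBuf sb = { NULL, 0 };

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); exit(1); }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); exit(1); }
    sb.size = (size_t)st.st_size;

    if (sb.size == 0) {            /* empty file: nothing to map or lex */
        close(fd);
        return sb;
    }

    /* Read-only, private mapping: the lexer never writes the source. */
    sb.base = mmap(NULL, sb.size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (sb.base == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd);                     /* the mapping remains valid */
    return sb;
}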


Like GCC and a few other compilers, I require that the source file
end with a newline (actually, any whitespace or single-character
lexeme would be fine, but I explicitly check for a newline after
mapping the file and warn the user if one isn't present (*)). If a
newline is present, then I don't have to check for the end of the
buffer except when scanning whitespace. If a newline is not present,
then I check for EOF throughout the scanning process. In the general
case, this saves two instructions per character processed (which is
very significant in the lexer, since it uses only a few instructions
to process each character anyway).
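

To make the payoff concrete, here's roughly what the two kinds of
loops look like in C (again, just a sketch of the idea, not my actual
assembly code):

/* Sketch: with a guaranteed trailing '\n', only the whitespace
   skipper needs an end-of-buffer test; the tight token loops stop at
   that newline at the latest, so they can omit the test entirely. */
#include <ctype.h>
#include <stddef.h>

static const char *skip_whitespace(const char *p, const char *end)
{
    while (p < end && isspace((unsigned char)*p))   /* the only bounds check */
        p++;
    return p;                       /* p == end means end of file */
}

static const char *scan_identifier(const char *p)
{
    /* No bounds check: '\n' is not an identifier character, so the
       guaranteed final newline terminates this loop before we can
       run past the buffer. */
    while (isalnum((unsigned char)*p) || *p == '_')
        p++;
    return p;
}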


(*) Okay, that's the theory; in practice, here's how my lexer
currently works:


(1) I check for EOLN at the end of the file when mapping the file.
    If it's not present, I warn the user that in some rare cases this
    could actually crash the compiler (yep, poor engineering, but
    I'll fix this problem later).
(2) If EOLN (or other whitespace) doesn't appear at the end of the
    file, I rely on the fact that both Linux and Windows zero-fill the
    left-over bytes of the last page of the file mapped into memory.
    I can get into trouble if the file is an exact multiple of 4,096
    bytes long and the next page in memory is unmapped (or worse,
    contains data that seems to be a continuation of the token); see
    the sketch after this list. This situation is rare enough that I'm
    willing to ignore it right now (this is the problem I noted above
    that I will fix later).
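

In C terms, the test for whether that implicit zero padding exists is
roughly the following (sysconf is the Linux spelling; Windows would
get the page size from GetSystemInfo):

/* Sketch: if the file size is not a multiple of the page size, the OS
   zero-fills the rest of the last page, so a NUL sentinel follows the
   text for free.  Only an exact multiple leaves nothing readable past
   the final byte. */
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

static bool has_implicit_sentinel(size_t file_size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);   /* typically 4096 */
    return file_size % page != 0;                  /* padding bytes exist */
}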


Now a single token can be as large as the entire file (though, as
John points out, this is unlikely in a real language).


The fix I am contemplating is to have two lexers and select one or
the other based upon the presence or absence of a newline (or other
suitable terminating character) at the end of the source file.
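

In outline, that dispatch would look something like this (lex_fast
and lex_safe are hypothetical stand-ins for the two lexers):

/* Sketch: pick the unchecked scanner when the file ends with a
   suitable terminator, otherwise fall back to the scanner that tests
   for end-of-buffer on every character. */
#include <ctype.h>
#include <stddef.h>

static void lex_fast(const char *base, size_t size)
{   /* hot loops omit end-of-buffer checks (relies on trailing newline) */
    (void)base; (void)size;
}

static void lex_safe(const char *base, size_t size)
{   /* tests for end-of-buffer on every character */
    (void)base; (void)size;
}

static void lex_source(const char *base, size_t size)
{
    if (size > 0 && isspace((unsigned char)base[size - 1]))
        lex_fast(base, size);      /* terminator present: fast path */
    else
        lex_safe(base, size);      /* no terminator: careful path */
}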


Randy Hyde

