Re: Buffered input for a lexer?

"Randall Hyde" <rhyde@cs.ucr.edu>
20 Apr 2002 22:57:10 -0400

          From comp.compilers

Related articles
[14 earlier articles]
Re: Buffered input for a lexer? cgweav@aol.com (2002-04-13)
Re: Buffered input for a lexer? ralph@inputplus.co.uk (2002-04-16)
Re: Buffered input for a lexer? joachim_d@gmx.de (Joachim Durchholz) (2002-04-16)
Re: Buffered input for a lexer? cgweav@aol.com (2002-04-17)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-04-19)
Re: Buffered input for a lexer? monnier+comp.compilers/news/@RUM.cs.yale.edu (Stefan Monnier) (2002-04-19)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-04-20)
Re: Buffered input for a lexer? joachim_d@gmx.de (Joachim Durchholz) (2002-04-23)
Re: Buffered input for a lexer? bear@sonic.net (Ray Dillinger) (2002-04-23)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-04-23)

From: "Randall Hyde" <rhyde@cs.ucr.edu>
Newsgroups: comp.compilers
Date: 20 Apr 2002 22:57:10 -0400
Organization: Prodigy Internet http://www.prodigy.com
References: 02-04-061 02-04-081 02-04-097 02-04-116 02-04-117
Keywords: lex, practice
Posted-Date: 20 Apr 2002 22:57:10 EDT

Stefan Monnier wrote in message 02-04-117...
>>>>>> "Randall" == Randall Hyde <rhyde@cs.ucr.edu> writes:
>> [How about looking at the end of the file, and using a special slow
>> version of the lexer if the eof isn't a newline? -John]


That's certainly a plausible solution and could be done without
any additional expense (I could patch the jump table that hashes
off the first character of a lexeme and jumps to the code that
might fail on the lack of a delimiter).


OTOH, I've checked every HLA module I've got on hand (over
100,000 lines of code) and not a single source file even comes
close to satisfying all the constraints needed to produce the
warning from the lexer. As a professional engineer, it certainly
bothers me that I've designed in this "flaw"; in practice, however,
it rarely comes up (on the order of 1 in 10K-100K), and when it does,
the work-around for the user is trivial.


Probably what I will do is comment the source code with a suggestion
about how to deal with this problem and come back to it later, when
there are fewer pressing problems to deal with.


>You mean, copy the text, add the missing newline and go back
>to the fast version ?
>
> Stefan
>[Well, no, but that's probably a better approach. -John]


In my particular case, "copy the text" would mean copying the entire
source file: the lexer operates only on a read-only, memory-mapped
view of the file, and I don't assume that I have access to the page
of memory immediately following the last page of the source file.
Since copying the source file takes more than two instructions per
character, it would be cheaper to simply test for EOF after each
character.


Obviously, if I could write to the memory-mapped region *without
affecting the original source file* and I could extend the region by
one page, my problems would be trivially solved. But I'm working
under the assumption that such activity is not portable across
operating systems.
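
(For illustration only, here's a rough C sketch of the kind of
non-portable trick I mean, under POSIX assumptions; the function and
variable names are made up, and this is not what HLA actually does.
When a mapped file's length isn't an exact multiple of the page size,
the tail of the last page reads as zeros, which provides a free
sentinel byte; when it is an exact multiple, you'd still have to fall
back on the slow lexer.)

/* Rough POSIX-only sketch -- illustrative names, not HLA's loader.
 * The kernel zero-fills the remainder of the last page of a file
 * mapping, so the byte just past the text is a free NUL sentinel
 * whenever the file length is not a multiple of the page size. */
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const char *map_source(const char *path, size_t *len, int *has_sentinel)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return 0; }
    *len = (size_t)st.st_size;

    const char *text = mmap(0, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (text == MAP_FAILED) return 0;

    /* If the file ends exactly on a page boundary there is no padding
     * byte to read, and the caller must use the slow, EOF-checking
     * lexer instead. */
    *has_sentinel = (*len % (size_t)sysconf(_SC_PAGESIZE)) != 0;
    return text;
}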


BTW, in private email, I was asked whether those two instructions would
really make a difference given the complexity of the rest of the
lexer. To answer that question, here's the sample code sequence that
collects an identifier/reserved-word lexeme:


// Note that we've already determined that the first character is
// an alphabetic character (via a jump table [switch]).
// Also, the upper three bytes of EAX are known to contain zero at
// this point, and ESI and EDI point at the character we just
// processed in the memory-mapped source file.


HandleID:
        inc( esi );             // advance to the next character
        mov( [esi], al );       // fetch it (upper bytes of EAX stay zero)
        bt( eax, IDcset );      // carry set if it's an identifier character
        jc HandleID;            // if so, keep scanning


// At this point, EDI..ESI-1 bracket the identifier.




The code to (properly) check for EOF after each character looks like this:


HandleID:
        inc( esi );             // advance to the next character
        cmp( esi, EOF );        // stop if we've reached the end of the buffer
        jae IDdone;
        mov( [esi], al );       // fetch the character
        bt( eax, IDcset );      // carry set if it's an identifier character
        jc HandleID;            // if so, keep scanning
IDdone:


Two machine instructions may not seem like much, but going from four
instructions to six in this inner loop is significant (50% more
instructions, or 33% fewer, depending upon which direction you view
it from).


Note, btw, that the lexer *does* check for EOF before processing the
first character. The problem with this code is that it doesn't check
for EOF in the middle of an identifier; therefore, it expects some
sort of sentinel character (a non-ID character) to terminate the
identifier. If the end of the identifier falls on a page boundary and
that boundary also happens to be the EOF, problems can develop.
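
(In C terms, an approximate rendering of the two loops above would be
the following; is_id_char, p, and src_end are made-up names for this
sketch, not anything in the actual HLA source.)

/* Approximate C rendering of the two scanning loops; names are
 * illustrative only. */

/* Fast version: relies on a non-ID sentinel byte following the file. */
const char *scan_id_fast(const char *p, const unsigned char is_id_char[256])
{
    while (is_id_char[(unsigned char)*++p])
        ;                       /* no bounds check in the inner loop */
    return p;                   /* first byte past the identifier */
}

/* Safe version: also tests for end-of-buffer on every character. */
const char *scan_id_safe(const char *p, const char *src_end,
                         const unsigned char is_id_char[256])
{
    while (++p < src_end && is_id_char[(unsigned char)*p])
        ;
    return p;
}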


As John suggests, I could change the entry in the jump table that
transfers control to HandleID so that it points at the second version
rather than the first if the code determines that the special
condition exists (the source file ends with an ID and no sentinel
character follows). This would mean that once in a (really) great
while the lexer would run a little slower, but most of the time it
would run at full speed.
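
(Again purely as an illustration, in C with made-up names rather than
the lexer's real jump table, the one-time fix-up might look like the
sketch below.)

/* Sketch of patching the dispatch table once, at file-open time;
 * lex_dispatch, handle_id, handle_id_checked, src_end, and id_char
 * are illustrative names, not the real HLA lexer's symbols. */
#include <stddef.h>

typedef const char *(*lex_handler)(const char *p);

static const char *src_end;           /* one past the last source byte */
static unsigned char id_char[256];    /* nonzero for identifier chars  */
static lex_handler lex_dispatch[256]; /* indexed by first character    */

/* Fast handler: assumes a non-ID sentinel byte follows the file. */
static const char *handle_id(const char *p)
{
    while (id_char[(unsigned char)*++p])
        ;
    return p;
}

/* Slow handler: also checks for the end of the buffer on every character. */
static const char *handle_id_checked(const char *p)
{
    while (++p < src_end && id_char[(unsigned char)*p])
        ;
    return p;
}

/* Run once after mapping the file: only when the last byte is an
 * identifier character (i.e., no delimiter follows it) do the
 * ID-start entries get rerouted to the slower, EOF-checking handler. */
void fixup_dispatch(const char *src, size_t len)
{
    src_end = src + len;
    if (len > 0 && id_char[(unsigned char)src[len - 1]]) {
        for (int c = 0; c < 256; c++)
            if (lex_dispatch[c] == handle_id)
                lex_dispatch[c] = handle_id_checked;
    }
}

That keeps the common case at four instructions per character and only
pays the extra two on the rare file that actually ends mid-identifier.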


Note, btw, that the current version of HLA (v1.xx) does not use this
algorithm. Indeed, the current HLA lexer (written with FLEX) is
disgustingly slow (especially when processing macros and other code
that requires reprocessing source code the compiler has already seen).
Randy Hyde

