Buffered input for a lexer?

Chris F Clark <cfc@world.std.com>
24 Mar 2002 12:26:56 -0500

From comp.compilers

Related articles
Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? zackw@panix.com (Zack Weinberg) (2002-03-24)
*Buffered input for a lexer? cfc@world.std.com (Chris F Clark)* (2002-03-24)**
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-03-25)
Re: Buffered input for a lexer? cfc@world.std.com (Chris F Clark) (2002-03-25)
Re: Buffered input for a lexer? clint@0lsen.net (2002-03-31)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-31)
[15 later articles]

| List of all articles for this month |

From:	Chris F Clark <cfc@world.std.com>
Newsgroups:	comp.compilers
Date:	24 Mar 2002 12:26:56 -0500
Organization:	Compilers Central
References:	02-03-162
Keywords:	lex, design
Posted-Date:	24 Mar 2002 12:26:56 EST

Chris Lattner asked:

> Are there any well known techniques that are useful to provide buffered input
> for a lexer, without imposing arbitrary restrictions on token size? As I see
> it, a traditional DFA based lexer has a few options to simplify things:
>
> 1. Disallow a specific character (usually ascii 0) in the input. The buffer
> can be terminated with this character so that overruns are easily detected.
> If an overrun occurs, the buffer can be extended, and the lexer restarted.
> This has the disadvantage of preventing the null character from being
> lexed.
> 2. Impose an arbitrary restriction on token size (I think lex does this, for
> example). Thus as long as your buffer is bigger than the max token size,
> the lexer doesn't have to worry about anything.
> 3. Add an explicit check to the DFA traversal loop to check to see if the
> input buffer has been overrun. In this setup, you represent the input as
> a range of characters, allowing you to recognize all characters as input,
> but at a fairly significant runtime cost. I imagine that creative loop
> unrolling would help mitigate the cost.

You can actually combine techniques 1 & 3 quite effectively. The
Yacc++ lexer will do this at revision 2.5 (we have a version of that
working but not released, because it isn't that target feature of the
2.5 release, just a small performance enhancement).

Have a specific character that acts as a termination character, but
don't disallow it at in the input. Only use it as a key when to check
the input buffer length. This is similar to your loop unrooling
technique for optimizing it. If you pick a "rare" character, one that
you don't expect to see in the input very often, you will do the check
for buffer full infrequently. The \0 character is often a good
choice, because it turns the entire buffer into a C string, and is
rarely used in C strings.

Note, you aren't really checking for input buffer overruns with the
termination character, just to see if the current token is bigger than
the buffer and you need to resize the buffer to read the whole token
in.

Input buffer overruns (that is writing past the end of the buffer)
should never occur in a well designed lexer because the lexer should
determine when to read characters in and never read more characters in
than can fit in the buffer (expanding the buffer if needed to fit the
current token in). This should even be true, if you write a "push"
style (event driven) lexer, as the Yacc++ lexer is. You can push as
many characters as you want into the Yacc++ lexer (provided that the
system will allocate storage for them) and the lexer grows the buffer
as needed to accept them. The termination sequence at the end of the
buffer is not part of the input you have pushed and only there to aid
the implementation. If you don't violate the published interface, one
cannot write beyond the end of the buffer while reading in characters.

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
------------------------------------------------------------------------------

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Buffered input for a lexer?

Chris F Clark <cfc@world.std.com>24 Mar 2002 12:26:56 -0500

Chris F Clark <cfc@world.std.com>
24 Mar 2002 12:26:56 -0500