Buffered input for a lexer?

Chris Lattner <sabre@nondot.org>
24 Mar 2002 11:07:52 -0500

          From comp.compilers

Related articles
Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? zackw@panix.com (Zack Weinberg) (2002-03-24)
Buffered input for a lexer? cfc@world.std.com (Chris F Clark) (2002-03-24)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-03-25)
Re: Buffered input for a lexer? cfc@world.std.com (Chris F Clark) (2002-03-25)
[17 later articles]
| List of all articles for this month |

From: Chris Lattner <sabre@nondot.org>
Newsgroups: comp.compilers
Date: 24 Mar 2002 11:07:52 -0500
Organization: University of Illinois at Urbana-Champaign
Keywords: lex, question
Posted-Date: 24 Mar 2002 11:07:52 EST

Are there any well known techniques that are useful to provide buffered input
for a lexer, without imposing arbitrary restrictions on token size? As I see
it, a traditional DFA based lexer has a few options to simplify things:


1. Disallow a specific character (usually ascii 0) in the input. The buffer
      can be terminated with this character so that overruns are easily detected.
      If an overrun occurs, the buffer can be extended, and the lexer restarted.
      This has the disadvantage of preventing the null character from being
      lexed.
2. Impose an arbitrary restriction on token size (I think lex does this, for
      example). Thus as long as your buffer is bigger than the max token size,
      the lexer doesn't have to worry about anything.
3. Add an explicit check to the DFA traversal loop to check to see if the
      input buffer has been overrun. In this setup, you represent the input as
      a range of characters, allowing you to recognize all characters as input,
      but at a fairly significant runtime cost. I imagine that creative loop
      unrolling would help mitigate the cost.


Alternatives #1 & #2 seem especially unattractive to me, and although #3
would work, it seems that there must be a better way. Has there been any
research into this area in the past? Is there any defacto standard way to
approach this issue?


Thanks,


-Chris


http://www.nondot.org/~sabre
[Flex uses a pair of large input buffers, 16K by default, with each token
having to be smaller than a buffer. For anything vaguely resembling a
programming language, I'd think a 16K token limit wouldn't be a problem.
If you really need to handle something longer, use start states and have
the lexer recognize it as several tokens that you then paste together in
your C code. -John]







Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.