Related articles
Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? zackw@panix.com (Zack Weinberg) (2002-03-24)
Buffered input for a lexer? cfc@world.std.com (Chris F Clark) (2002-03-24)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-03-25)
Re: Buffered input for a lexer? cfc@world.std.com (Chris F Clark) (2002-03-25)
[17 later articles]
From: Chris Lattner <sabre@nondot.org>
Newsgroups: comp.compilers
Date: 24 Mar 2002 11:07:52 -0500
Organization: University of Illinois at Urbana-Champaign
Keywords: lex, question
Posted-Date: 24 Mar 2002 11:07:52 EST
Are there any well-known techniques for providing buffered input to a lexer
without imposing arbitrary restrictions on token size? As I see it, a
traditional DFA-based lexer has a few options to simplify things:
1. Disallow a specific character (usually ASCII 0) in the input. The buffer
   can be terminated with this character so that overruns are easily detected.
   If an overrun occurs, the buffer can be extended and the lexer restarted.
   This has the disadvantage of preventing the NUL character from being
   lexed.
2. Impose an arbitrary restriction on token size (I think lex does this, for
example). Thus as long as your buffer is bigger than the max token size,
the lexer doesn't have to worry about anything.
3. Add an explicit check to the DFA traversal loop to detect when the
   input buffer has been overrun. In this setup, you represent the input as
   a range of characters, allowing you to recognize all characters as input,
   but at a fairly significant runtime cost. I imagine that creative loop
   unrolling would help mitigate the cost.
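For concreteness, option #1 might look roughly like the following sketch (all names here, such as `struct input`, `refill`, and `lex_ident`, are invented for illustration, and error handling is elided):

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the sentinel technique: the buffer always carries a '\0'
 * after its valid bytes.  The hot scanning loop tests only the
 * character class; hitting '\0' is rare, so only then do we pay for
 * the "real end of input, or just the sentinel?" check. */
struct input {
    const char *src;   /* whole source text, simulating a file */
    size_t      pos;   /* how much of src has been read so far */
    char       *buf;   /* sentinel-terminated working buffer */
    size_t      len;   /* valid bytes in buf (buf[len] == '\0') */
    size_t      cap;
    size_t      start; /* offset of the current token in buf */
    size_t      cur;   /* scan position in buf */
};

/* Append up to `chunk` more bytes from the source and re-plant the
 * sentinel.  Returns 0 when the source is exhausted. */
static int refill(struct input *in, size_t chunk) {
    size_t avail = strlen(in->src) - in->pos;
    size_t n = avail < chunk ? avail : chunk;
    if (n == 0) return 0;
    if (in->len + n + 1 > in->cap) {
        in->cap = (in->len + n + 1) * 2;
        in->buf = realloc(in->buf, in->cap);
    }
    memcpy(in->buf + in->len, in->src + in->pos, n);
    in->pos += n;
    in->len += n;
    in->buf[in->len] = '\0';   /* the sentinel */
    return 1;
}

/* Lex one identifier-like token; returns its length, or 0 at end of
 * input.  On hitting the sentinel mid-token, the buffer is extended
 * and scanning resumes where it left off. */
static size_t lex_ident(struct input *in, size_t chunk) {
    for (;;) {
        while (isalnum((unsigned char)in->buf[in->cur]))
            in->cur++;
        if (in->buf[in->cur] != '\0') break;   /* a real delimiter */
        if (!refill(in, chunk)) break;         /* true end of input */
        /* sentinel hit mid-token: buffer grown, keep scanning */
    }
    size_t n = in->cur - in->start;
    if (in->buf[in->cur] != '\0') in->cur++;   /* skip the delimiter */
    in->start = in->cur;
    return n;
}
```

A production lexer would normally slide the pending token to the front of a fixed buffer rather than growing it without bound, but the control flow (scan until the sentinel, then decide whether it was the real end of input) is the same.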
Alternatives #1 and #2 seem especially unattractive to me, and although #3
would work, it seems that there must be a better way. Has there been any
research into this area in the past? Is there any de facto standard way to
approach this issue?
Thanks,
-Chris
http://www.nondot.org/~sabre
[Flex uses a pair of large input buffers, 16K by default, with each token
having to be smaller than a buffer. For anything vaguely resembling a
programming language, I'd think a 16K token limit wouldn't be a problem.
If you really need to handle something longer, use start states and have
the lexer recognize it as several tokens that you then paste together in
your C code. -John]
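A minimal sketch of the pasting-together step John describes, with the scanner side simulated in plain C (in real flex the bounded pieces would be delivered by rules under a start state; all names here, such as `lex_long_token`, are invented):

```c
#include <stdlib.h>
#include <string.h>

/* Accumulator that the action code appends each bounded piece into. */
struct strbuf { char *s; size_t len, cap; };

static void sb_append(struct strbuf *sb, const char *p, size_t n) {
    if (sb->len + n + 1 > sb->cap) {
        sb->cap = (sb->len + n + 1) * 2;
        sb->s = realloc(sb->s, sb->cap);
    }
    memcpy(sb->s + sb->len, p, n);
    sb->len += n;
    sb->s[sb->len] = '\0';
}

/* Simulates a scanner whose per-match length is capped at `max_piece`:
 * the long token text arrives as several bounded pieces, as it would
 * from a flex start-state rule, and is pasted back together here.
 * Caller frees the returned string. */
static char *lex_long_token(const char *text, size_t max_piece) {
    struct strbuf sb = {0};
    size_t remaining = strlen(text);
    while (remaining > 0) {
        size_t n = remaining < max_piece ? remaining : max_piece;
        sb_append(&sb, text, n);   /* one bounded PIECE token */
        text += n;
        remaining -= n;
    }
    return sb.s;   /* the reassembled logical token */
}
```

The point is that the buffer-size limit now bounds only each piece, not the logical token, so the fixed dual-buffer scheme keeps working unchanged.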