Re: Buffered input for a lexer?

Zack Weinberg <zackw@panix.com>
24 Mar 2002 12:25:54 -0500

          From comp.compilers

Related articles
Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? zackw@panix.com (Zack Weinberg) (2002-03-24)
Buffered input for a lexer? cfc@world.std.com (Chris F Clark) (2002-03-24)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? sabre@nondot.org (Chris Lattner) (2002-03-24)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-03-25)
Re: Buffered input for a lexer? cfc@world.std.com (Chris F Clark) (2002-03-25)
Re: Buffered input for a lexer? clint@0lsen.net (2002-03-31)
[16 later articles]
| List of all articles for this month |

From: Zack Weinberg <zackw@panix.com>
Newsgroups: comp.compilers
Date: 24 Mar 2002 12:25:54 -0500
Organization: PANIX -- Public Access Networks Corp.
References: 02-03-162
Keywords: lex
Posted-Date: 24 Mar 2002 12:25:54 EST

Chris Lattner <sabre@nondot.org> writes:
>Are there any well known techniques that are useful to provide buffered input
>for a lexer, without imposing arbitrary restrictions on token size? As I see
>it, a traditional DFA based lexer has a few options to simplify things:
>
>1. Disallow a specific character (usually ascii 0) in the input. The buffer
> can be terminated with this character so that overruns are easily detected.
> If an overrun occurs, the buffer can be extended, and the lexer restarted.
> This has the disadvantage of preventing the null character from being
> lexed.
>3. Add an explicit check to the DFA traversal loop to check to see if the
> input buffer has been overrun. In this setup, you represent the input as
> a range of characters, allowing you to recognize all characters as input,
> but at a fairly significant runtime cost. I imagine that creative loop
> unrolling would help mitigate the cost.


You can combine these approaches to mitigate the problems with either.
When you encounter a sentinel character, check the input pointer against
the buffer; if it's at the end, absorb more input and restart, otherwise
proceed to process the character normally. This works well as long as
the sentinel character appears only rarely in normal input (for source
code, ascii 0 is a good choice).


>[Flex uses a pair of large input buffers, 16K by default, with each token
>having to be smaller than a buffer. For anything vaguely resembling a
>programming language, I'd think a 16K token limit wouldn't be a problem.
>If you really need to handle something longer, use start states and have
>the lexer recognize it as several tokens that you then paste together in
>your C code. -John]


Ridiculously huge tokens can show up in machine-generated code.
The example I'm familiar with is the C++ name mangling scheme used
by old (pre-3.0) GCC. This regularly produced symbols which overran
input buffers in various assemblers. I don't remember just how big
they got, but more than 16K would not surprise me.


zw
[16K identifiers? Really? Name mangling usually adds only one character
per argument to the original name. -John]



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.