Re: Tokenizer theory and practice

Hans-Peter Diettrich <DrDiettrich1@aol.com>
Sat, 17 May 2008 10:22:34 +0200

          From comp.compilers


Newsgroups: comp.compilers
Organization: Compilers Central
References: 08-05-050 08-05-066
Keywords: lex
Posted-Date: 17 May 2008 16:36:13 EDT

Dmitry A. Kazakov wrote:


>> Binary data typically comes in fixed-length chunks, where the parser has
>> to provide the length and encoding of the next token. This procedure is
>> quite different from lexing text, which is why I prefer the term
>> "tokenizer" in this context; bottom-up (LR) parsers seem not to be
>> very usable in this case. A lexer can at best skip over binary data,
>> which is often sufficient, but then it must not be confused by embedded
>> NUL "characters" in the input.
>
> I don't really understand the problem; maybe you can elaborate on
> it. Why is NUL a problem, and why do tokens need to be "raw text"?


These are problems of many (C-based) lexer generators. A conversion
into anything but a built-in data type requires semantic code, which
makes a formal grammar usable only with the language of that code. Few
generators, such as MetaS or TextTransformer, come with a built-in
language for doing such conversions.


> When I do similar stuff, I do it in a way that the parser returned
> typed objects rather than copies of the source. The whole idea to
> copy the source is bogus, IMO.


Indeed, textual copies are of little use. Can you suggest a
descriptive formalism for the objects returned by a lexer?


DoDi

