Re: Tokenizer theory and practice

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Fri, 16 May 2008 18:23:45 +0200

From comp.compilers

From:	"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups:	comp.compilers
Date:	Fri, 16 May 2008 18:23:45 +0200
Organization:	cbb software GmbH
References:	08-05-050
Keywords:	lex
Posted-Date:	17 May 2008 01:05:23 EDT

On Tue, 13 May 2008 14:51:21 +0200, Hans-Peter Diettrich wrote:

> I'm currently thinking about lexing/tokenizing binary data and Unicode,
> in e.g. PDF files, and possible optimizations in LL/PEG parsers.
>
> Binary data typically comes in fixed length chunks, where the parser has
> to provide the length and encoding of the next token. This procedure is
> quite different from lexing text, that's why I prefer the term
> "tokenizer" in this context, and bottom-up (LR) parsers seem not to be
> very usable in this case. A lexer at best can skip over binary data,
> what often is sufficient, but then it should not be confused by embedded
> null "characters" in the input.

I don't really understand the problem; maybe you can elaborate on
it. Why is NUL a problem, and why do tokens need to be "raw text"? When
I do similar stuff, I do it in such a way that the parser returns typed
objects rather than copies of the source. The whole idea of copying the
source is bogus, IMO.
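
A minimal sketch of that approach in Python, assuming a made-up toy
format: decimal integers separated by spaces, and binary chunks written
as '#<length>:' followed by <length> raw bytes. The format and the names
IntToken, BlobToken, and tokenize are hypothetical, chosen only for
illustration. Integers come back already converted, and binary chunks
come back as views into the input buffer, so an embedded NUL is just
another byte.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass(frozen=True)
class IntToken:
    value: int          # already converted; no source text is kept


@dataclass(frozen=True)
class BlobToken:
    data: memoryview    # view into the input buffer, not a copy


Token = Union[IntToken, BlobToken]


def tokenize(buf: bytes) -> Iterator[Token]:
    """Yield typed tokens from the hypothetical toy format described above."""
    view = memoryview(buf)
    pos, end = 0, len(buf)
    while pos < end:
        byte = buf[pos]
        if byte == 0x20:                      # space: skip it
            pos += 1
        elif byte == 0x23:                    # '#': length-prefixed chunk
            colon = buf.index(b":", pos)      # raises ValueError if missing
            length = int(buf[pos + 1:colon])
            start = colon + 1
            if start + length > end:
                raise ValueError("truncated binary chunk")
            # The chunk is never scanned, so NUL bytes inside it are harmless.
            yield BlobToken(view[start:start + length])
            pos = start + length
        elif 0x30 <= byte <= 0x39:            # ASCII digit: integer literal
            stop = pos
            while stop < end and 0x30 <= buf[stop] <= 0x39:
                stop += 1
            yield IntToken(int(buf[pos:stop]))
            pos = stop
        else:
            raise ValueError(f"unexpected byte {byte:#x} at offset {pos}")


tokens = list(tokenize(b"12 #3:a\x00b 7"))
```

Here the tokenizer itself reads the length, which is the "parser provides
the length of the next token" situation in miniature; the consumer only
ever sees an integer or a byte range, never a copied string.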

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
