Re: Tokenizer theory and practice

"cr88192" <cr88192@hotmail.com>
Sun, 18 May 2008 09:06:04 +1000

From comp.compilers

From:	"cr88192" <cr88192@hotmail.com>
Newsgroups:	comp.compilers
Date:	Sun, 18 May 2008 09:06:04 +1000
Organization:	Saipan Datacom
References:	08-05-050 08-05-066 08-05-069
Keywords:	lex
Posted-Date:	18 May 2008 01:05:38 EDT

"Hans-Peter Diettrich" <DrDiettrich1@aol.com> wrote in message
> Dmitry A. Kazakov schrieb:
>
>>> Binary data typically comes in fixed length chunks, where the parser has
>>> to provide the length and encoding of the next token. This procedure is
>>> quite different from lexing text, that's why I prefer the term
>>> "tokenizer" in this context, and bottom-up (LR) parsers seem not to be
>>> very usable in this case. A lexer at best can skip over binary data,
>>> what often is sufficient, but then it should not be confused by embedded
>>> null "characters" in the input.
>>
>> I don't really understand the problem, maybe, you can elaborate
>> it. Why NUL is a problem and why tokens need to be "raw text."
>
> These are problems of many (C based) lexer generators. A conversion,
> into anything but a built-in data type, requires semantic code, which
> makes a formal grammar usable only with the language of that code. Few
> generators come with a built-in language for doing such conversions,
> like MetaS or TextTransformer.
>
>> When I do similar stuff, I do it in a way that the parser returned
>> typed objects rather than copies of the source. The whole idea to
>> copy the source is bogus, IMO.
>
> Indeed, textual copies are of little use. Can you suggest a
> descriptive formalism for the objects, returned by a lexer?
>

not sure if this is useful, but in recent incarnations of my compiler, the
parser and many other subsystems use XML as their native representation.

sadly, though XML is acceptable for textual inputs, it is less sensible for
binary formats.

another possible notation is a variation of S-Expressions, which tends to be
a lot more compact and concise than XML (both notationally and in terms of
memory usage).
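
As a rough illustration of the size difference, here is a minimal sketch that
encodes one hypothetical token (an integer literal 42 on line 3) both ways and
decodes each into the same tuple. The element name, attribute names, and the
`parse_flat_sexp` helper are assumptions made up for this example, not
anything from the compiler described above.

```python
import xml.etree.ElementTree as ET

# The same hypothetical token in both notations (all names are illustrative).
xml_text = '<token kind="int" line="3">42</token>'
sexp_text = '(int 42 3)'

# XML: fields are found by name.
elem = ET.fromstring(xml_text)
xml_token = (elem.get("kind"), int(elem.text), int(elem.get("line")))

def parse_flat_sexp(text):
    """Parse a flat (non-nested) s-expression; integer atoms become int."""
    atoms = text.strip()[1:-1].split()
    return tuple(int(a) if a.lstrip("-").isdigit() else a for a in atoms)

# S-expression: fields are found by position.
sexp_token = parse_flat_sexp(sexp_text)

print(len(xml_text), len(sexp_text))
```

Both decode to the same data; the s-exp form is shorter mainly because field
names are implied by position rather than spelled out.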

I have also used them at times within my compiler. the main disadvantage of
s-exps, however, is that they tend to be far less flexible and annotatable
than XML (this has often been an issue in my case).

with XML-based designs, one can often add new tags and attributes as needed
without changing existing code. s-exp based notations tend to lack this
ability: the structure usually has to be designed explicitly up front, and
even simple design changes or feature additions require structural
alteration (aka: lots of code modification...). sadly, it is often more
painful to deal with this issue with s-exps than with raw structs.
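
A small sketch of that difference, under assumed names: a consumer that looks
up an attribute by name keeps working when a new `line` attribute appears,
while a consumer that unpacks a positional node breaks as soon as a field is
added. The node shapes and function names here are hypothetical.

```python
import xml.etree.ElementTree as ET

def kind_from_xml(elem):
    # Looks up only the attribute it needs; unknown attributes are ignored.
    return elem.get("kind")

def kind_from_positional(node):
    # Positional consumer: assumes the node is exactly (kind, value).
    kind, value = node
    return kind

old_xml = ET.fromstring('<token kind="int">42</token>')
new_xml = ET.fromstring('<token kind="int" line="3">42</token>')  # new attribute
old_pos = ("int", 42)
new_pos = ("int", 42, 3)                                          # new field

xml_results = [kind_from_xml(old_xml), kind_from_xml(new_xml)]
pos_old_result = kind_from_positional(old_pos)

try:
    kind_from_positional(new_pos)
    positional_broke = False
except ValueError:
    # Tuple unpacking fails once the node has grown a third field.
    positional_broke = True
```

The positional consumer could be written more defensively (indexing instead
of unpacking), but every consumer then has to agree on what each index means.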

a kind of hybrid system could also be possible, combining binary-encoded
datums (such as integer values) with a more open-ended structure (tag based
rather than positional).
this is possible with s-exps, although it is not common practice (it is
much more common to use them with a good old LISP-like positional
structure...).
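
One way such a hybrid might look, as a sketch only: each node is a head symbol
plus named fields, and field values are native datums (ints, raw bytes) rather
than text. The `node` and `field` helpers and the field names are assumptions
for illustration.

```python
def node(tag, **fields):
    """Build a tagged node: a head symbol plus named fields, no fixed positions."""
    return (tag, fields)

def field(n, name, default=None):
    """Fetch a field by name; missing fields fall back to a default."""
    return n[1].get(name, default)

# Values stay binary: an int, and the raw little-endian bytes it came from
# (embedded NUL bytes are harmless here, since nothing is treated as C text).
tok = node("int", value=42, raw=b"\x2a\x00\x00\x00")

# A later producer adds a 'line' annotation; readers of 'value' are unaffected.
tok_annotated = node("int", value=42, raw=b"\x2a\x00\x00\x00", line=3)

decoded = int.from_bytes(field(tok, "raw"), "little")
```

In LISP terms this is roughly a head symbol followed by a property list,
instead of a fixed argument order.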

also possible would be, rather than working with s-exps directly, to make
use of a kind of schema, such that we can change the schema and with it the
structure, without having to change all of the code (but then we may almost
as well be using a binary representation...).
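
A minimal sketch of the schema idea, with made-up schema and node layouts:
nodes stay positional, but code never hard-codes an index. It asks a schema
table where a named field lives, so reordering or adding fields means editing
the table rather than every consumer.

```python
# Hypothetical schemas: tag -> field names, in positional order after the tag.
SCHEMA_V1 = {"int": ("value",)}
SCHEMA_V2 = {"int": ("line", "value")}   # 'line' added in front of 'value'

def get(node, name, schema):
    """Look up a named field in a positional node via the schema."""
    tag, *args = node
    return args[schema[tag].index(name)]

node_v1 = ("int", 42)
node_v2 = ("int", 3, 42)

# The same consumer call works for both layouts; only the schema differs.
value_v1 = get(node_v1, "value", SCHEMA_V1)
value_v2 = get(node_v2, "value", SCHEMA_V2)
line_v2 = get(node_v2, "line", SCHEMA_V2)
```

The lookup adds an indirection on every access, which is part of why this
starts to resemble a described binary record format.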

or such...
