Re: Tokenizer theory and practice

"cr88192" <>
Tue, 20 May 2008 19:54:45 +1000

          From comp.compilers

Related articles
[4 earlier articles]
Re: Tokenizer theory and practice (Hans Aberg) (2008-05-17)
Re: Tokenizer theory and practice (Hans-Peter Diettrich) (2008-05-17)
Re: Tokenizer theory and practice (cr88192) (2008-05-18)
Re: Tokenizer theory and practice (cr88192) (2008-05-18)
Re: Tokenizer theory and practice (Dmitry A. Kazakov) (2008-05-18)
Re: Tokenizer theory and practice (Hans-Peter Diettrich) (2008-05-18)
Re: Tokenizer theory and practice (cr88192) (2008-05-20)
| List of all articles for this month |

From: "cr88192" <>
Newsgroups: comp.compilers
Date: Tue, 20 May 2008 19:54:45 +1000
Organization: Saipan Datacom
References: 08-05-050 08-05-065 08-05-067 08-05-071 08-05-074
Keywords: lex
Posted-Date: 20 May 2008 11:21:22 EDT

"Hans-Peter Diettrich" <> wrote in message
> cr88192 schrieb:
>>> I hope to separate the data structures, as syntactical elements, from
>>> attributes etc. as kind of semantic code. In the best case it should be
>>> possible to derive both an serializer and an de-serializer from a given
>>> formal description.
>> ok.
>> so, one first describes all of the pieces, and then later how they are
>> assembled?...
>> ok, but it may be difficult to pull off...
> You are right. Looking at some popular binary file formats, the
> construction (writing) must be done sequentially, with possible
> patches of offsets in preceding tables. When an offset table is
> written last, it must be read in first, from the end of the file. I'm
> not sure whether a formal description of such procedures is possible,
> for use in both reading and writing such an file. An according grammar
> may become context sensitive, what classifies the problem as very
> interesting from the scientific point of view, but it's unlikely that
> it will result in a usable tool.

yes, this is one problem.

with format writing, the problem is made a little worse, for example, in the
case of formats which can't easily be written sequentially, or have a
non-trivial representation for storage management.

this would seem to be the case, for example, for filesystem-like container

>> as noted, I also said that there would be conditionals, that or we could
>> also support BNF-based descriptions.
>> u64 uvli() {
>> byte v;
>> return((v&0x80)?(((v&0x7f)<<7)|uvli()):(v&0x7f));
>> };
>> u64 svli() {
>> uvli v;
>> return((v&1)?(-((v+1)>>1)):(v>>1));
>> }
>> of course, this does not make it clear how to encode these types...
> A common example would be UTF-8 encoding/decoding, which is easily
> described in a procedural way, and possibly also in a grammar, but
> deriving the code from such a grammar seems to exceed my capabilities.
> <sigh>

yeah. likely any values of the sort would need builtin types, leading to
incompleteness issues (the formalism being unable to even represent many of
its basic constructs).

or, at least this would be the abstract case. including a C-like
mini-language for writing serializers and deserializers would be helpful,
but would reduce this from the level of a formalism to a special purpose
programming language.

>> your intent is for reverse engineering or something?...
>> that is what this sounds like to me at least...
> That's the background, how I came to parsers at all ;-)

I came from the direction of writing load/save code, and later interpreters.

long ago, some of my more-complex formats had a vaguely POV-ray style
syntax, and I later write a scheme interpreter. my skills gradually
improved, and my ability to write parsers somewhat improved.

2001: POV-ray like syntax (fileformats);
2002: Scheme interpreter
2003: well, graduated HS, worth note as a reference point, Scheme
interpreter inflated into an unmaintainable horror...
2004: new language, JS-like syntax (remaining Scheme influences)
2006: new language, C/JS hybrid style (experimentally used JIT compilation)
2007: harsh jump, compiler for C proper
2008: gradually rewriting the C compiler piece by piece it seems...

note that jumping from a JavaScript-like language to C is far more work than
would seem implied by the syntex (internally, C is a beast which is severely
different WRT the internals).

>> usually for more complex or bulky formats, I write special tools...
> After considering all the topics, mentioned in this thread, I better
> leave the theory to other people, too. As you stated before:
> >>
> more likely, one can end up more with what would amount to a
> format-design tool, than something that can actually reliably parse
> existing formats.
> <<


sadly, I am not even really sure if a general solution is possible, albeit
clean design of fileformats is a worthwhile goal, I am not even so sure this
is possible (AFAIK ASN.1 was one attempt at this problem, but it is far from
a universal solution IMO).

possibly relevant (I have not really looked much into it), a format known as

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.