Re: Tokenizer theory and practice

Hans-Peter Diettrich <>
Sat, 17 May 2008 00:08:05 +0200

          From comp.compilers

Related articles
Tokenizer theory and practice (Hans-Peter Diettrich) (2008-05-13)
Re: Tokenizer theory and practice (cr88192) (2008-05-16)
Re: Tokenizer theory and practice (Dmitry A. Kazakov) (2008-05-16)
Re: Tokenizer theory and practice (Hans-Peter Diettrich) (2008-05-17)
Re: Tokenizer theory and practice (Hans Aberg) (2008-05-17)
Re: Tokenizer theory and practice (Hans-Peter Diettrich) (2008-05-17)
Re: Tokenizer theory and practice (cr88192) (2008-05-18)
Re: Tokenizer theory and practice (cr88192) (2008-05-18)
Re: Tokenizer theory and practice (Dmitry A. Kazakov) (2008-05-18)
Re: Tokenizer theory and practice (Hans-Peter Diettrich) (2008-05-18)
[1 later articles]
| List of all articles for this month |

From: Hans-Peter Diettrich <>
Newsgroups: comp.compilers
Date: Sat, 17 May 2008 00:08:05 +0200
Organization: Compilers Central
References: 08-05-050 08-05-065
Keywords: lex
Posted-Date: 17 May 2008 01:06:13 EDT

cr88192 schrieb:

>> Back to binary files, including PDF, ZIP and other file formats with
>> structured content, a multi-level parser seems to be required in most
>> cases, which can separate the file into distinct parts, which can (often
>> must) be processed in random order. I vaguely remember an approach to
>> formalize the description and decoding of binary files - can somebody
>> give me more precise hints in that direction?
> IMO, parsers of this sort (especially text-intended parser
> generators), are not a good approach to parsing binary data, nor do I
> believe are the formalisms often employed, which are themselves
> intended for textual structures.

That's why I mentioned multi-level parsers, where the top level handles
the file structure. In PDF files the file structure is textual, with
embedded binary streams. In other files the structure can be binary,
with embedded text snippets.

> Rather, I suggest looking at how such formats are actually structured,
> and how they are commonly represented within format specs, and try to
> derive from this a general notation, formalism, and possible
> implementation.

It would be nice to start with some already existing formalism.

> an issue I think:
> there are a wide variety of possible ways in which data can be, and often
> is, structured:
> uniform-sized blocks (common in filesystems, formats like GCF, B-Trees,
> .....);
> structures and offsets (ELF, COFF, ...);
> TLV structures (RIFF, PNG, ASN.1, ...);
> bitstreams (Deflate, ...);
> bytestreams (Java class, WBXML, many simpler LZ77 variants, ...);
> escaped tags (JPEG, MPEG, FLAC, ZIP, ...);
> .....
> So, rather than having a single formalism (such as LL or LR), more
> likely, one will end up with a kind of mini programming language just
> to specify how the pieces fit together.

I hope to separate the data structures, as syntactical elements, from
attributes etc. as kind of semantic code. In the best case it should be
possible to derive both an serializer and an de-serializer from a given
formal description.

> presumed features:
> structs (explicit on-disk layout), conditionals, support for variable length
> fields and members, ability to deal with file offsets and similar ideas, ...
> even then, there is likely to be a whole lot more that can't be formalized,
> than that can be.

I don't expect that a single formalism can cover really all weird file
formats. But it would be nice to have more than a description of the
basic "tokens", which cannot be much more than numbers or strings - in a
lot of different formats, of course. I don't want to decode bitstreams
or encrypted data, that would be a task for an automaton in an sub-parser.

> more likely, one can end up more with what would amount to a format-design
> tool, than something that can actually reliably parse existing formats.
> even then, it is usually not parsing the format that is the work, rather it
> is processing the data that is usually a lot more work.

The task of the parser is finished, when a file is converted into a tree
structure, possibly with additional tables for file offsets, tag names
and more.

> for example (just pulling something out of nowhere):
> struct PACK_Entry {
> char name[56];
> u32le offset;
> u32le size;
> byte@offset data[size];
> };
> struct PACK_Header {
> FOURCC magic="PACK";
> u32le offset;
> u32le ents;
> PACK_Entry@offset dir[ents];
> };

Such a description indeed will cover the most common data structures, of
a fixed or counted length. We'll have to add (zero) terminated items
(strings and other vectors), and multi-level descriptions for offsets
(relative to some file section, table lookup...). A "whitespace"
equivalent will be required as well, for the description of alignment
and padding.

> now, personally, I don't usually use such formalisms or tools, since
> personally I have usually found it to be most effective simply to
> write the parser myself, and if likely it were beyond my abilities to
> write a parser, it would also likely be somewhat beyond the abilities
> of these formalisms as well...

I've decoded many file formats myself, and my most useful tool was a
hex viewer with formatting capabilities. It could display files as
tables of fixed-size records, where the record formats were described
by strings, with 'c' for a character, 'l' for a longword etc. The
table positions and formats were written to an file, ready for later
reuse. What I'm still missing is the next step, with more
sophisticated types, and an ability to convert the fixed table offsets
and counts into computed elements, as templates for viewing files of
the same type. It may be more promising to extend that old tool with
the capabilities, mentioned above, instead of formalizing all aspects
of automated parsing of non-textual files. BTW, IDA is a fine example
for such a tool...


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.