Re: Writing a disassembler ?

Vimal <>
Sat, 11 Oct 2008 17:53:13 +0530

          From comp.compilers

Related articles
Writing a disassembler ? (So and so) (2008-10-10)
Re: Writing a disassembler ? (Vimal) (2008-10-11)
Re: Writing a disassembler ? (Jeff Kenton) (2008-10-11)
Re: Writing a disassembler ? (Hans-Peter Diettrich) (2008-10-11)
Re: Writing a disassembler ? (Stephen Horne) (2008-10-11)
Re: Writing a disassembler ? (2008-10-11)
Re: Writing a disassembler ? (glen herrmannsfeldt) (2008-10-12)
Re: Writing a disassembler ? (So and so) (2008-10-16)
[2 later articles]
| List of all articles for this month |

From: Vimal <>
Newsgroups: comp.compilers
Date: Sat, 11 Oct 2008 17:53:13 +0530
Organization: Compilers Central
References: 08-10-011
Keywords: disassemble
Posted-Date: 11 Oct 2008 08:55:33 EDT


2008/10/10 So and so <>:
> I've set myself a goal to write a disassembler, so far I've managed to
> understand most of Intel's documentation (for now it's only going to
> disassemble x86 code) and I'm about to start writing the basic
> skeleton.

Nice. I have never written any disassembler before, my views are just
from a programmer's point of view.

> The algorithm I had in mind was :
> Read N bytes (or the whole files, I'm not sure about it yet)

You could have a single interface to read a character (read_char())
and let that function take care of buffering up IO and such. It could
be something like:

    if buffer is empty:
        read bufsize amount into the buffer from the file
        return the character at the current position, position++

> if this byte is part of a prefix instruction, parse it else continue
> to opcode, and so on and so on .

Instruction prefixes can be longer than 1 byte. They can be a max of 4
bytes (I think). So, you would have something like a state machine

> Though, I ran into several "problems" in mind:
> 1. Which data structure should store the values I read ? A hash table
> or a Tree ? Or a combination of both ? (trie) Should the tree be
> balanced ? If not, will it cost in efficiency or whether balancing it
> will cost in efficiency ?

A multi-level hash combining different techniques would be the best
bet. For this, it is important to see the distribution of instruction
length across all instructions. Common instructions tend to be smaller
in length, and you would want very fast decoding for this.
Conceptually it's a tree, but the way you implement the tree matters.

For example, you could hash into the first two bytes of a prefix (if
at all it's a prefix):

prefix[1 << 16] = array of some datastructures containing pointers to
further processing.

Try to keep expensive operations like trie-searching at the lowest level.

A good example of multi-level hashing is Linux Kernel Timers, where
timers are hashed on their time of execution in a nice way.

> 5. Should the disassembler itself be multi threaded or one program
> which does everything step-by-step and if it will be multi threaded -
> how can I handle or parse different instructions ? or handle
> synchronization ?

Executables have sections built into them. You could parallelize
disassembling each section! In fact, this is a good example of a data
parallelism where you're exploiting the structure of your data allows
it to be parallelized easily.

In fact, as soon as you an instruction boundary, (not just sections,
but also procedures, or in fixed-length intstruction archs, which
unfortunately Intel isn't) you can independently parse from that

Synchronization wouldn't naturally arise. But, you would want your
other disassembling threads to finish before you resolve names, unless
you're content with:
    CALL 0xdeadbeef

instead of:
    CALL __function_name

You can do this intelligently, but a 2-pass disassembling would be the
easiest solution. First pass to disassemble code, second to patch up

Hope that helps,

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.