Re: Writing a disassembler ?

Stephen Horne <sh006d3592@blueyonder.co.uk>
Sat, 11 Oct 2008 20:29:55 +0100

          From comp.compilers


From: Stephen Horne <sh006d3592@blueyonder.co.uk>
Newsgroups: comp.compilers
Date: Sat, 11 Oct 2008 20:29:55 +0100
Organization: virginmedia.com
References: 08-10-011
Keywords: disassemble
Posted-Date: 12 Oct 2008 08:18:34 EDT

On Fri, 10 Oct 2008 16:57:31 +0200, "So and so" <lightfault@gmail.com>
wrote:


>1. Which data structure should store the values I read ? A hash table
>or a Tree ? Or a combination of both ? (trie) Should the tree be
>balanced ? If not, will it cost in efficiency or whether balancing it
>will cost in efficiency ?


These are associative data structures, associating keys with data. You
might need something like this for referenced addresses, but the set
of _potentially_ referenced addresses will be very densely packed -
I'd probably use a bitvector (or else some kind of mapping that leads
to single-integer mini-bitvectors for 32- or 64-byte address ranges).
Each flag indicates that a valid instruction starts at that address.
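
A minimal sketch of the bitvector idea, in Python for brevity. The
class name StartFlags and its method names are my own choices for
illustration, not anything from an existing library. One bit is kept
per byte address, packed eight to a byte in a bytearray.

```python
class StartFlags:
    """One bit per byte address: set if a valid instruction starts there."""

    def __init__(self, size):
        self.size = size
        # Eight flags per byte, rounded up.
        self.bits = bytearray((size + 7) // 8)

    def mark(self, addr):
        """Record that an instruction starts at addr."""
        self.bits[addr >> 3] |= 1 << (addr & 7)

    def is_start(self, addr):
        """Return True if addr was marked as an instruction start."""
        return bool(self.bits[addr >> 3] & (1 << (addr & 7)))

    def starts(self):
        """Return all marked addresses in ascending order."""
        return [a for a in range(self.size) if self.is_start(a)]


flags = StartFlags(64)
for addr in (0, 3, 5, 17):
    flags.mark(addr)
```

For a 1 MB binary this costs 128 KB, with no hashing or tree
balancing to think about.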


You don't need to associate label-names with addresses unless you
allow the user to rename them - just use a prefix followed by the hex
address, so you don't need a lookup at all.
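
As a rough illustration, the label can be computed from the address
and the address recovered from the label, so nothing needs storing.
The "loc_" prefix and the function names are assumptions for this
sketch only.

```python
def label_for(addr, prefix="loc_", width=8):
    """Build a label name from an address: prefix plus zero-padded hex."""
    return f"{prefix}{addr:0{width}X}"


def addr_of(label, prefix="loc_"):
    """Recover the address from a label produced by label_for."""
    return int(label[len(prefix):], 16)
```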


For parsing, I don't think you need a data structure at all. Just use
a switch/case statement to interpret each component, selecting by
offset-from-start and masking as needed. Nest the switch statements to
form a simple decision tree.
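
Here is a sketch of such a decision tree. The instruction encoding is
made up purely for illustration (it is not any real processor): the
top two bits of the first byte select a group, and the remaining bits
are masked out as register numbers or sub-opcodes. Python's if chains
stand in for the nested switch statements you would write in C.

```python
def decode(code, pos):
    """Decode one instruction of a hypothetical ISA at code[pos].

    Returns (text, length_in_bytes).
    """
    op = code[pos]
    group = op >> 6                      # outer "switch": top two bits

    if group == 0:                       # no-operand instructions
        low = op & 0x3F                  # inner "switch": low six bits
        if low == 0x00:
            return "NOP", 1
        if low == 0x01:
            return "RET", 1
        return f"DB 0x{op:02X}", 1       # unrecognised: emit as data

    if group == 1:                       # register-to-register move
        dst = (op >> 3) & 7
        src = op & 7
        return f"MOV r{dst}, r{src}", 1

    # Groups 2 and 3 take a one-byte operand after the opcode.
    if pos + 1 >= len(code):
        return f"DB 0x{op:02X}", 1       # truncated at end of input
    operand = code[pos + 1]

    if group == 2:                       # load immediate into register
        return f"LDI r{op & 7}, 0x{operand:02X}", 2

    mnemonic = "JZ" if op & 1 else "JMP" # group 3: jumps
    return f"{mnemonic} 0x{operand:02X}", 2
```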


In fact, even that's probably the hard way. The easy way is to
download a good regular-grammar-parsing code generator such as Ragel
(http://www.complang.org/ragel/). This tool is easily capable of
interpreting binary files. For variable parts of each opcode (register
identifiers etc) just make sure all cases are covered, and pick out
that information in your rule-recognised actions.


Actually covering all those variable-part cases could be fiddly. As
far as I recall, Ragel has no support for deriving sets of recognised
bytes by bitwise or arithmetic manipulation of codes, but you could
always use a scripting language to generate parts of the Ragel spec.
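
As a sketch of that scripting approach: the script below enumerates
every byte value matching a mask/value pair and prints it as an
alternation of hex literals and ranges. The function names are my own,
and the output format is only my assumption of what you would paste
into a Ragel definition, so check it (and the alphabet type needed for
bytes above 0x7f) against the Ragel manual.

```python
def matching_bytes(mask, value):
    """All byte values b for which (b & mask) == value."""
    return [b for b in range(256) if (b & mask) == value]


def as_ranges(values):
    """Collapse an ascending list of byte values into (lo, hi) runs."""
    runs = []
    for v in values:
        if runs and v == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], v)   # extend the current run
        else:
            runs.append((v, v))           # start a new run
    return runs


def ragel_alternation(mask, value):
    """Render the matching bytes as text such as "( 0x40..0x47 )"."""
    parts = []
    for lo, hi in as_ranges(matching_bytes(mask, value)):
        if lo == hi:
            parts.append(f"0x{lo:02x}")
        else:
            parts.append(f"0x{lo:02x}..0x{hi:02x}")
    return "( " + " | ".join(parts) + " )"
```

For example, ragel_alternation(0xF8, 0x40) covers an opcode whose low
three bits are a register number.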


You might do well to read up on regular grammar handling techniques.
The standard reference is "Introduction to Automata Theory, Languages,
and Computation" by Hopcroft, Motwani and Ullman - which I can't
afford ATM and have never read. But there's plenty to read elsewhere
if you look. Try the "Algorithmic Forays" section from the gamedev.net
site (http://www.gamedev.net/reference/list.asp?categoryid=25).


However, full regular grammar handling would be overkill. Since you
need to identify the instructions, your state machines will have a
tree structure - ie this is another way to design and implement the
decision trees. Ragel may even generate code very similar to what you
would have written, depending on the options you specify.


It's also worth looking at the pattern matching available in
functional languages such as Haskell and Objective Caml.


As far as the data structure is concerned, keep it simple, use a
standard library if possible - and consider that you may not need a
data structure at all since directly interpreting the binary on demand
is very fast.


>5. Should the disassembler itself be multi threaded or one program
>which does everything step-by-step and if it will be multi threaded -
>how can I handle or parse different instructions ? or handle
>synchronization ?


There was once a free version of IDA Pro. It's pretty out of date
though - more recent versions are probably available as time-limited
demos.


I strongly recommend you play with it a little bit.


IDA Pro does interactive disassembling. It loads a file and displays
it immediately, but also starts disassembling in the background. It
identifies instruction start points automatically by following the
flow of execution, including both following and skipping over
conditional branches for instance. But it also allows the user to
override its decisions, rename labels and so on.
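
I don't know how IDA Pro implements this internally, but the general
idea of following the flow of execution can be sketched with a simple
worklist. The four-opcode instruction set here is invented for the
example: 0x00 is NOP, 0x01 is RET, 0x10 is JMP and 0x11 is JZ, the
last two followed by a one-byte absolute target.

```python
def find_instruction_starts(code, entry=0):
    """Follow control flow from entry and return the sorted addresses
    at which instructions start (hypothetical four-opcode ISA)."""
    starts = set()
    work = [entry]                        # addresses still to explore
    while work:
        pc = work.pop()
        while 0 <= pc < len(code) and pc not in starts:
            op = code[pc]
            if op == 0x00:                # NOP: fall through
                starts.add(pc)
                pc += 1
            elif op == 0x01:              # RET: this path ends
                starts.add(pc)
                break
            elif op in (0x10, 0x11) and pc + 1 < len(code):
                starts.add(pc)
                work.append(code[pc + 1]) # follow the branch later
                if op == 0x10:            # JMP: no fall-through
                    break
                pc += 2                   # JZ: also skip over it
            else:
                break                     # unknown byte: stop this path
    return sorted(starts)


program = bytes([0x11, 0x05,   # 0: JZ 5
                 0x00,         # 2: NOP
                 0x10, 0x07,   # 3: JMP 7
                 0x00,         # 5: NOP
                 0x01,         # 6: RET
                 0x01,         # 7: RET
                 0xFF])        # 8: data, never reached
```

Calling find_instruction_starts(program) visits addresses 0 to 7 but
never touches the data byte at 8, which a plain linear sweep would
have tried to decode.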


This is the *only* kind of disassembler that I can imagine using
explicit multi-threading. A simple one-pass disassembler is exactly
that - it dumps out the disassembly of each instruction as soon as it
recognises it. A multi-pass disassembler could do better at
recognising code-blocks from backward jumps, spotting how data blocks
are referenced and therefore deciding whether the data is best
represented as byte, word, float, string or whatever, and other
tweaks. Either way, I doubt there's any benefit from multi-threading.
After all, even with multi-pass approaches, each pass needs all the
data from the previous pass, so later passes cannot start until
earlier passes have been completed.
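
A toy two-pass example shows the dependency. It uses an invented
instruction set (0x00 NOP, 0x01 RET, 0x10 JMP, 0x11 JZ, the jumps
followed by a one-byte absolute target), and all the names are
hypothetical. Pass 1 collects every jump target; pass 2 cannot decide
whether to print a label at an address until pass 1 has seen the whole
input, because the jump may come later than its target.

```python
NAMES = {0x00: "NOP", 0x01: "RET", 0x10: "JMP", 0x11: "JZ"}
JUMPS = (0x10, 0x11)


def instructions(code):
    """Linear sweep yielding (address, opcode, operand or None)."""
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op in JUMPS and pc + 1 < len(code):
            yield pc, op, code[pc + 1]
            pc += 2
        else:
            yield pc, op, None
            pc += 1


def collect_targets(code):
    """Pass 1: the set of all addresses referenced by jumps."""
    return {arg for _, _, arg in instructions(code) if arg is not None}


def emit(code, targets):
    """Pass 2: produce listing lines, with labels at jump targets."""
    lines = []
    for pc, op, arg in instructions(code):
        if pc in targets:
            lines.append(f"loc_{pc:02X}:")
        if arg is not None:
            lines.append(f"    {NAMES[op]} loc_{arg:02X}")
        elif op in NAMES and op not in JUMPS:
            lines.append(f"    {NAMES[op]}")
        else:
            lines.append(f"    DB 0x{op:02X}")   # unknown or truncated
    return lines


code = bytes([0x00, 0x11, 0x00, 0x01])   # NOP; JZ 0 (backward); RET
listing = emit(code, collect_targets(code))
```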


However, a simpler disassembler may be able to exploit the implicit
multi-threading that some compilers can organise for you. I tend to
think of it as a bit like what processors do to run instructions
out-of-order, but on a much larger scale.


If you really need the best possible optimisation for this, I'd
suggest using Glasgow Haskell. It may take some getting used to for
imperative programmers, but the deep optimisation and so on can pay
off big.


That said, the real bottleneck will almost certainly be the output of
disassembled code anyway. The input binaries are far more compact, and
the parsing should be very fast no matter how naively you implement
it. About the worst thing you could do is have overcomplex data
structures that thrash the virtual memory when disassembling
multi-megabyte binaries - other than that you should have no problem.


I probably have some old disassembler source code on reference
CD-ROMs from the days of DOS and 16 bit Windows. If you want, I can dig
out one or two for you to look at.


