Re: Writing a disassembler ?

"Bartc" <bc@freeuk.com>
Thu, 16 Oct 2008 10:15:03 GMT

          From comp.compilers


From: "Bartc" <bc@freeuk.com>
Newsgroups: comp.compilers
Date: Thu, 16 Oct 2008 10:15:03 GMT
Organization: Compilers Central
References: 08-10-011
Keywords: disassemble
Posted-Date: 16 Oct 2008 21:01:54 EDT

"So and so" <lightfault@gmail.com> wrote in message
>
> I've set myself a goal to write a disassembler, so far I've managed to
> understand most of Intel's documentation (for now it's only going to
> disassemble x86 code)


Only x86? That's probably the most complex.


The first docs you need will be a listing of the 256 possible values of the
initial instruction byte. Once you've decoded this, you will know the
instruction or instruction group, and can go from there. Intel docs tend to
be complex, but there are plenty of these lists around from other sources.
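As a rough illustration of the "256 possible values" idea, here is a minimal Python sketch of a table indexed by the first byte. The names FIRST_BYTE and classify are made up for this example, the register names assume 16-bit mode, and only a handful of well-known opcodes are filled in; a real table would cover all 256 entries from an opcode map.

```python
# Sketch only: a 256-entry table keyed on the first instruction byte.
# FIRST_BYTE and classify are illustrative names, not from any library.
UNKNOWN = ("unknown", None)
FIRST_BYTE = [UNKNOWN] * 256

# A few single-byte instructions.
FIRST_BYTE[0x90] = ("insn", "nop")
FIRST_BYTE[0xC3] = ("insn", "ret")
FIRST_BYTE[0xCC] = ("insn", "int3")
FIRST_BYTE[0xF4] = ("insn", "hlt")

# 0x0F is an escape: the real opcode is in the following byte.
FIRST_BYTE[0x0F] = ("escape", "two-byte opcode")

# 0x80..0x83 are an instruction group: the ModRM byte that follows
# selects add/or/adc/sbb/and/sub/xor/cmp.
for op in range(0x80, 0x84):
    FIRST_BYTE[op] = ("group", "immediate group 1")

# 0x50..0x57 push a register, 0x58..0x5F pop one (16-bit names assumed).
REGS = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
for i, reg in enumerate(REGS):
    FIRST_BYTE[0x50 + i] = ("insn", "push " + reg)
    FIRST_BYTE[0x58 + i] = ("insn", "pop " + reg)


def classify(first_byte):
    """Return (kind, description) for the first byte of an instruction."""
    return FIRST_BYTE[first_byte]
```

With this table, classify(0x53) gives ("insn", "push bx"), while classify(0x81) only tells you the group; the next byte has to be examined to pick the actual instruction.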


> and I'm about to start writing the basic
> skeleton.
>
> The algorithm I had in mind was :
> Read N bytes (or the whole files, I'm not sure about it yet)


This is your first problem. My idea of a disassembler would take a start
address as input, not a file (so the code would already reside in memory).


If you have a file, then what sort of file is it: executable, object code,
etc.? It will have its own structure that you need to decode before you
get to the code.


You also need a way of recognising the end (otherwise disassembling will
continue until you run off the end of memory).
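A minimal Python sketch of that interface, assuming the code is already in memory as a bytes object and the caller supplies both a start and an end offset. The OPCODES table and the sweep function are invented for illustration and cover only four 16-bit-mode opcodes; this sketch simply stops at any byte it does not know.

```python
# Sketch only: (text template, total length in bytes) for a few opcodes,
# assuming 16-bit mode. OPCODES and sweep are illustrative names.
OPCODES = {
    0x90: ("nop", 1),
    0xC3: ("ret", 1),
    0xB8: ("mov ax, 0x{imm:04x}", 3),  # opcode + 16-bit immediate
    0xCD: ("int 0x{imm:02x}", 2),      # opcode + 8-bit immediate
}


def sweep(mem, start, end):
    """Disassemble mem[start:end] and never read at or beyond `end`."""
    out = []
    addr = start
    while addr < end:
        entry = OPCODES.get(mem[addr])
        if entry is None:
            out.append((addr, "??"))           # unknown byte: give up here
            break
        template, length = entry
        if addr + length > end:
            out.append((addr, "(truncated)"))  # instruction crosses the end
            break
        # Operand bytes (if any) follow the opcode, little-endian.
        imm = int.from_bytes(mem[addr + 1:addr + length], "little")
        out.append((addr, template.format(imm=imm)))
        addr += length
    return out
```

For example, with mem = bytes([0x90, 0xB8, 0x34, 0x12, 0xCD, 0x21, 0xC3, 0x90]), sweep(mem, 1, 7) returns [(1, "mov ax, 0x1234"), (4, "int 0x21"), (6, "ret")] and leaves the bytes outside the range alone.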


How big are the files you're processing? Is the output going to be
human-readable syntax, or are you doing some analysis of the code? I've
assumed here the former, in which case there is no point in disassembling a
1GB input file as it might take a decade or two for someone to read it...


> if this byte is part of a prefix instruction, parse it else continue
> to opcode, and so on and so on .
>
> Though, I ran into several "problems" in mind:
>
> 1. Which data structure should store the values I read ? A hash table
> or a Tree ? Or a combination of both ? (trie) Should the tree be
> balanced ? If not, will it cost in efficiency or whether balancing it
> will cost in efficiency ?


I don't understand; why do you want to store a linear set of disassembled
instructions in a tree? Usually the output of a disassembler is a textual
representation of the code (and you need to choose which kind of syntax to
display this).


If you're not converting to text straightaway, you might as well keep it as
undisassembled code! That's pretty compact. Perhaps just keep a series of
pointers to the start and end of each code block. When you're ready to
output, disassemble each block.
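A rough Python sketch of that arrangement: the raw bytes stay as they are, the only stored structure is a list of (start, end) offsets, and text is produced only at output time. The names ONE_BYTE, disassemble_block and listing are hypothetical, and the decoder knows only a few one-byte opcodes.

```python
# Sketch only: keep the raw image plus (start, end) block offsets,
# and decode lazily when a listing is requested.
ONE_BYTE = {0x90: "nop", 0xC3: "ret", 0xCC: "int3", 0xF4: "hlt"}


def disassemble_block(image, start, end):
    """Yield one text line per byte of image[start:end]."""
    for addr in range(start, end):
        byte = image[addr]
        text = ONE_BYTE.get(byte, "db 0x%02x" % byte)
        yield "%04x  %s" % (addr, text)


def listing(image, blocks):
    """Produce the text for every block, in the order given."""
    lines = []
    for start, end in blocks:
        lines.extend(disassemble_block(image, start, end))
    return lines
```

Usage might look like listing(image, [(0, 2), (4, 6)]): nothing between the blocks is ever decoded, and the block list costs two integers per block.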


>
> 2. What about invalid instructions ? Should I strip them the moment I
> detect they're invalid or should they be stored FFU ?


The main problem is not invalid instructions, but bytes that are not
instructions: either data or garbage. You will have to devise a way of
dealing with these and finding the start of the next valid instruction.
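One simple fallback, sketched in Python under heavy assumptions: treat any byte the decoder does not recognise as data, collect consecutive ones into a single "db" line, and carry on at the next byte that does decode. KNOWN and sweep_with_data are made-up names and only four one-byte opcodes count as instructions here. This does nothing about data that happens to look like valid code, which needs knowledge of where the code really starts.

```python
# Sketch only: group undecodable bytes into "db" lines and resume
# at the next byte that decodes. Only four opcodes are "known".
KNOWN = {0x90: "nop", 0xC3: "ret", 0xCC: "int3", 0xF4: "hlt"}


def format_data(pending):
    """Render a run of raw bytes as one db line."""
    return "db " + ", ".join("0x%02x" % b for b in pending)


def sweep_with_data(code):
    lines = []
    pending = []                    # undecodable bytes collected so far
    for byte in code:
        if byte in KNOWN:
            if pending:             # flush the data run before the instruction
                lines.append(format_data(pending))
                pending = []
            lines.append(KNOWN[byte])
        else:
            pending.append(byte)
    if pending:                     # trailing data at the end of the input
        lines.append(format_data(pending))
    return lines
```

For instance, sweep_with_data(bytes([0x90, 0x12, 0x34, 0xC3, 0xFF])) gives ["nop", "db 0x12, 0x34", "ret", "db 0xff"].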


On x86, a zero byte is the start of a valid (but rare) instruction, but you
will soon recognise which one! And there are also a lot of different
Intel CPU types; I suppose some codes will be valid instructions for some,
but not for others (the last time I did a disassembler was for the 186).


What's an FFU?


>
> 3. Which data structure should hold the final result of the
> disassembled instruction ?


See my answer above.


> 5. Should the disassembler itself be multi threaded or one program
> which does everything step-by-step and if it will be multi threaded -
> how can I handle or parse different instructions ? or handle
> synchronization ?


I won't ask why on earth you want to complicate your project this way. I
guess you're an expert in multi-threading and now want to use it for
everything. Perhaps get any disassembler working first then worry about
multi-threading later; if you must...


--
Bartc


