Re: Writing a disassembler ?

glen herrmannsfeldt <gah@ugcs.caltech.edu>
Sun, 12 Oct 2008 16:19:10 -0800

          From comp.compilers


From: glen herrmannsfeldt <gah@ugcs.caltech.edu>
Newsgroups: comp.compilers
Date: Sun, 12 Oct 2008 16:19:10 -0800
Organization: Compilers Central
References: 08-10-011
Keywords: disassemble
Posted-Date: 13 Oct 2008 05:33:42 EDT

So and so wrote:


> I've set myself a goal to write a disassembler, so far I've managed to
> understand most of Intel's documentation (for now it's only going to
> disassemble x86 code) and I'm about to start writing the basic
> skeleton.


> The algorithm I had in mind was :
> Read N bytes (or the whole files, I'm not sure about it yet)
> if this byte is part of a prefix instruction, parse it else continue
> to opcode, and so on and so on .


Disassemblers I have written, and seen written by others, work a
little differently.


First, read the whole program into memory. That was usually possible
20 or 30 years ago, and should be even easier today.


In addition to the memory for storing the program, use a second array
of the same size to record what you know about each byte of the program,
initialized to zero. Start at the entry point, if known, and record in the
second array that it is the beginning of an instruction (maybe set to 1).
Determine the opcode and perform the opcode-specific functions described
below. Mark in the second array the bytes which are part of the
instruction but not its beginning (maybe with 2). If the instruction was
not an unconditional branch (check the flags in the opcode-specific
table), continue with the next instruction, first checking that it hasn't
been processed yet (that is, its second array value is still zero).
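
As a rough sketch of that bookkeeping, here it is in Python. The names
(mark_instruction, UNKNOWN and so on) and the example bytes are only
illustrative, not taken from any particular disassembler.

```python
# Values stored in the second ("what do we know") array.
UNKNOWN, INSTR_START, INSTR_REST = 0, 1, 2

def mark_instruction(marks, addr, length):
    """Record that an instruction of `length` bytes starts at `addr`.

    Returns False and changes nothing if the range runs off the end of
    the program or any byte in it has already been claimed.
    """
    if addr < 0 or addr + length > len(marks):
        return False
    if any(marks[a] != UNKNOWN for a in range(addr, addr + length)):
        return False
    marks[addr] = INSTR_START
    for a in range(addr + 1, addr + length):
        marks[a] = INSTR_REST
    return True

program = bytes([0x3E, 0x05, 0xC9, 0xFF])  # arbitrary example bytes
marks = bytearray(len(program))            # all zero: nothing known yet
mark_instruction(marks, 0, 2)              # a two-byte instruction at 0
mark_instruction(marks, 2, 1)              # a one-byte instruction at 2
print(list(marks))                         # [1, 2, 1, 0]
```

The last byte stays 0, so it would later be printed as data.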


Otherwise (after an unconditional branch, or on reaching an instruction
that has already been processed), if there are any addresses on the saved
address stack, remove one and continue from there. If there aren't, print
out the results and stop.


Depending on the flags in the opcode specific table, do some of the
following.


Branch: Add the destination address to the saved address stack.
                    (Requires knowing the addressing mode for some machines.)


Indirect branch: (Branch address can't be determined, such as
                    using the contents of a register or indexed by a register.)
                    Make a note to the user for manual determination.
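
The whole loop described so far might look something like the following
Python sketch. The flag names, the table layout and the trace function are
my own assumptions for illustration. The table holds only a handful of Z80
opcodes, and treating RET as simply ending the current path is a
simplification.

```python
# Flag bits for a hypothetical opcode table.
BRANCH, UNCOND, INDIRECT = 1, 2, 4

# opcode -> (length in bytes, flags). A tiny subset of Z80 opcodes.
OPCODES = {
    0x00: (1, 0),                  # NOP
    0x3E: (2, 0),                  # LD A,n
    0xC2: (3, BRANCH),             # JP NZ,nn (conditional)
    0xC3: (3, BRANCH | UNCOND),    # JP nn
    0xC9: (1, UNCOND),             # RET (assumed to end the path)
    0xE9: (1, UNCOND | INDIRECT),  # JP (HL), target unknown
}

def trace(program, seeds):
    """Follow code from the seed addresses; return (marks, notes)."""
    marks = bytearray(len(program))  # 0 unknown, 1 start, 2 rest
    stack = list(seeds)              # saved address stack
    notes = []                       # indirect branches for the user
    while stack:
        pc = stack.pop()
        # Decode straight-line code until something stops us.
        while 0 <= pc < len(program) and marks[pc] == 0:
            entry = OPCODES.get(program[pc])
            if entry is None:
                break                # not in the table: leave as data
            length, flags = entry
            if pc + length > len(program):
                break                # instruction runs off the end
            if any(marks[pc + 1:pc + length]):
                break                # would overlap decoded bytes
            marks[pc] = 1
            for a in range(pc + 1, pc + length):
                marks[a] = 2
            if flags & INDIRECT:
                notes.append(pc)     # needs manual determination
            elif flags & BRANCH:
                # 16-bit little-endian absolute target
                stack.append(program[pc + 1] | (program[pc + 2] << 8))
            if flags & UNCOND:
                break                # no fall-through
            pc += length
    return marks, notes

program = bytes([0x3E, 0x05,        # 0000 LD A,5
                 0xC2, 0x08, 0x00,  # 0002 JP NZ,0008
                 0xC3, 0x0A, 0x00,  # 0005 JP 000A
                 0x00,              # 0008 NOP
                 0xC9,              # 0009 RET
                 0xE9,              # 000A JP (HL)
                 0xFF, 0xFF])       # 000B never reached: data
marks, notes = trace(program, [0])
print(list(marks))  # [1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 0, 0]
print(notes)        # [10]
```

The two bytes at the end are never reached from the entry point, so they
stay marked 0, and the JP (HL) at address 10 is reported for the user to
look at.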




At startup, you should read a list of addresses to initialize the
branch address stack. (Easiest is to also include the entry point.)


If there are indirect branches, manual searching for the appropriate
addresses (such as a table of addresses) and adding them to the initial
branch address stack is required. Usually only a few iterations of
running the disassembler and adding addresses will find all the actual
code.
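
For the address list, something as simple as one hex address per line is
probably enough. The file format here (hex addresses, with # starting a
comment) and the function name are just an assumption of mine. A small
Python sketch:

```python
def parse_seed_addresses(text):
    """Parse one hex address per line; '#' starts a comment."""
    seeds = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue                  # blank or comment-only line
        seeds.append(int(line, 16))   # accepts "0038" and "0x0038"
    return seeds

seed_file = """\
0000      # entry point
0038      # handler address, found by hand
0x0120    # entry taken from a table of addresses

"""
print(parse_seed_addresses(seed_file))  # [0, 56, 288]
```

After each run, the addresses found by inspecting the indirect branches
get appended to this file and the disassembler is run again.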


The opcode-specific table should have the mnemonic, address mode, and
instruction length (possibly modified by address mode): enough to
print out the instruction in the appropriate assembler format.
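
One possible shape for such a table, sketched in Python. The mode names
("imm8", "abs16", "rel8"), the OpInfo type and the 0x... output syntax are
made up for the example, and a real table would follow the target
assembler's conventions. The four entries are real Z80 opcodes.

```python
from typing import NamedTuple

class OpInfo(NamedTuple):
    mnemonic: str  # text template; {} is replaced by the operand
    mode: str      # addressing mode name (hypothetical)

# Number of operand bytes implied by each addressing mode.
MODE_OPERAND_BYTES = {"none": 0, "imm8": 1, "abs16": 2, "rel8": 1}

TABLE = {
    0x00: OpInfo("NOP", "none"),
    0x3E: OpInfo("LD A,{}", "imm8"),
    0xC3: OpInfo("JP {}", "abs16"),
    0x18: OpInfo("JR {}", "rel8"),
}

def decode(program, pc):
    """Return (length, text) for the instruction at pc."""
    info = TABLE[program[pc]]
    length = 1 + MODE_OPERAND_BYTES[info.mode]
    if info.mode == "none":
        operand = ""
    elif info.mode == "imm8":
        operand = f"0x{program[pc + 1]:02X}"
    elif info.mode == "abs16":
        value = program[pc + 1] | (program[pc + 2] << 8)  # little-endian
        operand = f"0x{value:04X}"
    else:  # rel8: signed offset from the end of the instruction
        offset = program[pc + 1]
        if offset >= 0x80:
            offset -= 0x100
        operand = f"0x{(pc + length + offset) & 0xFFFF:04X}"
    return length, info.mnemonic.format(operand)

print(decode(bytes([0x3E, 0x05]), 0))        # (2, 'LD A,0x05')
print(decode(bytes([0xC3, 0x34, 0x12]), 0))  # (3, 'JP 0x1234')
print(decode(bytes([0x00, 0x18, 0xFE]), 1))  # (2, 'JR 0x0001')
```

Here the length comes entirely from the addressing mode, which is one way
of reading "possibly modified by address mode".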


On processors (such as the Z80) that have mostly one byte opcodes
but some two byte opcodes, there should be a separate table for
two byte opcodes, indicated by a flag in the one byte opcode table.
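
For the Z80 case, a lookup along these lines would do. This is a sketch in
Python with only a few entries per table. The PREFIX marker and the lookup
function are hypothetical names; the opcodes themselves are real Z80 ones.

```python
PREFIX = "prefix"  # flag meaning "the real opcode is in the next byte"

# Second-level tables, indexed by the byte after the prefix.
CB_TABLE = {0x00: "RLC B", 0x07: "RLC A"}
ED_TABLE = {0x44: "NEG", 0xB0: "LDIR"}

# One-byte table; prefix entries point at a second table.
ONE_BYTE = {
    0x00: "NOP",
    0xC9: "RET",
    0xCB: (PREFIX, CB_TABLE),
    0xED: (PREFIX, ED_TABLE),
}

def lookup(program, pc):
    """Return (opcode length, mnemonic), or None if not recognized."""
    entry = ONE_BYTE.get(program[pc])
    if entry is None:
        return None
    if isinstance(entry, tuple) and entry[0] == PREFIX:
        if pc + 1 >= len(program):
            return None               # prefix is the last byte
        mnemonic = entry[1].get(program[pc + 1])
        return None if mnemonic is None else (2, mnemonic)
    return (1, entry)

code = bytes([0x00, 0xCB, 0x07, 0xED, 0xB0, 0xC9])
print(lookup(code, 0))  # (1, 'NOP')
print(lookup(code, 1))  # (2, 'RLC A')
print(lookup(code, 3))  # (2, 'LDIR')
```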


Bytes not identified as instructions should be printed in the
assembler specific form for hex constants, maybe eight or 16
on a line, or up to the next such boundary.
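
A sketch of that output step in Python, assuming the second array uses 0
for bytes never identified as code. The DB directive, the 0x.. constant
syntax and the dump_data name are placeholders for whatever the target
assembler actually uses.

```python
def dump_data(program, marks, per_line=8):
    """Return DB lines for every byte whose mark is still 0.

    Each line stops at the next per_line-aligned address or at the
    next byte that was identified as part of an instruction.
    """
    lines = []
    addr = 0
    n = len(program)
    while addr < n:
        if marks[addr] != 0:
            addr += 1                 # part of an instruction: skip
            continue
        boundary = min(n, (addr // per_line + 1) * per_line)
        stop = addr
        while stop < boundary and marks[stop] == 0:
            stop += 1
        values = ",".join(f"0x{b:02X}" for b in program[addr:stop])
        lines.append(f"{addr:04X}: DB {values}")
        addr = stop
    return lines

program = bytes(range(0x10, 0x10 + 20))  # 20 arbitrary bytes
marks = bytearray(20)
marks[0], marks[1], marks[2] = 1, 2, 1   # pretend 0..2 are code
for line in dump_data(program, marks):
    print(line)
# 0003: DB 0x13,0x14,0x15,0x16,0x17
# 0008: DB 0x18,0x19,0x1A,0x1B,0x1C,0x1D,0x1E,0x1F
# 0010: DB 0x20,0x21,0x22,0x23
```

The first line is short because it only runs up to the next 8-byte
boundary, after which the lines stay aligned.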


That should work for a large fraction of available byte addressable
machines. It should also work for word addressable machines with
a word size that is a multiple of eight bits. With a little more work,
it can handle other word sizes.


It should be mostly table driven, with new flags and addressing
modes added as needed.


-- glen

