Re: Inline block moves

christer@cs.umu.se (Christer Ericson)
Tue, 12 Nov 1991 07:48:31 GMT

          From comp.compilers

Related articles
Inline block moves disque@unx.sas.com (1991-11-11)
Re: Inline block moves mwm@pa.dec.com (Meyer) (1991-11-11)
Inline block moves jfc@ATHENA.MIT.EDU (John Carr) (1991-11-11)
Inline block moves jfc@ATHENA.MIT.EDU (John Carr) (1991-11-12)
Re: Inline block moves christer@cs.umu.se (1991-11-12)
Re: Inline block moves Bruce.Hoult@actrix.gen.nz (1991-11-12)
Re: Inline block moves meissner@osf.org (1991-11-15)

Newsgroups: comp.compilers
From: christer@cs.umu.se (Christer Ericson)
Keywords: assembler, optimize
Organization: Dep. of Info.Proc, Umea Univ., Sweden
References: 91-11-035
Date: Tue, 12 Nov 1991 07:48:31 GMT

In 91-11-035 disque@unx.sas.com (Thomas Disque) writes:
>Due to an overwhelming response, I am posting below my article on inline
>block moves. Please note that although I am describing optimal code as
>only hand assembly can craft it, I realize that compilers cannot always
>generate this code. I am simply outlining what I perceive as the ideal
>towards which we should strive. The previous statement was to prevent my
>being skewered by compiler writers more knowledgeable than I am :-)


I'm not sure whether by "optimal" you mean that the given code is the best
possible. I assume that optimal has the same relation to optimum as
maximal has to maximum; i.e., that optimal code is the best possible
version of the given code, whereas optimum code is the overall best
possible code. I'm not intimately familiar with all the processors
presented, but at least the code given for the 6502 and Z80 was far from
optimum. (Not that I claim that the code I'm giving is optimum, but it is
at least more optimal.)


On the 6502 it is much faster to use self-modifying code and to put this
self-modifying code in zero-page for faster self-modification. The tight
loop would then look something like


@1      LDA FROM,Y      ;4 cycles
@2      STA TO,Y        ;5
        DEY             ;2
        BNE @1          ;3/2
        INC @1+2        ;5 (+1 if not in zero page)
        INC @2+2        ;5 (+1 if not in zero page)
        DEX             ;2
        BNE @1          ;3/2


as opposed to (your code)


LOOP    LDA (FROM),Y    ;5 cycles
        STA (TO),Y      ;6
        DEY             ;2
        BNE LOOP        ;3/2
        INC FROM+1      ;5
        INC TO+1        ;5
        DEX             ;2
        BNE LOOP        ;3/2


This gives a gain of 2 cycles for each byte moved (or, when the code isn't
in zero page, for 255 out of every 256 bytes, since the two slower INCs
cancel the gain once per page), which on the 6502 must be considered quite
an improvement. Of course, self-modifying code can't be placed in ROM, so
care must be taken when doing such an optimization. Another advantage of
the self-modifying version is that it doesn't use any zero-page memory for
its pointers.
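
A rough way to check the per-page arithmetic is to total the cycles for
each loop. The sketch below is in Python rather than assembly, and the
function name copy_cycles is made up for illustration. It assumes the
documented NMOS 6502 timings (INC takes 5 cycles in zero page and 6 with an
absolute address), page-aligned buffers, and branches that don't cross a
page, and it ignores all setup code.

```python
# Rough cycle totals for the two 6502 copy loops (setup code not counted).
# Assumed timings: DEY/DEX 2, BNE 3 taken / 2 not taken.
DEY = DEX = 2
BNE_TAKEN, BNE_FALLTHROUGH = 3, 2


def copy_cycles(pages, load, store, inc):
    """Cycles to copy `pages` full 256-byte pages (hypothetical helper)."""
    # Inner loop: 256 iterations; the last BNE of each page falls through.
    inner = 256 * (load + store + DEY + BNE_TAKEN) - (BNE_TAKEN - BNE_FALLTHROUGH)
    # Outer step: two INCs, DEX, BNE (taken on all but the final page).
    outer = 2 * inc + DEX + BNE_TAKEN
    return pages * (inner + outer) - (BNE_TAKEN - BNE_FALLTHROUGH)


# LDA abs,Y / STA abs,Y with the loop in zero page (INC zp = 5 cycles).
self_mod_zp = copy_cycles(1, load=4, store=5, inc=5)
# Same loop outside zero page (INC abs = 6 cycles).
self_mod_abs = copy_cycles(1, load=4, store=5, inc=6)
# LDA (zp),Y / STA (zp),Y with the pointers in zero page.
indirect = copy_cycles(1, load=5, store=6, inc=5)

print(indirect - self_mod_zp)   # 512 cycles per page, i.e. 2 per byte
print(indirect - self_mod_abs)  # 510 cycles per page, i.e. 2 for 255 of 256 bytes
```

Under those assumptions the saving per page matches the figures above; the
cost of patching the operand addresses before the loop is not included.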


The Z80 program would benefit from having the LDIR instruction partially
unrolled into one or more LDIs. That is


LD HL,(FROM)
LD DE,(TO)
LD BC,N
LDI
LDI
...
LDIR


instead of just


LD HL,(FROM)
LD DE,(TO)
LD BC,N
LDIR


Each LDI adds an extra two bytes of code while saving 1 M cycle (5 T
states). Of course, the value of BC has to be accounted for so that the
correct number of bytes is moved: each LDI also decrements BC, so BC must
still be nonzero when the LDIR is reached.
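
To put numbers on the trade-off, here is a small sketch in Python (not
Z80 code). The helper names are hypothetical. It assumes the documented
timings of 16 T states for LDI and 21 T states for each repeating LDIR
iteration (16 for the final one), and it leaves out the three LD
instructions that set up HL, DE and BC.

```python
LDI_T = 16          # T states for one LDI
LDIR_REPEAT_T = 21  # LDIR iteration that repeats (BC still nonzero afterwards)
LDIR_LAST_T = 16    # final LDIR iteration (BC reaches zero)


def move_t_states(n, unrolled):
    """T states to move n bytes with `unrolled` LDIs followed by one LDIR."""
    # LDIR must still have at least one byte to move; entering it with
    # BC == 0 would move 65536 bytes instead.
    if not 0 <= unrolled < n:
        raise ValueError("need 0 <= unrolled < n")
    ldir_bytes = n - unrolled
    return unrolled * LDI_T + (ldir_bytes - 1) * LDIR_REPEAT_T + LDIR_LAST_T


def move_code_bytes(unrolled):
    """Code size of the LDI/LDIR sequence: two bytes per instruction."""
    return 2 * unrolled + 2


print(move_t_states(16, 0), move_code_bytes(0))  # 331 T states, 2 bytes
print(move_t_states(16, 4), move_code_bytes(4))  # 311 T states, 10 bytes
```

For a 16-byte move, four unrolled LDIs save 20 T states at a cost of 8
extra bytes of code, which is the size-versus-speed tension in question.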


I assume that similar optimizations can be found for the other code
snippets. However, one can discuss whether or not optimizations like
these should be done at all, especially if the code is meant to be
inlined. Most (if not all) optimizations increase the code size, which is
OK (up to a point) for a library routine, but for inlined code one would
want the code to be as small as possible while also being as fast as
possible (which, at least in these two cases, is a contradiction in
terms).


| Christer Ericson Internet: christer@cs.umu.se |
| Department of Computer Science, University of Umea, S-90187 UMEA, Sweden |
--

