Inline block moves

John Carr <jfc@ATHENA.MIT.EDU>
Mon, 11 Nov 91 21:20:03 EST

          From comp.compilers

Related articles
Inline block moves (1991-11-11)
Re: Inline block moves mwm@pa.dec.comMeyer) (1991-11-11)
Inline block moves jfc@ATHENA.MIT.EDU (John Carr) (1991-11-11)
Inline block moves jfc@ATHENA.MIT.EDU (John Carr) (1991-11-12)
Re: Inline block moves (1991-11-12)
Re: Inline block moves (1991-11-12)
Re: Inline block moves (1991-11-15)
| List of all articles for this month |

Newsgroups: comp.compilers
From: John Carr <jfc@ATHENA.MIT.EDU>
Keywords: assembler, optimize
Organization: Compilers Central
References: 91-11-035
Date: Mon, 11 Nov 91 21:20:03 EST

The article on inline block moves doesn't mention some RISC processors on
which block moves are best done using different methods. The IBM RT and
MIPS CPUs allow multiple loads and stores to be executing in parallel, and
the IBM RS/6000 has hardware support for unaligned and odd-sized block


The IBM RT allows overlapped loads and stores: a load or store takes 5
cycles to complete, but the instruction itself takes only 1 (load) or 2
(store) cycles. If a register is referenced before the load completes, the
processor stalls. Two memory operations can be pending; issuing a load
while two loads are executing stalls the processor until the first load
completes. While verifying the instruction timings given in the technical
reference I discovered one undocumented feature of the RT: there is a 2
cycle delay if a store follows a store (i.e. it is faster to insert a no-op
between them).

The RT doesn't have a conventional cache, but it does have a 16 byte
instruction buffer used for short loops. A branch normally takes 5 cycles,
but the branch at the end of a loop that fits within 4 words takes only 1

The best block move on an RT looks like:

get r0,$bytecount
lda r2,source
lda r3,dest-8
0: ls r4,0(r2) # this instruction must be word aligned
ls r5,4(r2)
inc r2,8
inc r3,8
sts r4,0(r3) # this stalls 1 cycle waiting for the load
sis r0,8 # subtract 8 from count and set condition codes
sts r5,4(r3)
jh 0b

Note that I put the first store earlier than it needs to be: it is faster to
have the store wait one cycle for the load to complete than it is to have
the 2 stores adjacent and take the 2 cycle penalty.

The RT has load-multiple and store-multiple instructions (does any IBM CPU
not have them?), but they are actually a little slower than this loop. An
unrolled loop is probably slower, because the instruction fetches when the
processor is not in "loop mode" offset the gains due to better instruction


The MIPS family of CPUs have a data cache, so loads are faster than on an
RT: a load instruction takes 1 cycle plus 1 more for the data to be
available. This means that, unlike the RT, the instructions to increment
the data pointers and test the loop count can not be overlapped with the
loads. For this reason it is best to unroll the loop. The library function
memcpy() moves 8 words at a time.

On the newer MIPS CPUs, loads take longer than 2 cycles so the details may
change (I don't know if 3 loads can be running at the same time, or if the
processor has a limit of 2 loads like the RT).


The RS/6000 has 3 methods of doing a block move. There is hardware support
for unaligned load/store-multiple and conventional aligned load/store
multiple instructions, in addition to normal load and store instructions.
The C library function memcpy() uses the unaligned load multiple
instructions. These have 2 advantages over the normal lm/stm: they do not
require aligned data, and they can write into any set of consecutive
registers (lm/stm require consecutive registers ending at r31). They have 1
disavantage: they are slower for aligned data (2 cycles longer, plus setup
time to load the byte count a special register).

An inline block move with a known size does not require the features of the
unaligned move instructions, but might use them anyway. A loop doing lm/stm
will be faster, but requires specific registers be written (since these
registers are preserved over function calls they are preferred to be used
for variables rather than as scratch registers).


These are the sizes, cycle count, and byte count for a block move loop (not
counting loop setup time). The 2 times for the RS/6000 are for using the
aligned or unaligned load multiple instruction.

words/iteration 8 8 2
cycles/iter 19 18-22 11
instruction bytes 76 20 16

Of these 3 machines, only the RS/6000 can easily do fast unaligned moves
inline (the MIPS takes 2 cycles to do an unaligned load, the RT has no
hardware support for unaligned moves).


Depending on your space/time constraints, inline block move may or may not
be worth implementing on these machines. On the RS/6000 a block move can be
generated without concern for operand alignment, it is only slightly more
code than a function call, but it is not much faster than a function call
(the function has ~10 cycles setup time that are not needed for the inline
code with a known byte count). On the RT inline a block move is worth
inlining, but to do so requires aligned data. The MIPS requires aligned
data also, but the benefit of an inline block move on the MIPS is unclear:
to get maximum speed the loop should be unrolled, but this may take too
large a penalty in code space.

        --John Carr (
[On AIX 1.0, we went to some effort to do block moves with load and store
multiple. I'm surprised to hear that regular loads and stores are faster.

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.