replacing multiplication by constant with simpler instructions (LONG)

Youfeng Wu <wu@sequent.com>
Fri, 23 Oct 1992 20:10:15 GMT

Newsgroups: comp.compilers
From: Youfeng Wu <wu@sequent.com>
Organization: Compilers Central
Date: Fri, 23 Oct 1992 20:10:15 GMT
Keywords: arithmetic, architecture

Replacing Multiplication by an Integer Constant
with a Sequence of SUB/ADD/SHIFT/LEA Instructions




1. Introduction


For pipelined or superscalar processors, multiplication is usually many
times slower than simple instructions such as add, sub, shift, or LEA
(shift by 1, 2, or 3, and add). Replacing a multiplication with a sequence
of simple instructions is beneficial because it reduces the number of
cycles and provides more opportunity for instruction-level parallelism.


For example, on the Intel i486 processor, multiplication of a 4-byte
integer takes 26 cycles, but add and sub take only one cycle and shift and
LEA take two cycles. If we replace reg0*136 by the following two
instructions,


reg = reg0 shift 7            *** reg0 left shift 7 bits
reg = LEA ( reg, reg0, 3 )    *** reg add ( reg0 left shift 3 bits )


the multiplication will take only 4 cycles rather than 26 cycles.


Since the speed discrepancy between multiplication and the simple
instructions is so large, we can usually use a sequence of more than two
instructions and still gain performance. For example, we can use the
following five instructions to perform reg0 * 1950 and save 17 cycles.


reg = LEA ( reg0, reg0, 1 )   *** reg  = reg0 * 3
reg = reg shift 5             *** reg  = reg0 * 96
reg = LEA ( reg, reg0, 1 )    *** reg  = reg0 * 98
reg1 = reg0 shift 11          *** reg1 = reg0 * 2048
reg = reg1 sub reg            *** reg  = reg0 * 1950
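
As a sanity check, both decompositions above can be verified with a short
C program. The lea() helper below merely models the LEA semantics assumed
in this article (base plus index shifted left by 1, 2, or 3); it is an
illustration, not a real instruction intrinsic.

#include <assert.h>

/* lea(base, index, s) models base + (index << s), s = 1, 2, or 3. */
static long lea(long base, long index, int s)
{
    return base + (index << s);
}

int main(void)
{
    long x = 7;                     /* any sample multiplicand */

    /* reg0*136 = (reg0 << 7) + (reg0 << 3) */
    long reg = x << 7;
    reg = lea(reg, x, 3);
    assert(reg == x * 136);

    /* reg0*1950 via the five-instruction sequence above */
    long r = lea(x, x, 1);          /* reg0 * 3    */
    r = r << 5;                     /* reg0 * 96   */
    r = lea(r, x, 1);               /* reg0 * 98   */
    long r1 = x << 11;              /* reg0 * 2048 */
    r = r1 - r;                     /* reg0 * 1950 */
    assert(r == x * 1950);
    return 0;
}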


In addition to the cycle savings, a sequence of simple instructions is
RISC-style code and is more suitable for instruction-level parallelism.
For example, it is easy to insert other instructions among the sequence of
simple instructions and run them in parallel on a superscalar processor.


Here we focus our attention on converting multiplication by a constant to
a sub/add/shift/LEA instruction sequence. More powerful instructions are
not allowed. Also, we assume the integer constant is positive; negative
numbers can be handled similarly, as mentioned by Bernstein [2].


2. Problem Formulation.


If only add is allowed in the simple instruction sequence, a nice
mathematical formulation of the problem can be found in Knuth [1], namely
the concept of an "addition chain". For a multiplication X * C, where C is
a constant, an "addition chain" for C is defined as a sequence of integers


1 = a0, a1, a2, ..., ar = C


that are generated by the rule


ai = aj + ak, for some j, k such that i > j >= k, for all i in [1, r].


Once an addition chain is found, the calculation of X * C can be performed
by the sequence of additions:


a1*X = X + X
...
ai*X = aj*X + ak*X            i > j >= k
...
C*X = ar*X = aw*X + au*X      r > w >= u.


For example, an addition chain for the number 9 can be 1 2 4 5 9 because
9 = 5+4, 5=4+1, 4=2+2, 2=1+1. From this addition chain, the following
sequence of additions can be generated to calculate reg1*9:


reg2 = reg1 + reg1            *** 2 * reg1
reg3 = reg2 + reg2            *** 4 * reg1
reg4 = reg3 + reg1            *** 5 * reg1
reg5 = reg4 + reg3            *** 9 * reg1
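
The chain can also be replayed mechanically. The C sketch below evaluates
an addition chain given as index pairs (a[i] = a[j[i]] + a[k[i]]); the
arrays encode the chain 1 2 4 5 9 from the example, and the variable names
are invented.

#include <assert.h>

/* Evaluate x*C from an addition chain: a[0] holds 1*x and
   a[i] = a[j[i]] + a[k[i]] for i >= 1.  Entry 0 of j[] and k[] is unused. */
int main(void)
{
    long x = 13;                       /* any sample multiplicand */
    int j[] = {0, 0, 1, 2, 3};         /* 2 = 1+1, 4 = 2+2, 5 = 4+1, 9 = 5+4 */
    int k[] = {0, 0, 1, 0, 2};
    long a[5];
    a[0] = x;
    for (int i = 1; i < 5; i++)
        a[i] = a[j[i]] + a[k[i]];
    assert(a[4] == x * 9);             /* the chain 1 2 4 5 9 reaches 9 */
    return 0;
}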


The problem of finding the optimal sequence is equivalent to finding the
addition chain of minimal length, which is believed to be NP-complete (no
time to prove it yet).


In the case where add, sub, shift, and LEA instructions are all allowed,
Daniel Magenheimer et al. [3] extended the concept of addition chain to a
"sub/add/shift/LEA chain" (I will call it SASL-chain for short). That is,
the rule


ai = aj + ak, for some j, k such that i > j >= k, for all i in [1, r].


is extended to


ai = aj + ak or
ai = aj - ak or
ai = aj << ak or
ai = LEA(ak, aj, 1) or
ai = LEA(ak, aj, 2) or
ai = LEA(ak, aj, 3) for some j, k such that i > j >= k
for all i in [1, r].


For example, an SASL-chain for 9 is 1, 9 because 9 = LEA(1, 1, 3).
Although including sub/shift/LEA instructions can reduce the length of the
instruction sequence, it increases the size of the search space for the
minimal sequence, and therefore the problem of finding the optimal
sequence becomes harder and should remain NP-complete. Any practical
solution for finding a fast sequence for multiplication by a constant must
use heuristics.


3. Heuristic solution of Robert Bernstein


Robert Bernstein [2] provided a nice "branch and bound" heuristic search
for a good add/sub/shift sequence for constant multiplication.
We extended it to use LEA instructions as much as possible.
The extended algorithm can be summarized as follows.


Algorithm branch_bound (n)
        Input: an integer n, costs for add, sub, shift, and LEA instructions
        Output: an SASL-chain for multiplication by n and the associated cost:
                ( SASL-chain, cost of the chain ).


        Process: In the following, chain.result stands for the intermediate
        result after evaluating the SASL-chain "chain".


        if ( n <= 1 )
            return (nil, 0)
        if ( bound condition is met, abort this search path )
            return (nil, max_cost)
        if ( n is even, with n = m * 2**i where m is an odd number )
            ( chain, cost ) = branch_bound ( m )
            return ( [chain, chain.result << i], cost + cost of shift );
        (chain1, cost1) = branch_bound ( n - 1 )
        (chain2, cost2) = branch_bound ( n + 1 )
        (chain3, cost3) = branch_bound ( n / (2**i + 1) ), i = 1, 2, or 3, when divisible
        (chain4, cost4) = branch_bound ( n / (2**i + 1) ), i > 3, when divisible
        (chain5, cost5) = branch_bound ( n / (2**i - 1) ), i > 0, when divisible
        if ( n == (2**i + 1) ), i = 1, 2, or 3
            (chain6, cost6) = branch_bound ( 1 )


        min_cost = min ( cost1, cost2, cost3, cost4, cost5, cost6 )
        if ( min_cost == cost1 )
            return ( [chain1, chain1.result + 1], cost1 + cost of add );
        if ( min_cost == cost2 )
            return ( [chain2, chain2.result - 1], cost2 + cost of sub );
        if ( min_cost == cost3 )
            return ( [chain3, LEA(chain3.result, chain3.result, i)],
                     cost3 + cost of LEA );
        if ( min_cost == cost4 )
            return ( [chain4, chain4.result << i + chain4.result],
                     cost4 + cost of shift + cost of add );
        if ( min_cost == cost5 )
            return ( [chain5, chain5.result << i - chain5.result],
                     cost5 + cost of shift + cost of sub );
        if ( min_cost == cost6 )
            return ( [chain6, LEA(chain6.result, chain6.result, i)],
                     cost6 + cost of LEA );
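

To make the control flow concrete, here is a compact C sketch of the
recursion. To keep it short it returns only the estimated cost rather than
the chain itself, uses a simple recursion-depth limit as the bound
condition, handles the three factor cases in a single loop capped at
i = 12, and plugs in the i486 cycle costs assumed earlier. It is a sketch
of the idea, not Bernstein's code or ours.

#include <stdio.h>

#define COST_ADD   1
#define COST_SUB   1
#define COST_SHIFT 2
#define COST_LEA   2
#define MAX_COST   1000              /* "infinite" cost for pruned paths */

/* Return an estimated cost of a sub/add/shift/LEA sequence that multiplies
   by n; depth is the remaining search depth (the bound condition). */
static int branch_bound(unsigned long n, int depth)
{
    if (n <= 1)
        return 0;
    if (depth <= 0)                  /* bound condition met: abort path */
        return MAX_COST;
    if ((n & 1) == 0) {              /* n = m * 2**i with m odd */
        while ((n & 1) == 0)
            n >>= 1;
        return branch_bound(n, depth - 1) + COST_SHIFT;
    }

    int best, c;
    best = branch_bound(n - 1, depth - 1) + COST_ADD;
    c = branch_bound(n + 1, depth - 1) + COST_SUB;
    if (c < best)
        best = c;
    for (int i = 1; i <= 12; i++) {  /* factor cases 2**i + 1 and 2**i - 1 */
        unsigned long p = (1UL << i) + 1, m = (1UL << i) - 1;
        if (n % p == 0) {            /* LEA for i <= 3, else shift + add */
            c = branch_bound(n / p, depth - 1)
                + (i <= 3 ? COST_LEA : COST_SHIFT + COST_ADD);
            if (c < best)
                best = c;
        }
        if (m > 1 && n % m == 0) {   /* shift + sub */
            c = branch_bound(n / m, depth - 1) + COST_SHIFT + COST_SUB;
            if (c < best)
                best = c;
        }
    }
    return best;
}

int main(void)
{
    printf("estimated cost for *1950: %d\n", branch_bound(1950, 8));
    return 0;
}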


This heuristic can discover many good sequences, and in many cases the
result is optimal. For example, Preston Briggs pointed out that for the
numbers between 1 and 1000, the original Bernstein algorithm requires a
total of 5512 cycles whereas the optimal sequences require 5039.


But the extended Bernstein algorithm using LEA still does not take full
advantage of the LEA instruction. An example of a non-optimal case is
X*29, which (even with the extension to use LEA) needs the following
four-instruction sequence:


8 = 1 << 3 reg1 = X << 3
7 = 8 - 1 reg2 = reg1 - X
28 = 7 << 2 reg3 = reg2 << 2
29 = 28 + 1 reg4 = reg3 + X


whereas we notice that 29 = 32 - 3, so the following three instructions
are sufficient:


reg1 = LEA(X, X, 1)           *** 3 * X
reg2 = X << 5                 *** 32 * X
reg3 = reg2 - reg1            *** 29 * X


We will compare this extended Bernstein algorithm using LEA with our
algorithm later.


4. Heuristic solution of Daniel Magenheimer et al. [3]


A rule-based heuristic was used by Daniel Magenheimer et al. [3], which
takes full advantage of LEA instructions.


The advantage of the rule-based system is that many of the small tricks we
humans use can be specified as rules and picked up by the inference
engine. Daniel Magenheimer et al. [3] mention that the rule-based approach
generated optimal sequences for all numbers under 10,000 except for 12 of
them. We believe that can be done and it should not be too difficult to
add more rules to cover those 12 cases too. But we speculate that the
rule-based approach may not be as fast as a procedural approach. Also, it
may not be cost-effective to include a rule-based system in an existing
compiler solely for the purpose of replacing multiplication by constants.
This situation may change, since people are proposing rule-based
approaches for several other compiler optimizations ([ref]).


5. Our heuristic exploiting LEA instructions


The LEA instruction is very powerful. It can potentially perform two
operations in one instruction and should be exploited as much as possible.
Assume LEA (B,I,S) = B + (I << S), S = 1, 2, ..., MAX_S.


5.1. Basic algorithm using LEA


Each integer constant N can be represented as a binary sequence


N = sum ( ai*2**i | i = 0, ..., k ), where ai is either 0 or 1.


To use LEA, we note that the above number can be represented recursively
as follows, assuming bits i and j are the lowest two bits that are ones:


N = 2**i ( 1 + 2**(j-i) * N' )
where N' = ( N - 2**i ) >> j has one less 1 bit than N.


If (j-i) <= MAX_S, N can be calculated by computing N' and then applying
an LEA instruction and possibly a shift instruction, namely,


N' = ...
temp = LEA(1, N', (j-i) )
N = temp << i


If i == 0 (i.e. a0 = 1), the last shift can be removed.
Note that, if (j-i) <= MAX_S, N' will have a0' = 1, so a shift
is not needed after calculating N'.


If (j-i) > MAX_S, then


N = 2**i ( 1 + 2**MAX_S * N' )
where N' = ( ( N - 2**i ) >> (MAX_S + i) ) has one less 1 bit than N.


and N can be calculated as follows:


N' = ...
temp = LEA(1, N', MAX_S )
N = temp << i


In this case, N' will have a0' = 0, so a shift may be needed
after calculating N'.


Assume N has r ones and t instances in which consecutive ones are more
than MAX_S bits apart. This basic algorithm will then use r-1 LEAs and t
shifts, plus one more shift when a0 = 0, for multiplying by N; that is, a
total of r - 1 + t + (1 - a0) LEA and shift instructions.


An improvement can be made to the above procedure when we notice
that if 1 <= i <= MAX_S,


N = 2**i + N'
where N' = ( N - 2**i ) has one less 1 bit than N.


and N can be calculated as follows:


N' = ...
temp = LEA(N', 1, i)


This will save the shift instruction in the above procedure when
1 <= i <= MAX_S.


Before describing the basic algorithm, we introduce the following
"condensed representation" of integer numbers which simplifies the
implementation of the above procedure.


For an integer number N, assume the number of ones in its binary
representation is r. We may mark the ones in the binary sequence, and
assume they are at locations:


L0, L1, ..., Lr-1, where Li+1 > Li >= 0


The condensed representation of the number N is


N = R0, R1, ..., Rr-1


where R0 = L0, and Ri = Li - Li-1, for i = 1, ..., r-1.
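

For concreteness, the condensed representation can be built with a few
lines of C; the sketch below assumes R[] has room for one entry per 1 bit
of n.

#include <stdio.h>

/* Build the condensed representation R[0..r-1] of n: R[0] is the bit
   position of the lowest 1, and each later R[i] is the distance from the
   previous 1 to the next.  Returns r, the number of 1 bits. */
static int condense(unsigned long n, int R[])
{
    int r = 0, gap = 0;
    while (n != 0) {
        if (n & 1) {
            R[r++] = gap;      /* distance from the previous 1 (R[0]: from bit 0) */
            gap = 0;
        }
        gap++;
        n >>= 1;
    }
    return r;
}

int main(void)
{
    int R[64];
    int r = condense(41, R);   /* 41 = 101001 binary  ->  R = {0, 3, 2} */
    for (int i = 0; i < r; i++)
        printf("%d ", R[i]);
    printf("\n");
    return 0;
}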


The basic algorithm can be described as follows.


Algorithm basic_decompose_mul ( R, r, N )
        Input: N - the constant to be multiplied
r - the number of 1 bits in the constant to be processed.
R[0...r-1] - the condensed representation of the constant.
        Output: emit LEA/shift instructions for multiplying the constant
and return the root of the instruction tree emitted.
        Process:
if ( R[0] > MAX_S ) {
r0 = R[0];
R[0] = 0;
rslt = decompose_mul ( R, r );
rslt = emit_shift ( rslt, r0 );
} else if ( R[0] > 0 ) {
r0 = R[0];
R[1] += r0;
rslt = decompose_mul ( R[1..r-1], r - 1 );
rslt = emit_LEA ( rslt, 1, r0 );
} else {
r1 = R[1] > MAX_S ? MAX_S : R[1];
R[1] -= r1;
rslt = decompose_mul ( R[1..r-1], r - 1 );
rslt = emit_LEA ( 1, rslt, r1 );
}
return rslt
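

A self-contained C sketch of the basic algorithm is shown below. To stay
short it walks the bits of N directly instead of using the condensed
representation, emits symbolic instructions as text, and omits the
1 <= i <= MAX_S improvement; the register numbering is invented (r0 holds
the multiplicand x).

#include <stdio.h>

#define MAX_S 3                   /* largest LEA scale: shift by 1, 2, or 3 */

static int last_reg = 0;          /* r0 holds x; results go to r1, r2, ... */

/* Emit a symbolic LEA/shift sequence that computes x*n and return the
   number of the register holding the result. */
static int basic_decompose(unsigned long n)
{
    int src, dst;
    if (n == 1)
        return 0;                              /* r0 already holds x */
    if ((n & 1) == 0) {                        /* n = m * 2**i, m odd */
        int i = 0;
        while ((n & 1) == 0) { n >>= 1; i++; }
        src = basic_decompose(n);
        dst = ++last_reg;
        printf("r%d = r%d << %d\n", dst, src, i);
        return dst;
    }
    {                                          /* n odd, n > 1:          */
        unsigned long m = n - 1;               /* n = 1 + (n' << s) with */
        int s = 0;                             /* s = min(gap, MAX_S)    */
        while (s < MAX_S && (m & 1) == 0) { m >>= 1; s++; }
        src = basic_decompose(m);
        dst = ++last_reg;
        printf("r%d = LEA(r0, r%d, %d)\n", dst, src, s);  /* r0 + (rsrc << s) */
        return dst;
    }
}

int main(void)
{
    basic_decompose(625);          /* emit a sequence for x*625, for example */
    return 0;
}

For 625 (1001110001 in binary: r = 5 ones, one gap larger than MAX_S, and
a0 = 1) this prints four LEAs and one shift, matching the instruction
count given above.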


Since the number of instructions generated by this algorithm is in
proportion to the number of ones in N, we should try to reduce the
number of ones when possible.


5.2. Generalized complementary scheme


The above recursive procedure can be improved by converting a run of ones
into a single one with a subtraction (if it is safe to do so). For
example, X consecutive 1's can be converted to 2**X - 1. This
complementary approach can reduce the number of ones very quickly. Since
evaluating 2**X - 1 takes two instructions (a shift and a sub), X should
be at least three for this complementary scheme to be beneficial.


This method can be extended to a sequence of 0's and 1's as long
as the final sequence takes fewer instructions. Assume


N = sum ( ai*2**i | i = 0, ..., k )
where ai is either 0 or 1, and 2**k < N < 2**(k+1)


then
2**(k+1) = sum ( (ai + ~ai)*2**i | i = 0, ..., k ) + 1
= N + sum ( (~ai)*2**i | i = 0, ..., k ) + 1


That is,


N = 2**(k+1) - N'
where N' = sum ( (~ai)*2**i | i = 0, ..., k ) + 1


Note that sum ( (~ai)*2**i | i = 0, ..., k ) has one 1 bit for each zero
in N below the leading one.


So if N has r 1's and t zeros below the leading one, roughly r - t - 2
instructions will be saved by the conversion.


The generalized complementary scheme is to replace N with 2**(k+1) - N'
when


r - t - 2 > 0.


When t = 0, we get back to the case for consecutive 1's.


This generalized scheme can be incorporated into the above recursive
procedure very nicely. Namely, at the beginning of each step of the
recursion, check N for a prefix that maximizes r - t - 2 while keeping it
positive. If such a prefix is found, say N = N1 + N2, where
N1 = 2**(k+1) - N1', then N can be calculated as follows:


N1' = ...
temp = N2 + 2**(k+1)
N = temp - N1'
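

As an illustration, the profitability test r - t - 2 > 0 can be written as
the short C function below. It applies the test to a whole number rather
than to a prefix, which is a simplification of what the combined algorithm
actually does.

#include <stdio.h>

/* Return 1 if complementing looks profitable for n as a whole, i.e.
   r - t - 2 > 0, where r is the number of 1 bits of n and t is the number
   of 0 bits below the leading 1. */
static int complement_pays_off(unsigned long n)
{
    int r = 0, width = 0;
    unsigned long m;
    for (m = n; m != 0; m >>= 1) {
        r += (int)(m & 1);
        width++;               /* width - 1 is the position of the leading 1 */
    }
    return r - (width - r) - 2 > 0;    /* t = width - r */
}

int main(void)
{
    /* 119 = 1110111 binary: r = 6, t = 1, so complementing pays off
       (119 = 128 - 9, and 9 is a single LEA); 9 = 1001 binary does not. */
    printf("%d %d\n", complement_pays_off(119), complement_pays_off(9));
    return 0;
}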


5.3. Factoring to use LEA


If N is a multiple of ( 1 + 2**i ) for some i = 1, 2, ..., MAX_S, that is,
N can be written as


N = ( 1 + 2**i ) * N'
where N' = N / ( 1 + 2**i )


then the multiplication by N can be calculated as:


N' = ...
N = LEA ( N', N', i )


In most cases, N' will have far fewer 1 bits than N, so the above
factoring reduces the number of 1 bits very rapidly. But that is not
always true: there are cases for which N' has more 1's than N, so
factoring could potentially do worse than not factoring. An example is
290 = 58 * 5, where 290 = 2**1 + 2**5 + 2**8 (three 1 bits), while
58 = 2**1 + 2**3 + 2**4 + 2**5 (four 1 bits).


So we apply factoring only when N' has fewer 1 bits than N.
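

The corresponding test for the factoring scheme might look like the
following C sketch; it only reports a usable scale i rather than emitting
the LEA, and the bit counting is done by hand to keep it self-contained.

#include <stdio.h>

#define MAX_S 3                        /* LEA scales 1, 2, 3 */

static int ones(unsigned long n)       /* number of 1 bits in n */
{
    int c = 0;
    for (; n != 0; n >>= 1)
        c += (int)(n & 1);
    return c;
}

/* If n is divisible by (1 + 2**i) for some i in 1..MAX_S and the quotient
   has fewer 1 bits than n, return that i; otherwise return 0.  The caller
   would then emit  N = LEA(N', N', i)  with N' = n / (1 + 2**i). */
static int lea_factor(unsigned long n)
{
    int i;
    for (i = 1; i <= MAX_S; i++) {
        unsigned long f = (1UL << i) + 1;
        if (n % f == 0 && ones(n / f) < ones(n))
            return i;
    }
    return 0;
}

int main(void)
{
    printf("%d %d\n", lea_factor(290), lea_factor(225));
    return 0;
}

With these inputs the test rejects 290, as discussed above, but accepts
225 = 9 * 25 with i = 3.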


5.4. Combined algorithm


The combined procedure for replacing multiplication by an integer N with a
sequence of sub/add/shift/LEA instructions uses all three of the
techniques described in 5.1, 5.2, and 5.3. For a number N, it first checks
whether N has a prefix sub-sequence of 0's and 1's that can be
complemented to reduce the number of ones in N. If so, it applies the
complementary scheme with the N' that has fewer ones. If not, it checks
whether N can be factored to use LEA and reduce the number of ones; if so,
it applies the factoring scheme. Otherwise the basic algorithm is used.
This can be described as follows:


Algorithm decompose_mul( R, r, N )
        Input: N - the constant to be multiplied
r - the number of 1 bits in the constant to be processed.
R[0...r-1] - the condensed representation of the constant.
        Output: emit LEA/shift instructions for multiplying the constant
and return the root of the instruction tree emitted.


        Process:


if ( r <= 2 )
return evaluate_simple_cases(R, r, N);


if ( complementary scheme applied )
return its result


if ( factoring scheme applied )
return its result


return basic_decompose_mul ( R, r, N );


5.5. Performance result


The above algorithm combines three techniques to exploit LEA and reduce
the number of simple instructions for multiplication by a constant.
Because it reduces the number of ones in the input number by at least one
in each iteration, it runs very fast. Table 1 compares the execution time
(on an i486 system) of our algorithm (ALG1) with Bernstein's algorithm as
implemented by Preston Briggs, extended by us to use LEA (ALG2). Both user
time and system time are listed, with system time in parentheses.


The execution time of ALG1 is fairly uniform, roughly 50/1,000,000 of a
second per multiplication, and the amount of system time is negligible.
On the other hand, the execution time of ALG2 seemingly grows more than
linearly as the number of constants increases. Also, ALG2 incurs a
significant amount of system time overhead (we had already more than
doubled the hash table size to speed up the search in ALG2).


Table 1. Execution time comparison (in units of 0.1 sec)


NUMBER RANGES ALG1 ALG2
===============================================
1 100000 42.7 (0.1) 40.7 (36.3)
1 200000 92.2 (0.1) 149.7 (184.6)
1 300000 145.3 (0.1) 309.2 (296.7)
1 400000 199.1 (0.1) 523.0 (357.6)
1 500000 255.6 (0.1) 777.7 (390.8)
1 600000 312.9 (0.2) 1089.1 (394.7)
1 700000 371.4 (1.2) 1456.8 (396.0)
1 800000 428.0 (0.4) 1893.9 (399.2)
1 900000 488.4 (0.2) 2356.3 (408.3)
1 1000000 548.9 (0.2) 2882.3 (440.1)


The following compares the quality of the generated instruction
sequences.


For simplicity, we assume each of the simple instructions takes
only one cycle. Thus we can compare the quality of instruction
sequences simply by counting how many instructions they use.
Table 2 compares the number of instructions used by our combined
algorithm (ALG1) and by Bernstein's algorithm as implemented by Preston
Briggs with our extension to use LEA (ALG2).


ALG2 consistently uses more instructions (around 7% more) than
ALG1. Furthermore, the difference is larger for small numbers
than for big numbers.


Table 2. Number of instructions


NUMBER RANGES ALG1 ALG2 % diff
======================================================
1 100000 679904 740311 8.885%
1 200000 1446258 1565844 8.269%
1 300000 2247738 2422837 7.790%
1 400000 3065921 3301085 7.670%
1 500000 3921855 4197590 7.031%
1 600000 4755289 5098608 7.220%
1 700000 5604671 6012433 7.275%
1 800000 6478143 6938681 7.109%
1 900000 7369242 7871300 6.813%
1 1000000 8275983 8813951 6.500%


The following summarizes the relative effect of each of the three
techniques in our algorithm on the number of instructions used. The basic
algorithm for exploiting LEA is referred to as STEP1, STEP1 with the
generalized complementary scheme is referred to as STEP2, and STEP2 with
factoring for LEA is referred to as STEP3. For each range of numbers,
Table 3 shows the difference between the total number of 1 bits and the
number of instructions used. The bigger the difference, the better the
technique, since fewer instructions are used. Clearly, the performance of
the algorithm gets better as it evolves from STEP1 to STEP2 and then to
STEP3.


Table 3. Performance results of the algorithm


NUMBER RANGES STEP1 STEP2 STEP3
================================================================
00000001 10000000 -14631598 3546659 17975029
10000001 20000000 -16631634 3574258 19228199
20000001 30000000 -16412241 3235187 20628282
30000001 40000000 -18850950 3908062 19065351
40000001 50000000 -17188144 3869904 21889975
50000001 60000000 -17636373 2596056 20623154
60000001 70000000 -16601799 12169129 25227637
70000001 80000000 -23099969 -4279395 14079310
80000001 90000000 -20917643 1244995 20105549
90000001 100000000 -15458721 6554094 24913372




5.6. Preventing overflows


Converting multiplication by a constant to a sequence of sub/add/shift/LEA
instructions may introduce runtime overflow. This can be prevented by
excluding the sub instruction from the sequence. Since the sub instruction
is only used in the generalized complementary scheme, we added a compiler
option which disables that scheme when it is specified.


6. Acknowledgment


I would like to thank Rocky Bernstein, Preston Briggs, Cliff Click, Hans
Olsson, Thomas David Rivers, and many other people from the net for
references, insights, and discussions on the topic of multiplication by
constant. Special thanks to Preston Briggs for his implementation of
Bernstein's algorithm, which is used in our performance comparison.


References


[1] D. E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical
Algorithms, pp. 402-422.


[2] Robert Bernstein, "Multiplication by Integer Constants," Software --
Practice and Experience, Vol. 16, No. 7, July 1986, pp. 641-652.


[3] Daniel Magenheimer et al., "Multiplication and Division on
the HP Precision," Proceedings of ASPLOS II, pp. 90-99.