# Re: Any references for efficient division code sequences?

## David Chase <David.Chase@naturalbridge.com>, 3 Jul 2001 23:13:26 -0400

From comp.compilers

Related articles:

- Any references for efficient division code sequences? shankar@webnexus.com (Shankar Unni) (2001-06-28)
- Re: Any references for efficient division code sequences? torbenm@diku.dk (2001-07-02)
- Re: Any references for efficient division code sequences? rv3s@cobra.cs.virginia.edu (Raja Venkateswaran) (2001-07-02)
- Re: Any references for efficient division code sequences? David.Chase@naturalbridge.com (David Chase) (2001-07-03)
- Re: Any references for efficient division code sequences? kadiyala@home.com (Suresh Kadiyala) (2001-07-17)

From: David Chase
Newsgroups: comp.compilers
Date: 3 Jul 2001 23:13:26 -0400
Organization: Compilers Central
References: 01-06-060
Keywords: arithmetic
Posted-Date: 03 Jul 2001 23:13:25 EDT

Shankar Unni wrote:
> Specifically for dividing by small constants (< 16). We're working on a
> small MIPS-like network processor that has no MUL/DIV unit, and need to
> be able to do a *fast* mod by small numbers.

> Are there any references that discuss code sequences for dividing (or
> specifically in our case, remaindering) integers by small constants (for
> any processor that doesn't have a MUL/DIV unit)?

Off hand, I'm not sure of references for your exact situation.

What I have done in the past for computing results modulo N is to
precompute the largest N * (2**M) that fits in a word, and then
iteratively subtract. For example:

unsigned int modulo_small_odd_n(unsigned int x, unsigned int n) {
    unsigned int largest_multiple = lm[n]; /* largest n * 2**M fitting in a word */
    while (largest_multiple >= n) {
        if (x >= largest_multiple) x -= largest_multiple;
        largest_multiple = largest_multiple >> 1;
    }
    return x;
}

This is going to give you nearly wordsize iterations.
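As a self-contained illustration (the `lm[]` table is replaced here by finding the largest multiple on the fly, and the function name is mine, not part of the original code):

```c
#include <assert.h>

/* Sketch of the iterative-subtraction method above.  Instead of a
   precomputed lm[] table, the largest n * 2**M not exceeding x is
   found on the fly.  Requires n >= 1. */
static unsigned int mod_by_subtraction(unsigned int x, unsigned int n) {
    unsigned int largest_multiple = n;
    /* Double n as long as the doubled value still cannot exceed x. */
    while (largest_multiple <= x / 2)
        largest_multiple <<= 1;
    /* Conditionally subtract each multiple n * 2**k, k = M down to 0. */
    while (largest_multiple >= n) {
        if (x >= largest_multiple)
            x -= largest_multiple;
        largest_multiple >>= 1;
    }
    return x;
}
```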

Another way of doing this is with two (or more) tables.
Since your small constant is less than 16, you could use
16 sets of 4 tables of 256 bytes each (16 k total), where
(for 1 <= x <= 16, 0 <= y <= 255)

T_x_3[y] = (y << 24) mod x
T_x_2[y] = (y << 16) mod x
T_x_1[y] = (y << 8) mod x
T_x_0[y] = (y << 0) mod x

Given an input z, z mod x is

T_x_0[ T_x_0[z & 255] +
T_x_1[(z >> 8) & 255] +
T_x_2[(z >> 16) & 255] +
T_x_3[(z >> 24) & 255] ]

This approach works for x <= 64, because the maximum value of the sum,
4 times (x-1), must remain smaller than 256.
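A quick way to sanity-check the identity is to build the tables at runtime and compare against the hardware remainder. The runtime construction below is an assumption of this sketch; the post precomputes the tables:

```c
#include <assert.h>

/* Build the four tables T_x_3 .. T_x_0 for one modulus x, then apply
   the double-lookup formula above. */
static unsigned char T[4][256];

static void build_tables(unsigned int x) {
    for (int k = 0; k < 4; k++)
        for (unsigned int y = 0; y < 256; y++)
            T[k][y] = (unsigned char)((((unsigned long long)y) << (8 * k)) % x);
}

static unsigned int table_mod(unsigned int z) {
    unsigned int sum = T[0][z & 255] + T[1][(z >> 8) & 255]
                     + T[2][(z >> 16) & 255] + T[3][(z >> 24) & 255];
    return T[0][sum];   /* sum <= 4*(x-1) < 256 as long as x <= 64 */
}
```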

The tables aren't really that large, since you can share some of
them. I got bored and wrote a Java program to compute them all;
there are 31 distinct 256-byte tables, or just under 8k total.
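The count of 31 can be checked mechanically. The following verification code is mine (not the Java program mentioned above); it generates every table and compares contents directly, so cross-modulus sharing (for example, the all-zeros table) counts once:

```c
#include <string.h>

/* Generate every table T_x_k for 1 <= x <= 16, 0 <= k <= 3, and count
   how many distinct 256-byte tables appear. */
static unsigned char distinct[64][256];

static int count_distinct_tables(void) {
    int count = 0;
    for (unsigned int x = 1; x <= 16; x++) {
        for (int k = 0; k < 4; k++) {
            unsigned char t[256];
            for (unsigned int y = 0; y < 256; y++)
                t[y] = (unsigned char)((((unsigned long long)y) << (8 * k)) % x);
            int seen = 0;
            for (int i = 0; i < count && !seen; i++)
                seen = (memcmp(distinct[i], t, 256) == 0);
            if (!seen)
                memcpy(distinct[count++], t, 256);
        }
    }
    return count;
}
```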

I got more bored and wrote a benchmark. The official no-remainder
code (included below, minus the table initializations, which are left
as an exercise to the reader or available on request) is about 25%
faster than computing with the hardware (when the hardware is an
800 MHz Pentium 3). I am thinking to myself that I know of a few
places where this might be useful in my own code. Note that this
could trivially be extended to 64 bits for a modulus as large as 32.

Now, once you have the remainder, if you are dealing with a small
range of integers and want division, then you can subtract out the
remainder and replace the division with a (possible) shift and a
multiply by the (odd) divisor's inverse in the ring of integers mod
2**N. (This is different from the inverse in the Alverson paper.)
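A sketch of that trick for an odd divisor. The function names are mine, and computing the inverse by Newton iteration is one standard choice, not necessarily the one intended above:

```c
#include <assert.h>

/* Once x mod d is known and subtracted out, x - (x mod d) is an exact
   multiple of the odd divisor d, so multiplying by d's multiplicative
   inverse in the ring of integers mod 2**32 yields the exact quotient. */
static unsigned int inverse_mod_2_32(unsigned int d) {   /* d must be odd */
    unsigned int inv = d;            /* d*d == 1 (mod 8): 3 correct bits */
    for (int i = 0; i < 4; i++)
        inv *= 2 - d * inv;          /* each Newton step doubles the precision */
    return inv;
}

static unsigned int div_by_odd(unsigned int x, unsigned int d,
                               unsigned int x_mod_d) {
    return (x - x_mod_d) * inverse_mod_2_32(d);
}
```

An even divisor d = odd * 2**s needs the extra shift mentioned above: shift x - (x mod d) right by s, then multiply by the inverse of the odd part.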

Or, you can modify the iterative division presented at the top.
Neither of these choices is likely to be faster than computing both
quotient and remainder with a single machine instruction, on those
machines that support it.

Table-driven mod follows:

static unsigned char * mod_tables [17][4] = {
    {table_1_0,table_1_0,table_1_0,table_1_0}, /* Dummy table */
    {table_1_0,table_1_0,table_1_0,table_1_0},
    {table_2_0,table_1_0,table_1_0,table_1_0},
    {table_3_0,table_3_0,table_3_0,table_3_0},
    {table_4_0,table_1_0,table_1_0,table_1_0},
    {table_5_0,table_5_0,table_5_0,table_5_0},
    {table_6_0,table_6_1,table_6_1,table_6_1},
    {table_7_0,table_7_1,table_7_2,table_7_0},
    {table_8_0,table_1_0,table_1_0,table_1_0},
    {table_9_0,table_9_1,table_9_2,table_9_0},
    {table_10_0,table_10_1,table_10_1,table_10_1},
    {table_11_0,table_11_1,table_11_2,table_11_3},
    {table_12_0,table_12_1,table_12_1,table_12_1},
    {table_13_0,table_13_1,table_13_2,table_13_0},
    {table_14_0,table_14_1,table_14_2,table_14_3},
    {table_15_0,table_15_0,table_15_0,table_15_0},
    {table_16_0,table_1_0,table_1_0,table_1_0}
};

int mod(unsigned int x, unsigned int m) {
    int x0 = 255 & (x >> 0);
    int x1 = 255 & (x >> 8);
    int x2 = 255 & (x >> 16);
    int x3 = 255 & (x >> 24);
    unsigned char * t0 = mod_tables[m][0];
    unsigned char * t1 = mod_tables[m][1];
    unsigned char * t2 = mod_tables[m][2];
    unsigned char * t3 = mod_tables[m][3];

    return t0[t0[x0]+t1[x1]+t2[x2]+t3[x3]];
}

You can shrink this slightly (removing table_{2,4,8,16}_0 ) and
improve performance slightly (assuming a uniform distribution in the
1-16 range) by prechecking for modulus by a power of two:

int modB(unsigned int x, unsigned int m) {
    unsigned int mask = m-1;
    if (m == (m & ~mask)) {  /* m is a power of two */
        return x & mask;
    } else {
        ...
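A self-contained version of the same idea, with the omitted table initializations done at runtime. That is an assumption of this sketch; the post precomputes and shares the tables:

```c
#include <assert.h>

/* Precheck for a power-of-two modulus, else fall through to the
   four-table lookup.  tables[m][k][y] = (y << 8k) mod m, built at
   runtime here, so table sharing costs nothing extra to express. */
static unsigned char tables[17][4][256];

static void init_tables(void) {
    for (unsigned int m = 1; m <= 16; m++)
        for (int k = 0; k < 4; k++)
            for (unsigned int y = 0; y < 256; y++)
                tables[m][k][y] =
                    (unsigned char)((((unsigned long long)y) << (8 * k)) % m);
}

static unsigned int modB_sketch(unsigned int x, unsigned int m) {
    unsigned int mask = m - 1;
    if (m == (m & ~mask))            /* m is a power of two */
        return x & mask;
    return tables[m][0][tables[m][0][x & 255]
                      + tables[m][1][(x >> 8) & 255]
                      + tables[m][2][(x >> 16) & 255]
                      + tables[m][3][(x >> 24) & 255]];
}
```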

The timings on my machine for 115,200,000 mods:

1. straight remainder 8.9s (62 cycles)
2. simple table 6.8s (47 cycles)
3. check + table/mask 5.9s (41 cycles)

Do note that this is for unsigned arithmetic. Doing it for signed
arithmetic is messier. The modified code is:

int modC(unsigned int x, unsigned int m) {
    int mask = (int) x >> 31;      /* all ones if x is negative, else zero */
    x = (x ^ mask) - mask;         /* absolute value of x */
    {
        int x0 = 255 & (x >> 0);
        int x1 = 255 & (x >> 8);
        int x2 = 255 & (x >> 16);
        int x3 = 255 & (x >> 24);
        unsigned char * t0 = mod_tables[m][0];
        unsigned char * t1 = mod_tables[m][1];
        unsigned char * t2 = mod_tables[m][2];
        unsigned char * t3 = mod_tables[m][3];

        return ((t0[t0[x0]+t1[x1]+t2[x2]+t3[x3]]) ^ mask) - mask;  /* reapply the sign */
    }
}

With timings

4. table + signage 8.5s
5. signed remainder 8.7s

Notice that (for whatever reason, could be compilation artifacts) the
signed remainder instruction is faster than the unsigned remainder
instruction. The table is still a hair faster, but only a hair with