13 Oct 1999 00:57:42 -0400

Related articles |
---|

Modulo optimizations gmarshal@globalnet.co.uk (Graham Marshall) (1999-10-04) |

Re: Modulo optimizations jgk@jgk.org (Joe Keane) (1999-10-11) |

Re: Modulo optimizations torbenm@diku.dk (1999-10-13) |

Re: Modulo optimizations chase@world.std.com (David Chase) (1999-10-13) |

Re: Modulo optimizations ger@informatik.uni-bremen.de (George Russell) (1999-10-14) |

Re: Modulo optimizations harley@corton.inria.fr (Robert Harley) (1999-10-14) |

Re:Modulo optimizations Wilco.Dijkstra@arm.com (Wilco Dijkstra) (1999-10-19) |

Re: Modulo optimizations harley@corton.inria.fr (Robert Harley) (1999-10-21) |

Re: Modulo optimizations Peter-Lawrence.Montgomery@cwi.nl (1999-10-27) |

From: | torbenm@diku.dk (Torben AEgidius Mogensen) |

Newsgroups: | comp.compilers |

Date: | 13 Oct 1999 00:57:42 -0400 |

Organization: | Department of Computer Science, U of Copenhagen |

References: | 99-10-017 |

Keywords: | arithmetic, performance |

That modulo with powers of two can be done for unsigned numbers by

bitwise AND is a standard trick and done by many compilers. For signed

numbers, the problem is that most languages say that modulos of

negative numbers is negative, for eaxmple (-7)%3 == -1, so you need a

test for these. Hence, it is not a clear win for machines that have

division hardware.

For machines without division hardware, you can also do fast(ish)

modulo (2^n-1), using a method similar to the way we find modulo 3 or

9 of decimal numbers by summing digits. This also is for unsigned

numbers only. For x%3, you add all bit-pairs and reduce this sum again

in the same way until you are down to two bits. You can do some of the

summing in parallel, i.e.:

a = x & 0x33333333; /* even two-bit groups */

b = x & 0xcccccccc; /* odd two-bit groups */

sum = a + (b >> 2); /* sum 0-6 in 8 groups */

sum = sum + (sum >> 2); /* sum 0-3 in 8 groups */

sum = sum & 0x33333333; /* clear garbage bits */

sum = sum + (sum >> 4); /* sum 0-6 in 4 groups */

sum = sum + (sum >> 2); /* sum 0-3 in 4 groups */

sum = sum & 0x33333333; /* clear garbage bits */

sum = sum + (sum >> 8); /* sum 0-6 in 2 groups */

sum = sum + (sum >> 2); /* sum 0-3 in 2 groups */

sum = sum & 0x33333333; /* clear garbage bits */

sum = sum + (sum >> 16); /* sum 0-6 in 1 group */

sum = sum + (sum >> 2); /* sum 0-3 in 1 group */

sum = sum & 0x3; /* clear garbage bits */

After this (if I haven't made any mistakes along the way), sum

contains x%3. On a processor that can do scaled addition (e.g., ARM),

each lina above can be done in one instruction (assuming the constants

used are set up in registers beforehand). On ARM7/StrongARM, this will

take 14 cycles which, while not lightning-fast, is a good deal faster

than full division by software. The same trick can be done for other

(2^n-1) and (mostly) in fewer instructions since there are fewer

groups to add.

Torben Mogensen (torbenm@diku.dk)

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.