# Re: Modulo optimizations

## torbenm@diku.dk (Torben AEgidius Mogensen)13 Oct 1999 00:57:42 -0400

From comp.compilers

Related articles
Modulo optimizations gmarshal@globalnet.co.uk (Graham Marshall) (1999-10-04)
Re: Modulo optimizations jgk@jgk.org (Joe Keane) (1999-10-11)
Re: Modulo optimizations torbenm@diku.dk (1999-10-13)
Re: Modulo optimizations chase@world.std.com (David Chase) (1999-10-13)
Re: Modulo optimizations ger@informatik.uni-bremen.de (George Russell) (1999-10-14)
Re: Modulo optimizations harley@corton.inria.fr (Robert Harley) (1999-10-14)
Re:Modulo optimizations Wilco.Dijkstra@arm.com (Wilco Dijkstra) (1999-10-19)
Re: Modulo optimizations harley@corton.inria.fr (Robert Harley) (1999-10-21)
Re: Modulo optimizations Peter-Lawrence.Montgomery@cwi.nl (1999-10-27)
| List of all articles for this month |

 From: torbenm@diku.dk (Torben AEgidius Mogensen) Newsgroups: comp.compilers Date: 13 Oct 1999 00:57:42 -0400 Organization: Department of Computer Science, U of Copenhagen References: 99-10-017 Keywords: arithmetic, performance

That modulo with powers of two can be done for unsigned numbers by
bitwise AND is a standard trick and done by many compilers. For signed
numbers, the problem is that most languages say that modulos of
negative numbers is negative, for eaxmple (-7)%3 == -1, so you need a
test for these. Hence, it is not a clear win for machines that have
division hardware.

For machines without division hardware, you can also do fast(ish)
modulo (2^n-1), using a method similar to the way we find modulo 3 or
9 of decimal numbers by summing digits. This also is for unsigned
numbers only. For x%3, you add all bit-pairs and reduce this sum again
in the same way until you are down to two bits. You can do some of the
summing in parallel, i.e.:

a = x & 0x33333333; /* even two-bit groups */
b = x & 0xcccccccc; /* odd two-bit groups */
sum = a + (b >> 2); /* sum 0-6 in 8 groups */
sum = sum + (sum >> 2); /* sum 0-3 in 8 groups */
sum = sum & 0x33333333; /* clear garbage bits */
sum = sum + (sum >> 4); /* sum 0-6 in 4 groups */
sum = sum + (sum >> 2); /* sum 0-3 in 4 groups */
sum = sum & 0x33333333; /* clear garbage bits */
sum = sum + (sum >> 8); /* sum 0-6 in 2 groups */
sum = sum + (sum >> 2); /* sum 0-3 in 2 groups */
sum = sum & 0x33333333; /* clear garbage bits */
sum = sum + (sum >> 16); /* sum 0-6 in 1 group */
sum = sum + (sum >> 2); /* sum 0-3 in 1 group */
sum = sum & 0x3; /* clear garbage bits */

After this (if I haven't made any mistakes along the way), sum
contains x%3. On a processor that can do scaled addition (e.g., ARM),
each lina above can be done in one instruction (assuming the constants
used are set up in registers beforehand). On ARM7/StrongARM, this will
take 14 cycles which, while not lightning-fast, is a good deal faster
than full division by software. The same trick can be done for other
(2^n-1) and (mostly) in fewer instructions since there are fewer