Re: 32-bit vs. 64-bit x86 Speed

Michael Meissner <>
13 Apr 2007 01:34:18 -0400

          From comp.compilers

Related articles
32-bit vs. 64-bit x86 Speed (Jon Forrest) (2007-04-11)
Re: 32-bit vs. 64-bit x86 Speed (glen herrmannsfeldt) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed (Marco van de Voort) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed (Hans-Peter Diettrich) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed (Ian Rogers) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed (Michael Meissner) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed (George Peter Staplin) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed (Michael Tiomkin) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed (Tony Finch) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed (Hans-Peter Diettrich) (2007-04-14)
Re: 32-bit vs. 64-bit x86 Speed (Hans-Peter Diettrich) (2007-04-14)
[7 later articles]
| List of all articles for this month |

From: Michael Meissner <>
Newsgroups: comp.compilers
Date: 13 Apr 2007 01:34:18 -0400
Organization: Compilers Central
References: 07-04-031
Keywords: architecture, performance
Posted-Date: 13 Apr 2007 01:34:18 EDT

Jon Forrest <> writes:

> I don't think it's worth messing with 64-bit computing for apps that
> don't need the address space.
> Let's say you're a Linux user who never needs to run programs that
> don't fit in 32-bits. Would you run a 32-bit or a 64-bit version of
> Linux? You compiler people probably have intimate knowledge of the ISA
> issues here so I'm interested in what you have to say.
> Cordially,

Note, the following are my opinions, and not that of my employer, nor of the
Free Software Foundation in terms of gcc support.

In general, the answer to this kind of question is: measure your own app,
because there are many, many, many different variables that can affect the
answer.

Note, most Linux 64-bit distributions will allow you to run and develop 32-bit
applications as well as 64-bit applications, so you can mix and match
(Debian/Ubuntu systems don't by default provide the 32-bit libraries, but you
can install them later). Using gcc, the switch to compile code for 32-bit is
-m32, and -m64 for 64-bit.

Off the top of my head, some of the differences include:

1) What compiler do you use? Different compilers provide different amounts of
      optimization, and even if you stick to the GCC provided by your
      distribution, different distributions will ship newer or older compilers.

2) Which compiler options do you use? Some optimizations are not turned on by
      default, even with say -O2. Investigate which options your compiler has
      available. For example, gcc's vectorization support (-ftree-vectorize) is
      not yet turned on by -O2. Whole program optimization and profile guided
      feedback are other options that should be explored.
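As a concrete example of point 2, a loop like the one below is the sort of
code -ftree-vectorize can turn into xmm instructions; restrict is what assures
the compiler the arrays don't overlap, which is often what lets the vectorizer
fire at all (the function name is just for illustration):

```c
/* Try: gcc -O2 -ftree-vectorize -S saxpy.c and look for packed SSE
   instructions in the output (exact opcodes depend on -march). */
#include <stddef.h>

void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```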

3) What is the underlying chip architecture and memory subsystem? Different
      cores have different optimization strategies at the low level. This varies
      both by the manufacturer of the chips (typically AMD and Intel) but also
      within the different generations of chips that were made by the same chip
      vendor. The memory system and amount of cache on the particular chip can
      also play a big difference. Where you need every last ounce of speed, you
      need to compile your code targeting the specific chip you are running on
      (-march=<machine> under GCC), so that the compiler can better tune the
      instructions. Newer chip architectures also add new instructions that the
      compiler can take advantage of, but generic code can't use since it is the
      least common denominator.

4) Do you use the x87 floating point stack or xmm registers? In 32-bit mode,
      the default is to use the x87 floating point stack instead of the xmm
      registers. As its name implies, the x87 floating point stack is not random
      access, and the compiler has to rearrange code to accommodate the pushing
      and popping. If your code vectorizes, the xmm registers have the
      possibility of doing 2 or 4 operations in parallel, though whether
      vectorization is much faster than scalar code depends on the low level
      details of the chip. In addition, by default the x87 stack does its
      calculations in 80-bit extended precision, which can mean slower multiplies
      and divides. Newer gcc's have options to use the xmm registers in 32-bit
      mode, but the resulting code won't run on that moldy old i386 you have
      lying around.
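The options in question: on newer gcc you can add -msse2 -mfpmath=sse to a
32-bit compile to get the xmm behavior described above (assuming an
SSE2-capable chip, i.e. Pentium 4 / K8 or later). A kernel like this is the
sort of code that benefits:

```c
/* With -mfpmath=sse this compiles to mulsd/addsd on xmm registers; with
   the default 32-bit x87 code it becomes fmul/fadd stack operations,
   computed in 80-bit extended precision. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```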

5) Do you use tuned math and string libraries? Not all distributions have
      tuned math/string libraries, and even in the case of distributions that
      do, the libraries are often tuned for a generic machine. You can get
      better tuned libraries from your chip vendor that can greatly improve
      performance. Note, even in the case of default libraries, things like the
      floating point stack and calling sequence can affect things.

6) What is the ABI used for calling functions? The standard 32-bit ABI passes
      all arguments in memory, while in 64-bit, the first few arguments are passed
      in registers. Particularly for small leaf functions, this can avoid stores
      and loads to access the arguments.
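A small leaf function of the kind point 6 describes (the function itself is
hypothetical; the register assignments are from the x86-64 System V ABI):

```c
/* 64-bit: v, lo, hi arrive in %edi, %esi, %edx and never touch memory.
   32-bit: all three are pushed on the stack by the caller and must be
   loaded before use, so the call costs extra stores and loads. */
int clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}
```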

7) Number of registers available? Some functions will be able to use more
      registers if they are available. As you mention, in 64-bit mode, there are
      double the number of GPRs and XMMs that the compiler can use, which means
      the compiler is less likely to need to spill a register. The downside is to
      access these registers, you need one more byte for the REX prefix to specify
      the bit for the high register.

8) Do you use position independent code? If you make shared libraries, the
      64-bit PIC code is more efficient than the 32-bit code, since the %rip
      register allows PC-relative references directly, while in 32-bit mode, the
      code has to do a CALL instruction to get the current address, and use a
      register to hold the address.

9) What is your d-cache effectiveness? Because 32-bit code has smaller
      pointers and integers, the d-cache is more effective. It depends on the
      program whether you are filling the cache on a particular chipset or not.
      Some of the Spec benchmarks are better in 32-bit mode because of the cache.

10) Do you use software prefetching for vector codes? Depending on the chip
        architecture, software prefetching can make quite a difference in
        performance.
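GCC's hook for point 10 is __builtin_prefetch; a sketch, where PREFETCH_AHEAD
is a made-up tuning constant whose right value depends on the chip's memory
latency:

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16   /* tuning knob: how far ahead to fetch */

double sum_array(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            /* (address, 0 = read access, 1 = low temporal locality) */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        s += a[i];
    }
    return s;
}
```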

11) What boundary is the code in your loops aligned to? 64-bit compilers might
        align the code for loops to higher boundaries than 32-bit, which makes the
        program slightly bigger, but can speed up processing that loop.

Michael Meissner
