Related articles
32-bit vs. 64-bit x86 Speed jlforrest@berkeley.edu (Jon Forrest) (2007-04-11)
Re: 32-bit vs. 64-bit x86 Speed gah@ugcs.caltech.edu (glen herrmannsfeldt) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed marcov@stack.nl (Marco van de Voort) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed anton@mips.complang.tuwien.ac.at (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed ian.rogers@manchester.ac.uk (Ian Rogers) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed meissner@the-meissners.org (Michael Meissner) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed georgeps@xmission.com (George Peter Staplin) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed tmk@netvision.net.il (Michael Tiomkin) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed dot@dotat.at (Tony Finch) (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed kenney@cix.compulink.co.uk (2007-04-13)
Re: 32-bit vs. 64-bit x86 Speed DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-14)
Re: 32-bit vs. 64-bit x86 Speed DrDiettrich1@aol.com (Hans-Peter Diettrich) (2007-04-14)
[7 later articles]
From: Michael Meissner <meissner@the-meissners.org>
Newsgroups: comp.compilers
Date: 13 Apr 2007 01:34:18 -0400
Organization: Compilers Central
References: 07-04-031
Keywords: architecture, performance
Posted-Date: 13 Apr 2007 01:34:18 EDT
Jon Forrest <jlforrest@berkeley.edu> writes:
> "I don't think it's worth messing with 64-bit computing for apps that
> don't need the address space."
>
> Let's say you're a Linux user who never needs to run programs that
> don't fit in 32-bits. Would you run a 32-bit or a 64-bit version of
> Linux? You compiler people probably have intimate knowledge of the ISA
> issues here so I'm interested in what you have to say.
>
> Cordially,
Note: the following are my opinions, and not those of my employer, nor of the
Free Software Foundation in terms of gcc support.
In general, the answer to this kind of question is to measure your own app,
because there are many, many, many different variables that can affect the
performance.
Note that most 64-bit Linux distributions will allow you to run and develop
32-bit applications as well as 64-bit applications, so you can mix and match
(Debian/Ubuntu systems don't provide the 32-bit libraries by default, but you
can install them later). Using gcc, the switch to compile 32-bit code is
-m32; a small sanity check is sketched just below.
Off the top of my head, some of the differences include:
1) What compiler do you use? Different compilers provide different amounts of
optimization, and even if you stick to the GCC provided by your
distribution, different distributions will ship newer or older compilers.
2) Which compiler options do you use? Some optimizations are not turned on by
   default, even with, say, -O2. Investigate which options your compiler has
   available. For example, gcc's vectorization support (-ftree-vectorize) is
   not yet turned on by -O2. Whole program optimization and profile guided
   feedback are other options that should be explored. (The compile-flag
   sketch after this list exercises several of these switches, along with the
   -march= and -mfpmath= options from the next two points.)
3) What is the underlying chip architecture and memory subsystem? Different
   cores have different optimization strategies at the low level. This varies
   not only by the manufacturer of the chips (typically AMD and Intel) but
   also across the different generations of chips made by the same vendor.
   The memory system and amount of cache on the particular chip can also make
   a big difference. Where you need every last ounce of speed, you need to
   compile your code targeting the specific chip you are running on
   (-march=<machine> under GCC), so that the compiler can better tune the
   instructions. Newer chip architectures also add new instructions that the
   compiler can take advantage of, but generic code can't use them, since it
   has to target the least common denominator.
4) Do you use the x87 floating point stack or the xmm registers? In 32-bit
   mode, the default is to use the x87 floating point stack instead of the
   xmm registers. As its name implies, the x87 floating point stack is not
   random access, and the compiler has to rearrange code to accommodate the
   pushing and popping. If your code vectorizes, the xmm registers have the
   possibility of doing 2 or 4 operations in parallel, though it depends on
   the low level details of the chip whether vectorization is much faster
   than scalar code. In addition, by default the x87 stack does its
   calculations in 80-bit precision, which can mean slower multiplies and
   divides. Newer versions of gcc have options to use the xmm registers in
   32-bit mode (-msse2 -mfpmath=sse, shown in the compile-flag sketch after
   this list), but then the code won't run on that moldy old i386 you have
   lying around.
5) Do you use tuned math and string libraries? Not all distributions have
   tuned math/string libraries, and even in distributions that do, the
   libraries are often tuned for a generic machine. You can get better tuned
   libraries from your chip vendor that can greatly improve performance. Note
   that even in the case of the default libraries, things like the floating
   point stack and calling sequence can affect things.
6) What is the ABI used for calling functions? The standard 32-bit ABI passes
   all arguments in memory, while in 64-bit mode the first few arguments are
   passed in registers. Particularly for small leaf functions, this can avoid
   stores and loads to access the arguments (the calling-convention sketch
   after this list shows the difference).
7) How many registers are available? Some functions will be able to use more
   registers if they are available. As you mention, in 64-bit mode there are
   double the number of GPRs and XMM registers that the compiler can use,
   which means the compiler is less likely to need to spill a register. The
   downside is that accessing the additional registers takes one more byte
   for the REX prefix to specify the bit for the high register.
8) Do you use position independent code? If you make shared libraries, 64-bit
   PIC code is more efficient than 32-bit PIC code, since %rip-relative
   addressing lets the code make PC-relative references directly, while in
   32-bit mode the code has to do a CALL instruction to get the current
   address and then tie up a register to hold it (see the PIC sketch after
   this list).
9) What is your d-cache effectiveness? Because 32-bit code has smaller
   pointers and integers, the d-cache is more effective. It depends on the
   program whether you are filling the cache on a particular chip or not.
   Some of the SPEC benchmarks are better in 32-bit mode because of the
   cache.
10) Do you use software prefetching for vector codes? Depending on the chip
    architecture, software prefetching can make quite a difference in
    performance (a prefetch sketch follows this list).
11) What boundary is the code in your loops aligned to? 64-bit compilers might
align the code for loops to higher boundaries than 32-bit, which makes the
program slightly bigger, but can speed up processing that loop.
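Here is the compile-flag sketch referred to in points 2-4. It is only a
sketch: whether the loop actually gets vectorized depends on your gcc version
and -march= setting, so read the generated assembly (compile with -S) to
check.

    /* saxpy.c -- a small loop the vectorizer can usually handle.
     *
     * Baseline:        gcc -O2 -c saxpy.c
     * Vectorized:      gcc -O2 -ftree-vectorize -c saxpy.c
     * Chip-specific:   gcc -O2 -ftree-vectorize -march=nocona -c saxpy.c
     *                  (substitute an -march= value your gcc knows for your
     *                   chip; newer gcc also accepts -march=native)
     * 32-bit with xmm: gcc -m32 -O2 -msse2 -mfpmath=sse -c saxpy.c
     *                  (scalar FP goes through the xmm registers instead of
     *                   the x87 stack; the result needs an SSE2-capable chip)
     */
    void saxpy(int n, float a, const float *__restrict__ x,
               float *__restrict__ y)
    {
        /* __restrict__ promises x and y don't overlap, which makes the
           vectorizer's job easier.  */
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }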
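The calling-convention sketch from point 6, assuming the standard i386 and
x86-64 SysV ABIs used by Linux:

    /* leaf.c -- compare argument passing in the two ABIs.
     *
     *   gcc -m32 -O2 -S leaf.c -o leaf32.s
     *   gcc -m64 -O2 -S leaf.c -o leaf64.s
     *
     * In leaf32.s the arguments are loaded from the stack (4(%esp),
     * 8(%esp), 12(%esp)); in leaf64.s they arrive in %rdi, %rsi and %rdx,
     * so the leaf function does no argument loads at all.
     */
    long add3(long a, long b, long c)
    {
        return a + b + c;
    }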
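The PIC sketch from point 8 (the exact name of the 32-bit PC thunk varies
between gcc versions):

    /* pic.c -- position-independent access to a global.
     *
     *   gcc -m32 -O2 -fPIC -S pic.c -o pic32.s
     *   gcc -m64 -O2 -fPIC -S pic.c -o pic64.s
     *
     * In pic64.s the counter is reached with a single %rip-relative
     * reference.  In pic32.s the function first calls a helper (named
     * something like __i686.get_pc_thunk.cx) to load the current PC into a
     * register, and then addresses the counter at an offset from that
     * register, costing extra instructions and tying up the register.
     */
    static int counter;

    int bump(void)
    {
        return ++counter;
    }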
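And the prefetch sketch from point 10. The prefetch distance below is just a
made-up starting point; the right value depends on the chip's memory latency
and has to be found by measurement (gcc can also insert prefetches itself
with -fprefetch-loop-arrays).

    /* prefetch.c -- manual software prefetching with gcc's
     * __builtin_prefetch(addr, rw, locality).  Prefetches of addresses past
     * the end of the array are harmless; the prefetch instructions don't
     * fault.
     */
    double sum(const double *a, int n)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            /* Hint that a[i + 64] will be wanted soon: read access (0),
               moderate temporal locality (1).  */
            __builtin_prefetch(&a[i + 64], 0, 1);
            s += a[i];
        }
        return s;
    }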
--
Michael Meissner
email: mrmnews@the-meissners.org
http://www.the-meissners.org