| From: | anton@mips.complang.tuwien.ac.at |
| Newsgroups: | comp.compilers |
| Date: | Sat, 06 Sep 2025 17:15:18 +0000 |
| Organization: | Compilers Central |
| References: | 25-08-002 25-08-004 25-08-009 |
| Keywords: | C, standards |
| Posted-Date: | 06 Sep 2025 18:53:08 EDT |
David Brown <david.brown@hesbynett.no> writes:
>On 21/08/2025 07:44, anton@mips.complang.tuwien.ac.at wrote:
>> Martin Ward <mwardgkc@gmail.com> writes:
>Imagine
>if car manufacturers had to limit the speeds of new cars to 10 miles per
>hour, because some drivers a century ago assumed that they could safely
>put their foot flat on the accelerator without hitting the horse and
>cart in front of them.
The latter assumption is wrong even with 10mph. If cars had been limited
to 10mph, that would hopefully have prevented the kind of "progress" that
is taking more than 1M lives per year, every year. But that's a different
discussion.
>> And the practice is that the people in C compiler maintenance reject
>> bug reports as RESOLVED INVALID when the code exercises undefined
>> behaviour, even when the code works as intended in earlier versions of
>> the compiler and when the breakage could be easily fixed (e.g., for
>> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66804> and
>> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709> by using movdqu
>> instead of movdqa).
...
>But the solution is certainly /not/ to say that people writing correct
>C code and compiling with high optimisations should get slower results
>because someone else previously wrote code that made unwarranted and
>unchecked assumptions about particular compilers and particular target
>processors.
Ah, yes, that claim, as usual without empirical support. I actually measured it
for such a claim made in <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709>,
and found no performance advantage from using movdqa instead of movdqu on K10
and Core 2 (the CPUs that were claimed to benefit from movdqa), nor on Sandy
Bridge, Haswell, or Skylake. The biggest speed difference in favour of movdqa
was a factor of 1.0014, on Core 2, but there it would have been better to just
use scalar code. Read all about it at
<http://www.complang.tuwien.ac.at/anton/autovectors/>.
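To make the pattern concrete, here is a minimal sketch (not the code from
either bug report) of the kind of access involved:

#include <stdint.h>
#include <string.h>

/* UB version: dereferencing a cast pointer that may be misaligned.
   This works on x86 as long as the compiler emits unaligned loads, but
   a vectorizer that trusts the alignment implied by uint64_t may emit
   movdqa, which faults on a misaligned address. */
uint64_t sum64_cast(const unsigned char *p, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i + 8 <= n; i += 8)
        s += *(const uint64_t *)(p + i);
    return s;
}

/* Defined-behaviour version: memcpy into a local; compilers turn this
   into the same unaligned load (movdqu when vectorized). */
uint64_t sum64_memcpy(const unsigned char *p, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i + 8 <= n; i += 8) {
        uint64_t v;
        memcpy(&v, p + i, sizeof v);
        s += v;
    }
    return s;
}

That is the point of the measurements above: emitting movdqu instead of
movdqa costs essentially nothing even when the data happens to be aligned.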
>> But they do not always do so: The SATD function from the SPEC benchmark
>> 464.h264ref exercises undefined behaviour, and a pre-release version
>> of gcc-4.8 generated code that did not behave as intended. The
>> release version of gcc-4.8 compiled 464.h264ref as intended (but later
>> a similar case that was not in a SPEC program
>> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66875> was rejected as
>> RESOLVED INVALID).
>
>So the gcc developers made an exception for a particularly important,
>useful and common case?
No, they made an exception for a benchmark.
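For reference, the SATD loop at issue is roughly of this shape (a
reconstruction, so details may differ from the 464.h264ref source):

/* d has 16 elements, but the update expression dd = d[++k] reads d[16]
   on the final iteration, before the k < 16 test fails -- an
   out-of-bounds read, i.e. undefined behaviour. */
static int satd_like(const int d[16])
{
    int k, dd, satd = 0;
    for (dd = d[k = 0]; k < 16; dd = d[++k])
        satd += (dd < 0 ? -dd : dd);
    return satd;
}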
>> When I brought this up, the reactions ranged from
>> flat-out denial that it ever happened (despite it being widely
>> publicized <https://lwn.net/Articles/544123/>) through a claim that
>> the "optimization" turned out to have no benefit (and yet the similar
>> case mentioned above still was "optimized" in a later gcc version) to
>> a statement along the lines that 464.h264ref is a relevant benchmark.
>>
>
>Maybe this particular case was handled badly, or at least the
>communications involved were bad. It was over a decade ago, in a
>pre-release candidate compiler. (Pre-release candidates are used
>precisely to check if changes cause trouble with real-world code.) How
>long are you going to hold a grudge about this?
Have they changed their attitude since then? If not, why should it matter
that this was over ten years ago?
>> They also have their test suites of programs for regression testing,
>> and any behavioural change in these programs that is visible in this
>> regression testing probably leads to applying the optimization in a
>> less aggressive way.
>>
>
>I would assume that they try to avoid UB in their test suite code
>(though of course gcc developers can have bugs and mistakes like anyone
>else).
Throwing out all programs with undefined behaviour from their test suite would
probably reduce the test suite a lot, and would ensure that regressions like not
compiling the Linux kernel as intended would reappear all the time.
>> How do tests get added into the regression test suite? Ideally, if
>> somebody reports a case where a program behaves in one way in an
>> earlier version of the same compiler and differently in a later
>> version, that program and its original behaviour should usually be
>> added to the test suite
>> <https://www.complang.tuwien.ac.at/papers/ertl17kps.pdf>, but in gcc
>> this does not happen (see the bug reports linked to above).
>
>In what bizarre world would that be "ideal" ?
In a world where an existing program that works as intended on one version of
the compiler is expected to work on later versions of the compiler.
>> In other cases, in particular
>> -fno-tree-vectorize, using the flag just avoids slowdowns from the
>> "optimization".
>
>You know better than the solid majority of programmers that
>"optimisation" is as much an art as a science
I always thought optimization was engineering. Anyway, to actually back up my
claim with numbers (unlike the handwaving that usually goes along with claims of
speedups from assuming that C programs don't perform undefined behaviour),
here's some data.
The measurements were done with Gforth commit
4224ab5fafea970dade64b04493ef690da8b3c32, compiled on Debian 12
(gcc-12.2.0) and run on core 1 of a Ryzen 8700G (Zen 4, ~5GHz). Two
versions were measured:
gforth-fast-no-tree-vectorize is the gforth-fast built by default.
gforth-fast-tree-vectorize is built by removing "no-tree-vectorize"
from configure.ac and rebuilding from scratch.
Here are numbers from running "gforth-fast-... onebench.fs". The
numbers are times in seconds.
sieve bubble matrix fib fft
0.020 0.021 0.011 0.029 0.014 gforth-fast-no-tree-vectorize
0.365 0.369 0.348 0.435 0.184 gforth-fast-tree-vectorize
So that's slowdown factors of 13.1 (fft: 0.184/0.014) to 31.6 (matrix:
0.348/0.011) from using tree-vectorize.
Where is that coming from?
The first thing I notice is that gforth-fast-tree-vectorize sanity checks the
code produced by gcc and decides to disable dynamic code generation and all the
optimizations that build on that. So let's disable that for
gforth-fast-no-tree-vectorize, too:
sieve bubble matrix fib fft
0.020 0.021 0.011 0.029 0.014 gforth-fast-no-tree-vectorize
0.145 0.134 0.120 0.145 0.057 gforth-fast-no-tree-vectorize --no-dynamic
0.365 0.369 0.348 0.435 0.184 gforth-fast-tree-vectorize
So -ftree-vectorize achieves a slowdown factor of 4.1-10.9 by disabling Gforth's
dynamic code generation, and a further slowdown by a factor of 2.5-3.2 beyond that. Where
does the latter come from? Let's look at the Forth word "@", which loads a cell
(a machine word) from memory:
For gforth-fast-no-tree-vectorize --no-dynamic
' disasm-gdb is discode ok
see @
Code @
0x0000558ff440e50f <gforth_engine2+6927>: add $0x8,%rbx
0x0000558ff440e513 <gforth_engine2+6931>: mov 0x0(%r13),%r13
0x0000558ff440e517 <gforth_engine2+6935>: mov (%rbx),%rax
0x0000558ff440e51a <gforth_engine2+6938>: jmp *%rax
end-code
The second instruction does the actual work; the rest is threaded-code
dispatch (which is optimized away in typical code if dynamic code generation
is enabled).
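For readers who do not know the engine, here is a minimal sketch (not
Gforth's actual source) of this kind of threaded-code dispatch using GNU C's
labels-as-values:

#include <stdint.h>

typedef intptr_t Cell;

/* ip walks an array of code addresses; sp[0] is the top of the data
   stack.  Every primitive ends with an indirect "goto *" through the
   next threaded-code cell. */
Cell engine(Cell *ip, Cell *sp, void ***labels_out)
{
    static void *labels[] = { &&do_fetch, &&do_bye };
    if (labels_out) {            /* first call: hand out the label table */
        *labels_out = labels;    /* so the caller can build threaded code */
        return 0;
    }
    goto *(void *)*ip++;         /* start executing */

do_fetch:                        /* the Forth word "@" */
    sp[0] = *(Cell *)sp[0];      /* the one instruction doing real work */
    goto *(void *)*ip++;         /* threaded-code dispatch */

do_bye:
    return sp[0];
}

Gforth's engine is organized along these lines (plus the dynamic code
generation that the sanity check mentioned above ends up disabling), which
is why every primitive ends in an indirect jump.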
Now with gforth-fast-tree-vectorize:
Code @
0x000055aa501f75e6 <gforth_engine2+11238>: add $0x8,%rbx
0x000055aa501f75ea <gforth_engine2+11242>: mov (%r8),%rcx
0x000055aa501f75ed <gforth_engine2+11245>: mov (%rbx),%rax
0x000055aa501f75f0 <gforth_engine2+11248>: mov %r14,0x8(%rsp)
0x000055aa501f75f5 <gforth_engine2+11253>: mov %rax,%r11
0x000055aa501f75f8 <gforth_engine2+11256>: mov %r15,%r9
0x000055aa501f75fb <gforth_engine2+11259>: mov %rcx,0x10(%rsp)
0x000055aa501f7600 <gforth_engine2+11264>: jmp 0x55aa501f4a99 <gforth_engine2+153>
end-code
0x55aa501f4a99 56 discode
0x000055aa501f4a99 <gforth_engine2+153>: movq 0x8(%rsp),%xmm0
0x000055aa501f4a9f <gforth_engine2+159>: movq %r9,%xmm1
0x000055aa501f4aa4 <gforth_engine2+164>: movhps 0x8(%rsp),%xmm1
0x000055aa501f4aa9 <gforth_engine2+169>: movhps 0x10(%rsp),%xmm0
0x000055aa501f4aae <gforth_engine2+174>: movhlps %xmm0,%xmm5
0x000055aa501f4ab1 <gforth_engine2+177>: movq %xmm0,%r14
0x000055aa501f4ab6 <gforth_engine2+182>: movq %xmm1,%r15
0x000055aa501f4abb <gforth_engine2+187>: movhps %xmm1,0x18(%rsp)
0x000055aa501f4ac0 <gforth_engine2+192>: movq %xmm5,%r8
0x000055aa501f4ac5 <gforth_engine2+197>: mov %r15,%rdi
0x000055aa501f4ac8 <gforth_engine2+200>: mov %r14,%rsi
0x000055aa501f4acb <gforth_engine2+203>: mov %r8,%rcx
0x000055aa501f4ace <gforth_engine2+206>: jmp *%r11
GCC produced similar code in the 3.x timeframe without auto-vectorization, but
they eventually managed to fix that. My guess at what is happening here is that
the auto-vectorizer tries to vectorize accesses to adjacent memory locations
somewhere in gforth_engine2(); this reduces the precision of the liveness
tracking, resulting in all these register-register and register-memory moves,
which then migrate from their original places to the shared indirect jump that
gcc internally introduces for all the occurrences of "goto *" in the source code.
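To illustrate what vectorizing accesses to adjacent memory locations can look
like, here is a stand-alone example (hypothetical, not code from
gforth_engine2()):

/* With -O3, gcc's SLP vectorizer may combine these two adjacent 64-bit
   stores into a single 128-bit store, moving a and b through an xmm
   register first -- the same kind of xmm shuffling that shows up in the
   gforth-fast-tree-vectorize disassembly above. */
void store_pair(long *p, long a, long b)
{
    p[0] = a;
    p[1] = b;
}

If values that end up in xmm registers this way are live across the shared
dispatch jump, that would account for the extra moves.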
- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/