Related articles
  Possible data allocation and instruction scheduling algo... ollanes@prodigy.net (Orlando Llanes) (2002-01-24)
  Re: Possible data allocation and instruction scheduling algo... bryan.hayes@hayestechnologies.com (2002-01-30)
  Re: Possible data allocation and instruction scheduling algo... bear@sonic.net (Ray Dillinger) (2002-02-06)
  Re: Possible data allocation and instruction scheduling algo... ollanes@pobox.com (2002-02-16)
  Re: Possible data allocation and instruction scheduling algo... ollanes@pobox.com (2002-02-16)
From: "Orlando Llanes" <ollanes@prodigy.net>
Newsgroups: comp.compilers
Date: 24 Jan 2002 15:00:51 -0500
Organization: Prodigy Internet http://www.prodigy.com
Keywords: storage, design
Posted-Date: 24 Jan 2002 15:00:51 EST
I was thinking about an algorithm for padded data allocation, which
then led me to thoughts on instruction scheduling. This algorithm may
bring compiler-optimized code closer to hand-optimized code than it is
right now. Bear with me: all of this may have been thought of before,
and this post is not "complete". I also realize that better solutions
may be available; I just want to share my thoughts in case they are of
use to someone.
Data allocation
---------------
The algorithm works as follows: when data is declared and does not
fall on its natural alignment boundary, padding bytes are inserted.
The next time data is declared, the compiler checks whether one of the
padded areas can hold the new data while still keeping it aligned on
its natural boundary. If such an area is found, it is updated to
reflect the amount of padding that remains. Done right, the compiler
shouldn't need to keep track of more than 3 to 5 areas of padded data.
Obviously this presents cache issues, since data may be accessed
outside of its expected order, and it may cause problems when working
with data files (e.g. images, sound files, word-processing documents),
but it is generally useful when optimizing for size.
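
To make the idea concrete, here is a minimal sketch in C of how the
padding areas might be tracked. It assumes power-of-two alignments and
a small fixed table of holes; the names (Hole, layout_var, MAX_HOLES)
are made up for illustration, not taken from any real compiler.

#include <stddef.h>

#define MAX_HOLES 5                    /* "3 to 5 areas of padded data" */

typedef struct { size_t start, len; } Hole;

static Hole   holes[MAX_HOLES];
static int    num_holes;
static size_t next_offset;             /* current end of the data section */

/* Assign an offset to a variable of the given size and (power-of-two)
   alignment, reusing a padding hole when one fits. */
size_t layout_var(size_t size, size_t align)
{
    int i;
    for (i = 0; i < num_holes; i++) {
        size_t aligned = (holes[i].start + align - 1) & ~(align - 1);
        size_t end     = holes[i].start + holes[i].len;
        if (aligned + size <= end) {
            holes[i].start = aligned + size;    /* shrink the hole */
            holes[i].len   = end - holes[i].start;
            return aligned;
        }
    }
    /* No hole fits: place it at the end, remembering any new padding. */
    {
        size_t aligned = (next_offset + align - 1) & ~(align - 1);
        if (aligned > next_offset && num_holes < MAX_HOLES) {
            holes[num_holes].start = next_offset;
            holes[num_holes].len   = aligned - next_offset;
            num_holes++;
        }
        next_offset = aligned + size;
        return aligned;
    }
}

With this scheme, declaring a char, then an int, then another char
places the second char into the three padding bytes left after the
first one instead of growing the section.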
Instruction Scheduling
----------------------
I'm still not up to speed with RISC optimization, so I'll speak in
terms of x86 optimization. I apologize for being so specific, but I
want to provide "real world" examples.
Two optimizations that apply on any platform are removing register
collisions (the CPU stalls when a register is read immediately after
being written, because instruction N+1 has to wait for instruction N
to finish) and preventing address generation stalls (the CPU stalls
when a register is referenced as an address immediately after being
set, because the CPU needs a certain amount of time between the two
instructions).
Generally speaking, the x86 has 2 pipelines. Generally speaking again,
all instructions can decode/execute in the first pipeline (U), but only
a subset of instructions can decode/execute in the second pipeline (V)
during the same clock cycle (or cycles). And generally speaking, a
register should not be referenced as a pointer until about 3 execution
cycles after it is set.
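
As a sketch, the pairing constraint just described could be boiled down
to something like the C fragment below. The Insn type and its fields
are invented here for illustration, and the rule shown covers only the
register-dependence part of the real pairing rules.

typedef unsigned RegSet;          /* one bit per register: EAX = 1<<0, ... */

typedef struct {
    RegSet reads;                 /* registers read by the instruction */
    RegSet writes;                /* registers written by the instruction */
    int    v_capable;             /* can decode/execute in the V pipe */
} Insn;

/* Can 'second' issue in the V pipe in the same cycle as 'first' (U pipe)?
   Roughly: it must be V capable, and it must not read or overwrite a
   register that 'first' writes. */
int can_pair(const Insn *first, const Insn *second)
{
    if (!second->v_capable)
        return 0;
    if (first->writes & (second->reads | second->writes))
        return 0;                 /* read-after-write or write-after-write */
    return 1;
}

A scheduler could apply a test like this when deciding whether moving
an instruction up would let it issue alongside its new predecessor.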
The way the instruction scheduling would work is that the status
of each register is kept and updated with each compiled instruction.
Let's say we have the following C/C++ snippet...
*MyPtr = MyFunction(Foo, Bar) + 5;
it would probably compile as follows (assuming that Foo and Bar have
been placed into registers)...
push ebx
push ecx
call MyFunction
add eax, 5
mov esi, [MyPtr]
mov [esi], eax
The code plus attributes would be as follows:
push ebx ; U-pipe, U/V capable, ebx read
push ecx ; V-pipe, U/V capable, ecx read
call MyFunction ; U-pipe, U/V capable but V-pipe preferred
add eax, 5 ; U-pipe, U/V capable, eax write
mov esi, [MyPtr] ; U-pipe, U/V capable, esi write
mov [esi], eax ; U-pipe, U/V capable, esi read, additional cycles needed: 2
More optimal code would be as follows...
push ebx
push ecx
mov esi, [MyPtr]
call MyFunction
add eax, 5
mov [esi], eax
With attributes, it would be as follows...
push ebx ; U-pipe, U/V capable, ebx read
push ecx ; V-pipe, U/V capable, ecx read
mov esi, [MyPtr] ; U-pipe, U/V capable, esi write
call MyFunction ; V-pipe, U/V capable but V-pipe preferred
add eax, 5 ; U-pipe, U/V capable, eax write
; V-pipe stall, V capable instruction needed
mov [esi], eax ; U-pipe, U/V capable, esi read, eax read
The way we got from the stalled code to the more optimal code was by
moving instructions up as far as possible while considering each
instruction's constraints. Yes, there is now a register stall in the optimal
code, but that's why I called it "more", not "most" :)
I know that this is grossly simplified, but it's the first step towards
a real solution.
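
A very rough sketch of that "move it up as far as it can go" pass is
below, in C, using the same register read/write masks as the pairing
fragment above. Memory, flags, and call side effects are deliberately
ignored here, which a real pass could not afford to do.

/* Same representation as in the pairing sketch above. */
typedef unsigned RegSet;
typedef struct { RegSet reads, writes; int v_capable; } Insn;

/* Two instructions must keep their order if either one writes a register
   that the other touches. (Memory and flag dependences are ignored.) */
int must_follow(const Insn *earlier, const Insn *later)
{
    return (earlier->writes & (later->reads   | later->writes)) ||
           (later->writes   & (earlier->reads | earlier->writes));
}

/* Hoist each instruction as far up as its register dependences allow. */
void schedule(Insn *code, int n)
{
    int i, j;
    for (i = 1; i < n; i++) {
        Insn moving = code[i];
        j = i;
        while (j > 0 && !must_follow(&code[j - 1], &moving)) {
            code[j] = code[j - 1];    /* slide the blocking instruction down */
            j--;
        }
        code[j] = moving;
    }
}

On the example above, a pass like this is what would lift the
mov esi, [MyPtr] above the call, but only if the dependence test also
knows that the load does not conflict with the call, which means
modelling memory and the calling convention as well as registers. A
real pass would also weigh pairing and the address generation distance
instead of always hoisting as far as possible.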
On a side note, when will compilers use some form of artificial
intelligence to optimize their assembly output? :)
See ya!
Orlando