|x86 Instruction Scheduling firstname.lastname@example.org (2003-06-03)|
|Re: x86 Instruction Scheduling email@example.com (Vladimir Makarov) (2003-06-05)|
|Re: x86 Instruction Scheduling firstname.lastname@example.org (2003-06-20)|
|From:||email@example.com (Paul Hsieh)|
|Date:||20 Jun 2003 00:16:29 -0400|
|Posted-Date:||20 Jun 2003 00:16:29 EDT|
Vladimir Makarov <firstname.lastname@example.org> wrote:
> c0d1f1ed wrote:
> > I have a project swShader (http://sw-shader.sourceforge.net) which
> > could benefit a lot from instruction scheduling. Basically it's a
> > compiler for DirectX 9 pixel shader instructions. Every shader
> > instruction gets substituted by MMX/SSE instructions, and register
> > allocation is handled automatically. Because of this substitution,
> > dependent instructions are close together, causing inefficiencies. The
> > ps 2.0 specifications do not have jumps or calls, so I often have more
> > than a hundred instructions that could be scheduled as one block.
> > So what kind of fast heuristics should 'always work'? I'm also looking
> > for more than one heuristic, or a very configurable one so the optimal
> > can be manually selected after benchmarking so it's more or less
> > system independent. Because of the out-of-order execution and many
> > unknown parameters I think it's hard to solve this perfectly, but what
> > is considered a good method for MMX/SSE instruction sheduling?
See: http://www.pobox.com/~qed/sched.zip . I wrote it up yesterday for
another purpose, but I am sure it will apply just as relevantly for
your problem. It assumes an idealized infinite-way super-scalar CPU,
however, the results should still be quite effective even on real
world practical processors, which are really only between 1-way and
3-way. The code should be fairly simple to understand, and can be
used experimentally as is since its just a text->text utility thus
To be really effective, however, it would make sense for you to build
some pattern recognizing kinds of optimizations, between cycle groups.
paddb mm2, mm1
paddb mm1, mm3
paddb mm0, mm1
paddb mm0, mm2
The last two instructions can (and should) be switched because "paddb"
is a commutative operation, but my algorithm does not know this, and
thus cannot perform the switch.
> It is still important to have insn scheduling for out-of-order
> speculative execution processors for big basic blocks and loops
> because of constraints of processor register renaming buffer and
> because constraints of number of branches which the processor goes
> trough for speculative execution.
Well, this depends on the processor. P3's and P4's have a "register
alloc" stage, which exposes a clear bottleneck, however, I believe
that the P4 has enough rename register to never be blocked at this
stage (I dont think that the P3 has enough to guarantee no stalls.)
The Athlon/Opteron has an "implicit" rename register allocation
strategy, so that its guaranteed to never run out.
The *real* benefit to software rescheduling is to more explicitely
expose the core bottlenecks in your code. Its often hard to see the
critical path without reordering the actual instructions anyway.
> Optimal solution of insn scheduling is NP-complete problem.
I'm sure it is, if you take into consideration the real resource
limitations (finite number of pipelines, specific ALU usage
constrains, scheduling limitations, etc.) Idealize the processor and
the problem is somewhat easier (certainly its no worse than n^3.)
> [...] Therefore
> heuristic algorithms are used. So usually two major used heuristics
> are critical path length or issuing as many insns as possible on a
> processor cycle (see Steven Muchnick's book).
The scheduler I wrote in the URL above only uses the "issue as many
insns as possible" heuristic. "Critical Path" is not that well
defined because it can change once the instructions are re-ordered or
other optimizations are performed.
Return to the
Search the comp.compilers archives again.