16 Apr 2005 11:13:01 -0400

Related articles |
---|

Generating optimal code from a DAG weitzel@simtec.mb.uni-siegen.de (Michael Weitzel) (2005-04-11) |

Re: Generating optimal code from a DAG touati@prism.uvsq.fr (TOUATI Sid) (2005-04-16) |

Re: Generating optimal code from a DAG tmk@netvision.net.il (2005-04-16) |

From: | tmk@netvision.net.il |

Newsgroups: | comp.compilers |

Date: | 16 Apr 2005 11:13:01 -0400 |

Organization: | http://groups.google.com |

References: | 05-04-028 |

Keywords: | optimize, code |

Michael Weitzel wrote:

*> Hi,*

*>*

*> I am currently working on a code generator for the Intel x86 FPU*

*> (x>=3) which generates code for single expressions which can be very*

*> large and highly redundant (in terms of common subexpressions). My*

*> code generator creates a DAG from the expression tree to eliminate*

*> common subexpressions. The DAG is split into trees and code is*

*> generated for common subtrees (using the algorithm from Bruno and*

*> Lasagne to obtain optimal stack-machine-code). After that, some*

*> peephole optimization replaces (only) pairs of opcodes by single*

*> opcodes using the Intel's opcodes.*

I don't understand why you split the DAG again. You can translate the

DAG directly into RISC-like intermediate language in one DFS pass. You

can replace pairs of opcodes (multiply-add?) even in the same pass.

Recall that DAGs don't naturally translate into stack-machine

languages.

*> I understand that the generated code is not optimal:*

*> - peephole optimization frees up to one level of the stack that*

could

*> be used to generate slightly better code for the sub-trees*

*> - results of sub-tree computations could be stored in the FPU's*

stack

*> (the Intel FPU can access every level of the stack at any time) -*

*> instead I always save them to local variables (in the function's*

*> stack frame).*

You can use register allocation of FP regs to avoid storing some

values. The "stack" of x87 is a nightmare for a compiler writer. Very

fortunately, you can change the register numbers of these regs with fp

exchange instruction, and these instruction is very cheap and can be

executed in parallel with other fp instructions, even on old Pentiums.

I think that on 387 (and 486?) this can be false.

On a very old processor you might want to do save/restore for the

values of the common subexpressions or keep these values on the stack,

and use the stack machine code.

*> - the FPU-stack is not working to capacity when switching from one*

*> sub-tree to the next and the computation of a sub-tree always*

starts

*> on an empty stack.*

I always thought that avoiding the last pop in case of a common

subexpression you'll get as a start point the stack that contains the

value of this expression. This value can be reused later.

*> - visiting the DAG nodes in topological order leaves some option to*

*> delay or prefer code generation for nodes on the same topological*

*> "level"*

*>*

*> In the texts I read, these problems (and probably others I forgot)*

are

*> usually summarized (or "slayed"?) in a sentence like "Code generation*

*> from DAG is NP-complete.". ;-)*

It's a common mistake. Optimal DAG scheduling cannot be NP complete,

because it is not a decision problem. Most of the texts say that

scheduling is NP hard.

*> I haven't seen any proof or exact description of the problem yet.*

What

*> makes code generation from DAG NP-complete? I have some ideas -- but*

how

*> does an optimal algorithm work?*

Well, an optimal algorithm for an NP-hard problem should meanwhile

(until P=NP is proved) test all the possibilities, e.g. all the

possible rearrangings of instructions that don't change the computation

result, and choose the fastest one.

In almost any compiler text you can find a couple of lines about

scheduling a basic block consisting of the DAG of data-dependent

instructions.

*> Are there better algorithms than splitting a DAG into trees? Perhaps*

I

*> can afford using some NP-complete procedure on parts of a DAG*

*> (heuristic)?!*

Optimal scheduling was done for small basic blocks. You usually don't

try to do it for very large blocks (100s or 1000s instructions).

Michael

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.