MIPS instruction scheduling

wendt@CS.ColoState.EDU (alan l wendt)
Tue, 12 Oct 1993 16:01:30 GMT

          From comp.compilers

Related articles
Mips instruction scheduling wendt@CS.ColoState.EDU (1993-10-07)
MIPS instruction scheduling wendt@CS.ColoState.EDU (1993-10-12)
| List of all articles for this month |

Newsgroups: comp.compilers
From: wendt@CS.ColoState.EDU (alan l wendt)
Keywords: assembler, summary, architecture
Organization: Colorado State University, Computer Science Department
Date: Tue, 12 Oct 1993 16:01:30 GMT

I received a number of replies to my query about instruction scheduling
for MIPS. A number of them wanted a summary of what I heard.
Unfortunately I didn't post that I would summarize, so some of the
repliers may have not have anticipated being posted, although no-one
specifically requested anonymity. Rather than going back to all the
repliers, I'll post the replies and their names separately. I hope that
is satisfactory with everyone, and mucho thanks to all who replied.




..............................................................


First of all, you can use the GNU assembler, which does not scheduling.


The big problem is getting it right, paritcularly since the MIPS R[23]000
do not have interlocks. The newer r6000 and r4000 do have load
interlocks, and so are more forgiving (though you have to watch out for
other interlock problems). We got bit early in OSF/1 (we use gas) in a
sequence that looks like:


mult $2,$3
mflo $4
mult $5,$6
mflo $7


In particular, the mflo needs two cycles afterwards where you must not
modify the hi/lo registers (and mult does modify them). I found when
doing the GCC port for the MIPS, that about 50% of the loads were followed
by an instruction that did not depend on the load, and that GCC's
instruction scheduling pushed this to 60%.


Here are some statistics from the last time I built stuff on the MIPS
using the -mstats switch and a perl reduction program (we have since moved
on to a 486 reference platform, so I don't touch the MIPS much anymore):


Statistics for the compiler:


              3320 functions
              3184 frame pointer(s) omitted, % = 95.9
                534 leaf function(s), % = 16.1
                136 function(s) called alloca, % = 4.1
                  18 function(s) called setjmp, % = 0.5
                  53.3 average stacksize (in bytes)
                    0 minimum stacksize (in bytes)
            10304 maximum stacksize (in bytes)
                    3.9 average # of saved GP regs
[531,409,447,445,311,244,184,153,134,105,357]
                    0 minimum # of saved GP regs
                  10 maximum # of saved GP regs
                    0.0 average # of saved FP regs [3311,7,2]
                    0 minimum # of saved FP regs
                    2 maximum # of saved FP regs
            90246 load delay slot(s)
            53691 load delay slot(s) filled, % filled = 59.5
            97745 jump delay slot(s)
            59956 jump delay slot(s) filled, % filled = 61.3
            32653 2 word data reference(s)
              2685 3 word data reference(s)


Statistics for the kernel:


              2900 functions
              2900 frame pointer(s) omitted, % = 100.0
                526 leaf function(s), % = 18.1
                    0 function(s) called alloca, % = 0.0
                    0 function(s) called setjmp, % = 0.0
                  45.5 average stacksize (in bytes)
                    0 minimum stacksize (in bytes)
              1824 maximum stacksize (in bytes)
                    3.4 average # of saved GP regs [526,479,309,345,301,265,177,141,108,69,180]
                    0 minimum # of saved GP regs
                  10 maximum # of saved GP regs
                    0.0 average # of saved FP regs [2900]
                    0 minimum # of saved FP regs
                    0 maximum # of saved FP regs
            46347 load delay slot(s)
            29210 load delay slot(s) filled, % filled = 63.0
            48541 jump delay slot(s)
            32749 jump delay slot(s) filled, % filled = 67.5
            10941 2 word data reference(s)
                465 3 word data reference(s)


Statistics for Perl:


                631 functions
                629 frame pointer(s) omitted, % = 99.7
                  62 leaf function(s), % = 9.8
                    2 function(s) called alloca, % = 0.3
                  13 function(s) called setjmp, % = 2.1
                  61.2 average stacksize (in bytes)
                    0 minimum stacksize (in bytes)
              2376 maximum stacksize (in bytes)
                    4.6 average # of saved GP regs [62,37,82,81,92,49,45,54,34,22,73]
                    0 minimum # of saved GP regs
                  10 maximum # of saved GP regs
                    0.0 average # of saved FP regs [611,18,2]
                    0 minimum # of saved FP regs
                    2 maximum # of saved FP regs
            28285 load delay slot(s)
            16741 load delay slot(s) filled, % filled = 59.2
            30134 jump delay slot(s)
            16216 jump delay slot(s) filled, % filled = 53.8
            12650 2 word data reference(s)
                218 3 word data reference(s)


..............................................................




About two years ago a student at our institute has implemented an
experimental scheduler for MIPS on DECStations. Basically this worked
fine, but the drawbacks were:


- It is not so trivial to parse the assembly instructions.
- Your scheduler must run after the register allocator. Building a prepass
    scheduler or combined scheduler / register allocator is difficult.
- You cannot split certain macro instructions of the assembler. I would have
    to look up examples, but I remember that the code becomes significantly
    slower from that.




..............................................................


We have done it in different ways. There have been no pitfalls. The
advantage of doing instruction scheduling before the assembler is that you
have better alias information (but only if you collect alias information
in your compiler), the disadvantage is that you cannot schedule well, if
you use macro instructions (instructions that are more than one machine
instruction, e.g. la (load address) of a variable with the size greater
than 8 byte).


..............................................................


We've experienced the same problem here, also on a DECstation. Our
solution was to use GNU as, which does no assembly time re-ordering.


Good luck,


..............................................................




The real problem with scheduling before the assembler is the psuedo ops
which the assembler expands. For example la, and some of the branches.
The la (load address) expands into a lhi and and ori, which can't be
scheduled until after the expansion. Which compiler are you using? I
would suggest that you use gcc since you can set it up as if it were using
gas which implies that there is no postpass scheduling. With gcc you can
also control output of all the assembler directives. Besides that you can
modify the code in the compiler to output lhi and ori instead of la.


..............................................................




Look at the Gnu C compiler output - they do their own instruction
scheduling. You might send mail to Michael Meissner
<meissner@pasta.osf.org> asking for information, since he implemented
this.




..............................................................


    Of course, if what you want to do is list scheduling, you have
a scheduler there, and in fact newer versions of rocket allow you
to specify a ddg and get a scheduled list of instructions.
[Rocket is Phil Sweany's optimizing code generator and scheduler].


..............................................................




I have disabled and scheduled by hand and had no problems. The GCC MIPS
port switches frequently between noreorder and reorder, and seems to do
well. I forget who wrote the port, but I think his e-mail address is in
the machine description.


As far as I know, the only pitfalls are that you have to do a better job
of scheduling to make it worth your while :-)




..............................................................




The only "pitfall" in scheduling the code yourself is handling the various
"hazards" in the MIPS architecture. i.e., in MIPS 1, you *must* have an
instruction following a load that does not use the result of the load.
That's the most common one, but you also need to deal with the hazards
involving the control register. The Kane book has a good section on it.
Also, the MIPS compiler will emit warnings about hazards when you use the
.noreorder directive.


Good Luck. We'd certainly like to hear if you can do a better job.




..............................................................




Thanks to:


Todd Austin
Dirk Grunwald
Andi Krall
Michael Meissner
Thomas Mueller
D. Pardo
James Rose
Mark Streich
Phil Sweany
--
Alan Wendt
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.