Re: Nop insertion

Nils <n.pipenbrinck@cubic.org>
Wed, 28 Oct 2009 17:10:30 +0100

          From comp.compilers

Related articles
Nop insertion shreyas76@gmail.com (shrey) (2009-10-27)
Re: Nop insertion n.pipenbrinck@cubic.org (Nils) (2009-10-28)
Re: Nop insertion cr88192@hotmail.com (BGB / cr88192) (2009-10-28)
Re: Nop insertion walter@bytecraft.com (Walter Banks) (2009-10-28)
Re: Nop insertion cfc@shell01.TheWorld.com (Chris F Clark) (2009-10-28)
Re: Nop insertion gneuner2@comcast.net (George Neuner) (2009-10-29)
Re: Nop insertion pertti.kellomaki@tut.fi (Pertti Kellomaki) (2009-10-29)

From: Nils <n.pipenbrinck@cubic.org>
Newsgroups: comp.compilers
Date: Wed, 28 Oct 2009 17:10:30 +0100
Organization: Compilers Central
References: 09-10-032
Keywords: architecture
Posted-Date: 30 Oct 2009 11:52:06 EDT

shrey wrote:
> hi
> This is both an architecture and compiler question.
>
> Are there inorder architectures that need precise number of nops
> inserted between operations.


Yes. One example is the Texas Instruments TMS320C64x+ DSP (as well as
most other DSPs of that family), which may need extra NOPs. It is
in-order, but it is a VLIW architecture.


Ignoring the VLIW aspect, the pipeline works roughly like this:


You can issue one instruction per cycle. However, an instruction may
take more than one cycle to generate its result and write it into the
destination register. If you access the destination register of such a
multicycle instruction before the result has been written to the
register file, the CPU will not stall. You will simply read whatever is
in that register at that time, i.e. the old value.




Here's an example in simplified assembler. Assume MPY has a result
latency of one cycle and ADD has none. Also assume that you want to write
code that multiplies two registers and adds the product to another
register.




This code won't work because the result of the MPY hasn't arrived by the
time the ADD gets executed:


MPY A0, A1, A2 ; A0 = A1 * A2
ADD A3, A3, A0 ; A3 += A0
... ; result of 1st MPY arrives - too late.


This one works:


MPY A0, A1, A2 ; A0 = A1 * A2
NOP ; wait a moment..
ADD A3, A3, A0 ; result of 1st MPY arrives, A3 += A0




This seems like a waste, but it's actually a feature. You can start one
multiply per cycle even if a prior MPY is still executing. So the
following code is valid:


MPY A0, A1, A2 ; A0 = A1 * A2
MPY A0, A5, A6 ; A0 = A5 * A6
ADD A3, A3, A0 ; result of 1st. MPY arrives, A3 += A0
ADD A3, A3, A0 ; result of 2nd. MPY arrives, A3 += A0


This would calculate A3 = A3 + A1*A2 + A5*A6
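
The three assembler snippets above can be checked with a small simulation. This is only a sketch of the simplified model described in this post, not of the real C64x+ pipeline: the tuple encoding of instructions, the function name simulate and the DELAY_SLOTS table are made up for illustration, and the latencies (MPY one cycle, ADD none) are the assumed ones from the example.

```python
# Toy model of an in-order machine without interlocks: a result lands in the
# register file a fixed number of cycles after issue, and early readers
# silently see the old register contents.
DELAY_SLOTS = {"MPY": 1, "ADD": 0}  # assumed latencies from the example above


def simulate(program, initial_regs):
    """Run (op, dest, src1, src2) / ("NOP",) tuples, one issue per cycle."""
    regs = dict(initial_regs)
    in_flight = []  # (cycle at which the value becomes readable, dest, value)
    for cycle, instr in enumerate(program):
        # Commit every result whose latency has elapsed, in issue order.
        still_waiting = []
        for ready, dest, value in in_flight:
            if ready <= cycle:
                regs[dest] = value
            else:
                still_waiting.append((ready, dest, value))
        in_flight = still_waiting

        op = instr[0]
        if op == "NOP":
            continue
        _, dest, src1, src2 = instr
        a, b = regs[src1], regs[src2]  # reads never stall
        value = a * b if op == "MPY" else a + b
        in_flight.append((cycle + 1 + DELAY_SLOTS[op], dest, value))

    # Let the remaining results drain after the last instruction.
    for _, dest, value in sorted(in_flight, key=lambda item: item[0]):
        regs[dest] = value
    return regs


START = {"A0": 0, "A1": 2, "A2": 3, "A3": 10, "A5": 4, "A6": 5}

broken = [("MPY", "A0", "A1", "A2"), ("ADD", "A3", "A3", "A0")]
fixed = [("MPY", "A0", "A1", "A2"), ("NOP",), ("ADD", "A3", "A3", "A0")]
pipelined = [
    ("MPY", "A0", "A1", "A2"),
    ("MPY", "A0", "A5", "A6"),
    ("ADD", "A3", "A3", "A0"),
    ("ADD", "A3", "A3", "A0"),
]
```

With these starting values the broken sequence leaves A3 at 10, because the ADD reads the stale A0. The sequence with the NOP gives 16, and the pipelined one gives 36 = 10 + 2*3 + 4*5.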




What does the compiler do?


It schedules the instructions and makes sure that no instruction
executes before all of its dependencies have arrived in their destination
registers. If no useful instruction can be placed into such a slot, it
emits a NOP. That usually happens before and after loops. Loops
themselves can often be written with minimal NOPs by using
modulo scheduling or loop unrolling.
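
The NOP-filling part of that job can be sketched in a few lines. This is a deliberately naive pass, not what the TI compiler actually does: it keeps the instructions in their given order, assumes the input has ordinary sequential semantics and contains no NOPs yet, and only pads with NOPs until every source register is readable. It does no reordering, no modulo scheduling, and ignores the case of two in-flight writes to the same register. The name insert_nops and the tuple encoding are hypothetical.

```python
def insert_nops(program, delay_slots):
    """Pad a straight-line list of (op, dest, src, ...) tuples with NOPs.

    delay_slots maps an opcode to the number of extra cycles before its
    result can be read (0 means the next instruction already sees it).
    """
    scheduled = []
    ready_at = {}  # register -> first issue cycle that sees its newest value
    for instr in program:
        op, dest, *sources = instr
        earliest = max((ready_at.get(reg, 0) for reg in sources), default=0)
        # The list index equals the issue cycle, so pad until we reach it.
        while len(scheduled) < earliest:
            scheduled.append(("NOP",))
        ready_at[dest] = len(scheduled) + 1 + delay_slots[op]
        scheduled.append(instr)
    return scheduled


LATENCY = {"MPY": 1, "ADD": 0}  # assumed latencies from the example above

padded = insert_nops(
    [("MPY", "A0", "A1", "A2"), ("ADD", "A3", "A3", "A0")],
    LATENCY,
)
```

For the multiply-accumulate example this reproduces the working sequence MPY, NOP, ADD. A real scheduler would first try to move an independent instruction into that slot and fall back to a NOP only when none is available.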


Fun fact: The C64x+ DSP even has a multi-cycle NOP and a branch with
built-in NOP to save code space, because some instructions can take as
long as 6 cycles before the result appears in the destination register.
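
To illustrate why a multi-cycle NOP saves code space, here is a small sketch that folds runs of single NOPs into one counted NOP. The tuple encoding and the function name merge_nops are made up, and the default limit of 9 cycles per NOP is an assumption for the sketch; check the instruction set manual for the real limit.

```python
def merge_nops(program, max_count=9):
    """Fold consecutive NOPs into ("NOP", n) entries of at most max_count."""
    merged = []
    for instr in program:
        if instr[0] != "NOP":
            merged.append(instr)
            continue
        count = instr[1] if len(instr) > 1 else 1  # ("NOP",) means one cycle
        previous = merged[-1] if merged else None
        if previous and previous[0] == "NOP" and previous[1] + count <= max_count:
            merged[-1] = ("NOP", previous[1] + count)
        else:
            merged.append(("NOP", count))
    return merged


compact = merge_nops(
    [("MPY", "A0", "A1", "A2")] + [("NOP",)] * 4 + [("ADD", "A3", "A3", "A0")]
)
```

Here four one-cycle NOPs collapse into a single ("NOP", 4) entry, so the program occupies three instruction slots instead of six while taking the same number of cycles.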


Hope it helps,


Nils

