Re: Nop insertion

George Neuner <>
Thu, 29 Oct 2009 00:48:42 -0400

          From comp.compilers

Related articles
Nop insertion (shrey) (2009-10-27)
Re: Nop insertion (Nils) (2009-10-28)
Re: Nop insertion (BGB / cr88192) (2009-10-28)
Re: Nop insertion (Walter Banks) (2009-10-28)
Re: Nop insertion (Chris F Clark) (2009-10-28)
Re: Nop insertion (George Neuner) (2009-10-29)
Re: Nop insertion (Pertti Kellomaki) (2009-10-29)
| List of all articles for this month |

From: George Neuner <>
Newsgroups: comp.compilers
Date: Thu, 29 Oct 2009 00:48:42 -0400
Organization: A noiseless patient Spider
References: 09-10-032
Keywords: architecture
Posted-Date: 30 Oct 2009 11:54:41 EDT

On Tue, 27 Oct 2009 20:06:55 -0700 (PDT), shrey <>

>Are there inorder architectures that need precise number of nops
>inserted between operations.

I've never seen NOPs used so deliberately outside of device driver
code. Even there it is far more typical to spin in a loop than to
insert a particular number of NOPs unless the number of delay cycles
needed is quite small. Some ISAs don't even have NOP equivalents.

As John mentioned, there have been non-interlocked architectures, but
their use was brief and they are rare nowadays ... mainly you'd be
looking at very early RISC or VLIW designs, perhaps some early super
computers also. I'm not aware of any non-interlocked CISC designs.

AFAIK, the only remaining typical use of NOP insertion is for filling
an unused branch delay slot. But even there, most processors now can
stop execution of a delay slot instruction if it is path dependent and
the branch goes the wrong way.

>If so, how does a compiler guarantee that? I am especially thinking
>of producers and consumers in faraway basic blocks.

I'm not sure what you mean by "producers" and "consumers" ... I'm
referring here to register loads (reads from memory) and subsequent
dependent instructions that require the loaded value.

On a non-interlocked CPU, a load has to be scheduled far enough in
advance to guarantee completion before any instruction that uses the
value is executed. That requires knowledge of the memory system
because the load could be served from the cache or have to go all the
way to main memory. Worst case could be many 100's of cycles and
virtual memory is effectively impossible as no scheduling algorithm
can deal with the tens of thousands of cycles needed for disk paging.

After the initial non-interlocked designs proved too difficult to
program effectively, there was a brief time in which the use of load
barrier instructions were tried. After a load was issued, the first
instruction to use the loaded value would be proceeded by a barrier
instruction that tested whether the load had completed. The barrier
variously threw some kind of exception or stalled until the load

Barrier instructions worked, but CPU designers realized quickly that
internal pipeline interlocks made more sense and would be required for
increasing unit parallelism inside the CPU. Virtually all designs now
use interlocks to prevent dependent instructions from executing until
required loads complete.

(This has nothing to do with so-called memory "fence" instructions
available on many modern CPUs. To maximize performance, many CPUs can
execute instructions in a different order than they appear in the
input stream. The fence instruction guarantees that data writes
proceeding the fence in the input stream have all completed before
continuing, so fences can be used to deliberately sequence writes,
e.g., to deal with memory mapped device registers which must be
accessed in a particular order.)

> Any pointers to any existing work?

I'm not certain what you're really looking for ... I guess I would
look more deeply into instruction scheduling and see if what you find
answers your questions. Interlocks prevent incorrect behavior (using
the wrong value), but they stall the pipeline, waste time and may idle
processing units that could be doing productive work, so it still
makes sense to try to schedule loads such that they usually complete
before their data is needed.

The following doesn't pertain directly to your question about NOPs,
but is related to data dependency scheduling and looking into some of
the issues might help you.

Some work on optimal code and data "staging" was done early on. Before
there were automatic cache controllers, many early super and mainframe
computer designs had tiered memories ... a large main memory of cheap,
slow DRAM and one or more staging memories using faster DRAM or SRAM.
In many cases it was required to use these staging areas for
processing or for I/O because the particular DMA unit involved had a
limited address range. It was up to the programmer to make sure the
data was in the right place at the right time and this drove quite a
bit of theoretical work on trace scheduling.

(This was still the case for disk I/O on early PCs - the first PC DMA
controllers had a 16-bit address register so disk buffers had to be in
the first 64KB of memory. Prior to general 32-bit DMA, there were
also designs that used 20-bit controllers which required buffers to be
in the first 1MB.)

Many modern DSPs still use a 2 tiered memory architecture - they have
an internal fast SRAM block and then can also access external memory
which may run at a different speed. Many DSPs have a memory->memory
DMA channel to allow staging in parallel with processing, but I'm not
aware of any DSP compiler that has automatic staging support ...
AFAIHS, staging still has to be done explicitly by the programmer.

You might also look into research on data "tiling" which is
(re)organizing data for enhanced locality. A lot of scientific code
uses huge data arrays which don't play well with the LRU algorithms of
virtual memory. Tiling looks into rearranging data and algorithms to
minimize VMM paging. Again, I'm not aware of any compiler that can do
this automatically.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.