|Indirect threading other than GCC? email@example.com (Heng Yuan) (2001-12-08)|
|Re: Indirect threading other than GCC? firstname.lastname@example.org (2001-12-09)|
|Re: Indirect threading other than GCC? email@example.com (Nicholas Geovanis) (2001-12-11)|
|Re: Indirect threading other than GCC? firstname.lastname@example.org (2001-12-15)|
|From:||email@example.com (Anton Ertl)|
|Date:||15 Dec 2001 00:34:39 -0500|
|Organization:||Institut fuer Computersprachen, Technische Universitaet Wien|
|References:||01-12-034 01-12-037 01-12-056|
|Posted-Date:||15 Dec 2001 00:34:39 EST|
Nicholas Geovanis <firstname.lastname@example.org> writes:
>Tell me whether I understand the Pentium indirect-threading performance
>penalty which you documented in 1995: An I-cache miss causes cache
>invalidation for *both* I- and D-caches,
The I-cache line is not present, so it cannot be invalidated.

A simple (but not necessarily correct, see below) model is: The line
is not in the I-cache, so it only invalidates the corresponding
D-cache line (if that line is present). So on the next D-cache access
there is a D-cache miss, and the access invalidates the corresponding
I-cache line, leading to an I-cache miss later on. Repeat.
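
To make the ping-pong in this simple model concrete, here is a minimal Python sketch. It is a toy simulation of the model as stated above, not a description of what the Pentium hardware actually does; the function name `count_refills` and the 'I'/'D' encoding of accesses are made up for illustration. The only assumption is the one from the model: a given cache-line-sized region is held by at most one of the two L1 caches at a time.

```python
def count_refills(accesses):
    """Count L1 refills for one cache-line-sized region under the simple model.

    accesses: a string of 'I' (instruction fetch) and 'D' (data access),
    all touching the same cache-line-sized region.
    Assumption: the region is held by at most one of the two L1 caches.
    """
    holder = None  # which cache currently holds the line: 'I', 'D', or None
    refills = 0
    for kind in accesses:
        if kind not in ("I", "D"):
            raise ValueError("unknown access kind: %r" % kind)
        if holder != kind:
            # Miss: refill this cache; the copy in the other cache is invalidated.
            holder = kind
            refills += 1
    return refills


alternating = "DI" * 4  # data access, then instruction fetch, four times
code_only = "IIII"      # the same number of fetches with no data in the line

print(count_refills(alternating))  # 8: every access misses
print(count_refills(code_only))    # 1: only the first fetch misses
```

In this toy model each data/instruction pair costs two refills, which matches the upper bound suggested by the timings discussed below.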
This happens if you do alternating data and instruction accesses to
the same cache-line-sized region. E.g., consider primitives in a
traditional-style indirect-threaded Forth: the code field is directly
in front of the code, and is accessed directly before executing the
code; however, the code can be separated from the data without too
much hassle. With direct threaded code, you have the dual of the
situation: code in the code field adjacent to data of non-primitives
(not so easy to avoid, but still possible, see
> which means that with a "normal"
>(unoptimised) indirect-thread implementation each cache would be
>invalidated, flushed and filled twice for *each and every* thread "jump".
>Do I have that right?
I do not know exactly what happens, but the timings indicate that
there are at most two (not four) L1 cache refills happening for each
pair of data and instruction accesses; in some cases, there is only
time for less than two normal refills. Nowadays I might be able to
use performance counters to find out more if I had access to a Pentium
with the appropriate software.
> This was only on the original Pentium?
The fundamental problem is there in all processors with split caches.
Younger architectures fix this by making it software's job to keep the
I-cache coherent with the D-cache. But for older architectures with a
strong compatibility requirement all implementations with split caches
have the problem in some form or other.

For the IA32 architecture the K5 and K6-* have the problem in a form
similar to the Pentium (and Pentium MMX). The P6 (PentiumPro and
later), K7 (Athlon and later), and Pentium 4 cores are a little more
sophisticated: on these chips an I-cache/trace-cache invalidation only
happens when there is a write to the D-cache line.
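
The difference can be illustrated with another toy Python sketch. It is an assumption-laden model, not a description of the real P6, K7 or Pentium 4 hardware: the function name and the 'I'/'R'/'W' encoding are hypothetical, the line is allowed to sit in both caches at once, a write miss is assumed to allocate the line in the D-cache, and only a data write invalidates the I-cache copy.

```python
def count_refills_write_invalidates(accesses):
    """Count refills for one line when only data writes invalidate the I-cache.

    accesses: a string of 'I' (instruction fetch), 'R' (data read) and
    'W' (data write), all touching the same cache-line-sized region.
    Assumptions: the line may be in both caches at once, and a write miss
    allocates the line in the D-cache.
    """
    in_icache = False
    in_dcache = False
    refills = 0
    for kind in accesses:
        if kind == "I":
            if not in_icache:
                in_icache = True
                refills += 1
        elif kind in ("R", "W"):
            if not in_dcache:
                in_dcache = True
                refills += 1
            if kind == "W":
                # Only a write throws the line out of the I-cache.
                in_icache = False
        else:
            raise ValueError("unknown access kind: %r" % kind)
    return refills


print(count_refills_write_invalidates("RI" * 4))  # 2: one refill per cache
print(count_refills_write_invalidates("WI" * 4))  # 5: every fetch misses
```

Under these assumptions, reading a code field next to the code costs nothing after the first two refills, while writing to data in the same line as code still forces an I-cache refill on every following fetch.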
M. Anton Ertl