From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.compilers
Date: 15 Dec 2001 00:34:39 -0500
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
References: 01-12-034 01-12-037 01-12-056
Keywords: architecture, performance
Posted-Date: 15 Dec 2001 00:34:39 EST
Nicholas Geovanis <nickgeo@merle.acns.nwu.edu> writes:
>Tell me whether I understand the Pentium indirect-threading performance
>penalty which you documented in 1995: An I-cache miss causes cache
>invalidation for *both* I- and D-caches,
The I-cache line is not present, so it cannot be invalidated.
A simple (but not necessarily correct, see below) model is: The line
is not in the I-cache, so the I-cache miss only invalidates the
corresponding D-cache line (if that line is present). So on the next
D-cache access there is a D-cache miss, and that access invalidates the
corresponding I-cache line, leading to an I-cache miss later on. Repeat.
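To make the simple model concrete, here is a toy C sketch that tracks a single cache-line-sized region and counts L1 refills. The names (line_state, dread, ifetch, pingpong_refills) are invented for illustration, and the rule "a miss in one L1 cache evicts the line from the other" is the assumption of the simple model, not a description of what the hardware is known to do.

```c
#include <stdbool.h>

/* Toy model of ONE cache-line-sized region that holds both code and data.
   Assumed rule (the "simple model"): a miss in one L1 cache invalidates
   the same line in the other L1 cache. */
struct line_state {
    bool in_icache;
    bool in_dcache;
    unsigned refills;            /* L1 refills (misses) counted so far */
};

/* Data access to the line, e.g. reading the code field. */
static void dread(struct line_state *s)
{
    if (!s->in_dcache) {
        s->refills++;
        s->in_dcache = true;
        s->in_icache = false;    /* modelled cross-invalidation */
    }
}

/* Instruction fetch from the same line, e.g. executing the primitive. */
static void ifetch(struct line_state *s)
{
    if (!s->in_icache) {
        s->refills++;
        s->in_icache = true;
        s->in_dcache = false;    /* modelled cross-invalidation */
    }
}

/* Alternate one data access and one instruction fetch, `pairs` times,
   and return the number of refills the model predicts. */
unsigned pingpong_refills(unsigned pairs)
{
    struct line_state s = { false, false, 0 };
    for (unsigned i = 0; i < pairs; i++) {
        dread(&s);
        ifetch(&s);
    }
    return s.refills;
}
```

Under this model every access in the alternating pattern misses, so pingpong_refills(n) returns 2*n: two refills per pair of data and instruction accesses.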
This happens if you do alternating data and instruction accesses to
the same cache-line-sized region. E.g., consider primitives in a
traditional-style indirect-threaded Forth: the code field is directly
in front of the code, and is accessed directly before executing the
code; however, the code can be separated from the data without too
much hassle. With direct threaded code, you have the dual of the
situation: code in the code field adjacent to data of non-primitives
(not so easy to avoid, but still possible, see
http://www.complang.tuwien.ac.at/papers/ertl01.ps.gz).
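For readers unfamiliar with the code field, here is a minimal sketch of indirect-threaded dispatch in portable C. It is not taken from any real Forth system: the names (struct word, do_con, do_add, do_halt, run_thread, demo) are hypothetical, and it uses function-pointer calls where a real implementation would jump. In C the primitives' machine code lives wherever the compiler puts it, so this layout does not reproduce the problem; the problem arises when, as in a traditional assembly-written Forth, the machine code sits immediately after the code field in memory.

```c
#include <stddef.h>

struct word;
typedef void (*code_fn)(const struct word *self);

/* A word: code field first, followed by its parameter field. */
struct word {
    code_fn code;    /* code field: loaded as DATA during dispatch */
    long param;      /* parameter field (used by do_con only) */
};

static long stack[16];
static int sp;           /* number of items currently on the stack */
static int running;

static void do_con(const struct word *w)  { stack[sp++] = w->param; }
static void do_add(const struct word *w)  { (void)w; sp--; stack[sp - 1] += stack[sp]; }
static void do_halt(const struct word *w) { (void)w; running = 0; }

/* Inner interpreter: each cell of the thread points to a word, and the
   word's code field points to the code to run (two levels of indirection). */
long run_thread(const struct word *const *thread)
{
    const struct word *const *ip = thread;
    sp = 0;
    running = 1;
    while (running) {
        const struct word *w = *ip++;   /* fetch the next word address */
        w->code(w);                     /* read the code field, then execute */
    }
    return sp > 0 ? stack[sp - 1] : 0;
}

/* Example thread computing 2 + 3. */
long demo(void)
{
    static const struct word two   = { do_con, 2 };
    static const struct word three = { do_con, 3 };
    static const struct word add   = { do_add, 0 };
    static const struct word halt  = { do_halt, 0 };
    static const struct word *const thread[] = { &two, &three, &add, &halt };
    return run_thread(thread);
}
```

The line `w->code(w)` is where the data access (loading the code field) is immediately followed by instruction fetches from the code it points to.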
> which means that with a "normal"
>(unoptimised) indirect-thread implementation each cache would be
>invalidated, flushed and filled twice for *each and every* thread "jump".
>Do I have that right?
I do not know exactly what happens, but the timings indicate that
there are at most two (not four) L1 cache refills happening for each
pair of data and instruction accesses; in some cases, there is only
time for less than two normal refills. Nowadays I might be able to
use performance counters to find out more if I had access to a Pentium
with the appropriate software.
> This was only on the original Pentium?
The fundamental problem is there in all processors with split caches.
Newer architectures fix this by making it software's job to keep the
I-cache coherent with the D-cache. But for older architectures with a
strong compatibility requirement, all implementations with split caches
have the problem in some form or other.
For the IA32 architecture the K5 and K6-* have the problem in a form
similar to the Pentium (and Pentium MMX). The P6 (PentiumPro and
later), K7 (Athlon and later), and Pentium 4 cores are a little more
sophisticated: on these chips an I-cache/trace-cache invalidation only
happens when there is a write to the D-cache line.
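The write-only rule can be sketched in the same toy style. The function below (a hypothetical name, not a model of any documented microarchitecture) tracks only whether one line is present in the I-cache and assumes that nothing except a store to that line removes it.

```c
#include <stdbool.h>

/* Toy model of the write-only invalidation rule for one line.
   Each pair is: one data access to the line, then one instruction fetch
   from it. Returns the number of I-cache refills the model predicts. */
unsigned icache_refills_write_only(unsigned pairs, bool data_access_is_write)
{
    bool in_icache = false;
    unsigned refills = 0;
    for (unsigned i = 0; i < pairs; i++) {
        if (data_access_is_write)
            in_icache = false;   /* a store to the line invalidates the I-cache copy */
        if (!in_icache) {        /* instruction fetch from the same line */
            refills++;
            in_icache = true;
        }
    }
    return refills;
}
```

With read-only data accesses (such as fetching a code field) the model predicts a single cold miss; with a store in every pair it predicts one refill per pair.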
- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/home.html