Re: Indirect threading other than GCC?

anton@mips.complang.tuwien.ac.at (Anton Ertl)
15 Dec 2001 00:34:39 -0500

          From comp.compilers

Related articles
Indirect threading other than GCC? heng@ag.arizona.edu (Heng Yuan) (2001-12-08)
Re: Indirect threading other than GCC? anton@mips.complang.tuwien.ac.at (2001-12-09)
Re: Indirect threading other than GCC? nickgeo@merle.acns.nwu.edu (Nicholas Geovanis) (2001-12-11)
Re: Indirect threading other than GCC? anton@mips.complang.tuwien.ac.at (2001-12-15)
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.compilers
Date: 15 Dec 2001 00:34:39 -0500
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
References: 01-12-034 01-12-037 01-12-056
Keywords: architecture, performance
Posted-Date: 15 Dec 2001 00:34:39 EST

Nicholas Geovanis <nickgeo@merle.acns.nwu.edu> writes:
>Tell me whether I understand the Pentium indirect-threading performance
>penalty which you documented in 1995: An I-cache miss causes cache
>invalidation for *both* I- and D-caches,


The I-cache line is not present, so it cannot be invalidated.


A simple (but not necessarily correct, see below) model is: The line
is not in the I-cache, so the instruction fetch only invalidates the
corresponding D-cache line (if that line is present). So on the next
D-cache access there is a D-cache miss, and that access invalidates the
corresponding I-cache line, leading to an I-cache miss later on. Repeat.
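
The simple model above can be sketched as a small simulation. This only
illustrates that model and makes no claim about what the Pentium really
does: the function name count_refills, the 32-byte line size, and the
unbounded caches are assumptions made for the sketch.

```python
def count_refills(accesses, line_size=32):
    """Count L1 refills under the simple model: a line lives in at most
    one of the two caches, so an access through one cache evicts the
    line from the other.

    accesses: iterable of (kind, address), where kind is "I"
    (instruction fetch) or "D" (data access). Capacity is ignored.
    """
    icache, dcache = set(), set()
    refills = 0
    for kind, addr in accesses:
        line = addr // line_size
        own, other = (icache, dcache) if kind == "I" else (dcache, icache)
        if line not in own:      # miss: the line must be (re)filled
            refills += 1
            own.add(line)
        other.discard(line)      # invalidate the copy in the other cache
    return refills


# Data and code in the same line: every access misses.
same_line = [("D", 0x100), ("I", 0x104)] * 10
# Data and code in different lines: only the two cold misses.
split_lines = [("D", 0x100), ("I", 0x120)] * 10
print(count_refills(same_line), count_refills(split_lines))
```

Under these assumptions the first trace needs 20 refills (two per pair of
data and instruction accesses), the second only 2.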


This happens if you do alternating data and instruction accesses to
the same cache-line-sized region. E.g., consider primitives in a
traditional-style indirect-threaded Forth: the code field is directly
in front of the code, and is accessed directly before executing the
code; however, the code can be separated from the data without too
much hassle. With direct threaded code, you have the dual of the
situation: code in the code field adjacent to data of non-primitives
(not so easy to avoid, but still possible; see
http://www.complang.tuwien.ac.at/papers/ertl01.ps.gz).
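
To make the layout issue concrete, here is a rough sketch that checks
whether a primitive's code field (read as data) falls into the same cache
line as the start of its code. The addresses, the word names, the 32-byte
line size, and the 4-byte cell are all made up for illustration; a real
check would also have to consider code that spans several lines.

```python
LINE_SIZE = 32   # assumed L1 line size in bytes
CELL = 4         # assumed cell size in bytes (32-bit addresses)


def line_of(addr, line_size=LINE_SIZE):
    """Index of the cache line that contains addr."""
    return addr // line_size


def conflicting_primitives(primitives, line_size=LINE_SIZE):
    """Return the names of primitives whose code field shares a cache
    line with the first instruction of their code.

    primitives: dict mapping name -> (code_field_addr, code_addr).
    """
    return [name for name, (cfa, code) in primitives.items()
            if line_of(cfa, line_size) == line_of(code, line_size)]


# Traditional layout: the code starts right after the code field.
traditional = {"dup": (0x1000, 0x1000 + CELL),
               "swap": (0x1010, 0x1010 + CELL)}
# Separated layout: code fields in one region, code in another.
separated = {"dup": (0x1000, 0x8000), "swap": (0x1004, 0x8010)}
print(conflicting_primitives(traditional), conflicting_primitives(separated))
```

With these made-up addresses both primitives conflict in the traditional
layout and none do in the separated one.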


> which means that with a "normal"
>(unoptimised) indirect-thread implementation each cache would be
>invalidated, flushed and filled twice for *each and every* thread "jump".
>Do I have that right?


I do not know exactly what happens, but the timings indicate that
there are at most two (not four) L1 cache refills happening for each
pair of data and instruction accesses; in some cases, the time taken is
less than two normal refills would need. Nowadays I might be able to
use performance counters to find out more if I had access to a Pentium
with the appropriate software.


> This was only on the original Pentium?


The fundamental problem is there in all processors with split caches.
Newer architectures fix this by making it software's job to keep the
I-cache coherent with the D-cache. But for older architectures with a
strong compatibility requirement, all implementations with split caches
have the problem in some form or other.


For the IA32 architecture the K5 and K6-* have the problem in a form
similar to the Pentium (and Pentium MMX). The P6 (PentiumPro and
later), K7 (Athlon and later), and Pentium 4 cores are a little more
sophisticated: on these chips an I-cache/trace-cache invalidation only
happens when there is a write to the D-cache line.
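
The difference can be illustrated with a variant of the simple cache
model in which only a data write invalidates the matching I-cache line.
This is a sketch under stated assumptions, not a description of the
actual hardware: the function name, the 32-byte line size, unbounded
caches, and write-allocate behaviour are all assumed, and any further
cost of a write near code is not modelled.

```python
def count_refills_write_invalidate(accesses, line_size=32):
    """Count L1 refills when only a data write invalidates the
    matching I-cache line; data reads and instruction fetches leave
    the other cache alone. Capacity is ignored.

    accesses: iterable of (kind, address), where kind is "I"
    (instruction fetch), "R" (data read) or "W" (data write).
    """
    icache, dcache = set(), set()
    refills = 0
    for kind, addr in accesses:
        line = addr // line_size
        own = icache if kind == "I" else dcache
        if line not in own:        # miss: the line must be (re)filled
            refills += 1
            own.add(line)
        if kind == "W":
            icache.discard(line)   # the only cross-cache invalidation
    return refills


reads = [("R", 0x100), ("I", 0x104)] * 10    # read next to code, then execute
writes = [("W", 0x100), ("I", 0x104)] * 10   # write next to code, then execute
print(count_refills_write_invalidate(reads),
      count_refills_write_invalidate(writes))
```

In this model the read-only trace costs just the 2 cold misses, while
the trace with writes costs 11 refills, because every write forces the
following instruction fetch to miss.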


- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/home.html

