Compiler Controlled Cache With Direct Mapped Subsets

From: Stanley Lass <stanlass@netins.net>
Newsgroups: comp.compilers
Date: 8 Jun 1996 22:18:12 -0400
Organization: Compilers Central
Keywords: architecture, storage

In a conventional cache, the allocation of data and
instructions to cache memory is largely by chance. In contrast, in a
compiler controlled cache with direct mapped subsets, the compiler
can allocate data and instructions to the direct mapped subsets in
the cache. The expected benefits include:
      1. eliminate most thrashing misses,
      2. need less cache memory for a comparable miss rate,
      3. reduce processor delay due to memory access by grouping block
accesses,
      4. and use cache addresses for fast access to the cache (having
prefetched the data as a group).
Each direct mapped subset behaves as a standalone cache, and
shares the cache memory access hardware. Each subset is described by
a start address in the cache memory and a size.
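As a rough sketch of what a subset descriptor and its address mapping
might look like (the C names and field layout below are my own
illustration, not a fixed format):

    /* Illustrative descriptor for one direct mapped subset; names and
       field widths are assumptions, not a fixed format. */
    struct subset_desc {
        unsigned start;     /* first slot of the subset in cache memory */
        unsigned blocks;    /* size of the subset, in cache blocks      */
    };

    /* An address indexes a subset exactly as in a direct mapped cache,
       except that the index is offset by the subset's start. */
    static unsigned subset_slot(struct subset_desc s, unsigned addr,
                                unsigned block_bytes)
    {
        unsigned block = addr / block_bytes;  /* block number in memory */
        return s.start + (block % s.blocks);  /* slot in cache memory   */
    }
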
The subsets cannot thrash each other; this eliminates most
thrashing misses. (Thrashing can still occur within a subset.)
Consider the simple example of computing the checksum of an
array larger than the cache. In a conventional cache with a least
recently used block replacement strategy, the blocks in the array
will replace all blocks in the cache, i.e. thrash the whole cache,
causing subsequent misses when data in the replaced blocks are
accessed. In contrast, in a cache with direct mapped subsets, the
array blocks can be confined to a two block subset, avoiding the
thrashing of the whole cache.
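As a small illustration of the difference (a toy C calculation, with
the cache and block sizes picked arbitrarily), a streaming pass over
an array four times the cache size touches every slot of the whole
cache, but only the two slots of the subset:

    #include <stdio.h>

    #define BLOCK_WORDS   16
    #define CACHE_BLOCKS  64   /* whole cache: 64 blocks (assumed size)   */
    #define SUBSET_BLOCKS  2   /* two-block subset reserved for the array */

    int main(void)
    {
        unsigned n_words = 4 * CACHE_BLOCKS * BLOCK_WORDS; /* array = 4x cache */
        char touched_whole[CACHE_BLOCKS] = {0};
        char touched_subset[SUBSET_BLOCKS] = {0};

        for (unsigned i = 0; i < n_words; i++) {
            unsigned block = i / BLOCK_WORDS;
            touched_whole[block % CACHE_BLOCKS] = 1;   /* streams through the whole cache */
            touched_subset[block % SUBSET_BLOCKS] = 1; /* confined to the subset          */
        }

        unsigned w = 0, s = 0;
        for (int i = 0; i < CACHE_BLOCKS; i++)  w += touched_whole[i];
        for (int i = 0; i < SUBSET_BLOCKS; i++) s += touched_subset[i];
        printf("slots disturbed: whole cache %u, two-block subset %u\n", w, s);
        return 0;
    }
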
Generally, each base register would have a direct mapped
subset associated with it. Memory accesses using a given base
register would go through the direct mapped subset associated with
that base register. The compiler would be responsible for planning
the use of the subsets during the execution of the program, i.e.
which data uses which subsets, the size of each subset, and how the
cache memory is allocated to subsets. The compiler would generate
instructions as needed to configure the cache subsets. The planning
would be somewhat similar to a compiler planning the use of a
register set.
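To make the planning analogy concrete, here is a sketch in C of what
the compiler's output might amount to; configure_subset() is an
invented stand-in for a cache-control instruction tied to a base
register, not a real interface:

    /* No-op stand-in for the cache-control instruction the compiler
       would emit; name and arguments are assumptions for illustration. */
    static void configure_subset(int base_reg, unsigned cache_start,
                                 unsigned blocks)
    {
        (void)base_reg; (void)cache_start; (void)blocks;
    }

    void scale(const double *src, double *dst, unsigned n)
    {
        /* Compiler-planned allocation of cache memory to subsets, much
           like register allocation: each stream gets a small subset. */
        configure_subset(1, 0, 2);   /* base register 1 (src) -> 2 blocks */
        configure_subset(2, 2, 2);   /* base register 2 (dst) -> 2 blocks */

        for (unsigned i = 0; i < n; i++)
            dst[i] = 2.0 * src[i];   /* loads via reg 1's subset, stores via reg 2's */
    }
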
Most of the time, each kind of data will have its own
subset. Often, a single data item, e.g. a record, will be loaded
into a subset, then when processing is finished, the next data item
of the same kind will be loaded into the same subset. With several
subsets to utilize, the compiler can load a data item into a subset,
then not disturb it until processing is finished. Because the loaded
data item will not be disturbed, the start address of the data item
in the cache can be passed to the processor, and then the processor
can access the data item using cache addresses. (Cache address
wraparound within a subset must be handled properly.) The data item
can span a number of blocks, and be fetched as a group. On average,
I'd expect that this would result in getting ~2 blocks per DRAM row
access. (A DRAM row access followed by a 2 block transfer takes
significantly less time than 2 independent DRAM row accesses, each
followed by a 1 block transfer.) In some cases, the transfer time of
blocks after the first one will be masked by processing.
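For a feel of the numbers, here is the comparison worked out with
made-up but plausible DRAM timings (60 ns for a row access, 30 ns per
block transfer; the figures are illustrative only):

    #include <stdio.h>

    int main(void)
    {
        double t_row = 60.0, t_block = 30.0;  /* assumed timings, in ns */

        double grouped     = t_row + 2.0 * t_block;   /* one row access, two blocks */
        double independent = 2.0 * (t_row + t_block); /* two separate row accesses  */

        printf("grouped fetch of 2 blocks:     %.0f ns\n", grouped);     /* 120 ns */
        printf("two independent block fetches: %.0f ns\n", independent); /* 180 ns */
        return 0;
    }
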
For activation records, the compiler could size the subset
to hold roughly the last four activation records. Before accessing
an activation record, an instruction would be executed which fetched
any blocks not present. Interlocks would prevent attempted usage of
data before it was available.
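A small illustration of the sizing (frame size and count are
arbitrary): with frames mapped directly by call depth, the four most
recent activation records occupy distinct slots, and a frame is only
displaced once the call depth has advanced past it by four:

    #include <stdio.h>

    #define SUBSET_FRAMES 4  /* subset sized for roughly the last four frames */

    int main(void)
    {
        for (unsigned depth = 0; depth < 8; depth++)
            printf("frame at depth %u -> subset slot %u\n",
                   depth, depth % SUBSET_FRAMES);
        return 0;
    }
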
For a cache with direct mapped subsets, I'm assuming that
the compiler will use program flow analysis to do a good job of
reusing the blocks containing data that is no longer needed. There
are two cases here, 1. the data in the subset is no longer needed,
after which new data can be loaded, and 2. the subset itself is no
longer needed, e.g. when leaving an area of the program.
A subset used with an array of data items will typically be
either large enough to hold one data item or all of them. An
exception is table lookups where the frequency and locality of the
lookups affect the optimum size of the subset. Second level cache
subsets will tend to be larger, e.g. where the first level cache
holds one data item of an array, the second level cache may hold the
whole array.
Most of the time, data of a kind will go through a single
subset. For this, there is no coherency problem. However, when two
or more subsets are used to process the same kind of data, and
updating occurs in one of the subsets, then coherency between
subsets needs to be maintained. To maintain coherency, the compiler
could cause the updated data to be written back from one cache subset
and, if the data is present in the other subset, invalidate it. In
addition, before accessing data in the other subset, an instruction
would be executed which fetched any blocks not present. To maintain
cache coherency in a multiprocessor system, a similar technique
could be used. Usually, only one or two subsets in each processor
would be subject to incoherency.
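A sketch of the hand-off sequence the compiler might insert when
updated data moves from one subset to another (writeback_subset,
invalidate_range, and fetch_missing are invented names standing in
for cache-control instructions):

    /* No-op stand-ins for the cache-control instructions; the names
       and signatures are assumptions for illustration only. */
    static void writeback_subset(int subset) { (void)subset; }
    static void invalidate_range(int subset, const void *p, unsigned bytes)
    { (void)subset; (void)p; (void)bytes; }
    static void fetch_missing(int subset, const void *p, unsigned bytes)
    { (void)subset; (void)p; (void)bytes; }

    void hand_off(int writer_subset, int reader_subset,
                  const void *data, unsigned bytes)
    {
        writeback_subset(writer_subset);              /* push updated blocks out       */
        invalidate_range(reader_subset, data, bytes); /* drop any stale copies         */
        fetch_missing(reader_subset, data, bytes);    /* reload before the next access */
    }
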
A cache with direct mapped subsets naturally has the same
large block size that a direct mapped cache has.
If the last block accessed in each subset is retained, then
I estimate that ~50% of the time, the next operand requested from
that subset will be in the retained block. Sequentially accessed
data helps bring the average up. This would reduce the cache
bandwidth needed to support a single processor.
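In C terms, the retained-block check might look roughly like this
(names are illustrative); sequential accesses in particular keep
hitting the retained block and skip the full lookup:

    struct subset_state {
        unsigned start, blocks;  /* subset placement in cache memory    */
        unsigned last_block;     /* memory block number of the last hit */
        unsigned last_slot;      /* its slot, reusable without a lookup */
    };

    /* Return the cache slot for addr, using the retained block if it matches. */
    static unsigned subset_lookup(struct subset_state *s, unsigned addr,
                                  unsigned block_bytes)
    {
        unsigned block = addr / block_bytes;
        if (block == s->last_block)
            return s->last_slot;   /* ~50% of references, per the estimate */
        s->last_block = block;
        s->last_slot  = s->start + (block % s->blocks);
        return s->last_slot;       /* normal direct mapped path */
    }
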
Compared to a conventional cache, a cache with direct mapped
subsets is intuitively estimated to have ~40% of the miss delay and
need ~40-50% as much cache memory. The basis for the estimates
follows.


Basis for Estimate of Cache Performance
With current caches, the different sets of data within a
program are all mingled together in one large cache, where the
different sets of data occasionally thrash each other, and the only
real defense against the thrashing is to make the cache so large
that the thrashing is infrequent. Going to higher levels of
associativity, e.g. from a direct mapped cache to a four-way set
associative cache, helps somewhat.
From a different viewpoint, a conventional cache's hardware
does not know the likelihood of a block being accessed soon. A blind
strategy is used to choose the block to replace when a miss has
occurred, e.g. choose the "least recently used" block. This is often
a bad choice, perhaps causing a thrashing miss ~45% of the time. (A
thrashing miss occurs when the block which was replaced must itself
be reloaded.) One thrashing miss can lead to another thrashing miss.
There needs to be a sizable number of blocks in the cache which
won't cause a thrashing miss, perhaps on the order of 50%,
corresponding to half of the cache blocks being unused on average.
(The "unused" cases are that the block won't be accessed for a long
time or the block won't be accessed before being replaced.)
The above assumes that the demand for blocks is spread around
evenly. In fact, the mapped addresses of active blocks will sometimes
cluster, e.g. three active blocks mapping to the same cache set will
produce thrashing in a two-way cache. If you double the number of
sets by doubling the cache size, you lessen the frequency of
clustering. I estimate that this effect results in even more of the
cache being unused, perhaps 50-60% on average.
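The three-blocks-in-one-set case is easy to see in a toy simulation
(C, with LRU replacement in a single two-way set): after the
compulsory misses, every access misses, and every miss evicts the
block that is needed two accesses later:

    #include <stdio.h>

    int main(void)
    {
        int way[2] = {-1, -1};  /* block currently held in each way     */
        int lru = 0;            /* index of the least recently used way */
        unsigned misses = 0, accesses = 0;

        for (int pass = 0; pass < 100; pass++) {
            for (int b = 0; b < 3; b++) {        /* blocks A, B, C in rotation */
                accesses++;
                if (way[0] == b) { lru = 1; continue; }  /* hit in way 0 */
                if (way[1] == b) { lru = 0; continue; }  /* hit in way 1 */
                misses++;
                way[lru] = b;    /* replace the LRU block...              */
                lru = 1 - lru;   /* ...which is needed two accesses later */
            }
        }
        printf("%u accesses, %u misses\n", accesses, misses);  /* every access misses */
        return 0;
    }
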
I estimate that ~50% of all misses are thrashing misses, and
that most of these can be avoided. The estimate of ~40% as much miss
delay comes from eliminating most thrashing misses, and the grouping
of DRAM accesses. The latter both reduces the average block access
time and sometimes masks some of the access delay.
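To spell the arithmetic out (my own rough figures): if roughly half
of all misses are thrashing misses and most of those are removed, the
miss count falls to roughly 0.5-0.6 of the conventional figure; if
grouping then cuts the average cost of a miss to roughly 0.6-0.7 of
an isolated block fetch (as in the DRAM timing sketch above), the
total miss delay comes to roughly 0.5 x 0.7, i.e. ~35-40% of the
conventional cache's.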


Summary and Comments
The compiler controlled cache with direct mapped subsets
provides the framework that enables the compiler's knowledge of
program flow and data usage to be utilized in planning the use of
the cache memory. Cache-like behavior is provided where it is useful,
e.g. in table lookups, and more generally, redundant block loads are
avoided, i.e. a block already in the cache subset is not loaded
again. By using
direct mapping in subsets, the check for a redundant block load is
nearly the same check as in a direct mapped cache.
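In C terms, the presence check might look roughly like this (names
are illustrative); it is the familiar direct mapped tag comparison
with the index simply offset by the subset's start:

    /* Check whether a block is already resident in its subset slot.
       slot_tags[] holds the memory block number cached in each slot. */
    static int block_present(const unsigned *slot_tags,
                             unsigned subset_start, unsigned subset_blocks,
                             unsigned addr, unsigned block_bytes)
    {
        unsigned block = addr / block_bytes;
        unsigned slot  = subset_start + (block % subset_blocks);
        return slot_tags[slot] == block;   /* already resident: skip the load */
    }
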
The compiler optimization of cache usage needs to take into
account the operating system's cache usage. Alternatively, the
operating system could have its own area of the cache in which it
allocated its subsets.
Fast response to interrupts can be assured by holding some
portion of the interrupt processing code in an inactive subset, ready
to be used in processing an interrupt.
A second level cache with subsets can support a number of
processors on the same chip without thrashing degradation. Code and
data to be shared among processors can reside in subsets which are
shared. Code and data not to be shared can reside in private subsets.
The techniques used to avoid striding through a disc memory
could be used to avoid striding through DRAM.
Striding through a DRAM row allows fetching just the needed
data. For example, an 8x8 submatrix could be stored row 1, row 2, etc.
in a DRAM row, then the submatrix row accesses would be by contiguous
bytes and submatrix column accesses would be by strided access through
the DRAM row, picking up a value from each submatrix row. In the
second level cache, the matrix rows and columns would be arrays of
stride one. Similarly, a field from sequentially stored records could
be picked up and stored in the second level cache with a stride of one.
This technique would provide a significant reduction in the bandwidth
needed to transfer the strided data, i.e. only the data needed in a
block is transferred, not the whole block every time. Plus, it groups
the DRAM accesses such that one DRAM access can provide many data
values. (This assumes that the DRAM access can include the striding
specification.)
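As a concrete picture of the submatrix case (C, with the layout and
sizes assumed for illustration): the 8x8 submatrix sits row after row
within one DRAM row, so a submatrix row is eight contiguous values
and a submatrix column is a stride-8 walk through that same DRAM row:

    #include <stdio.h>

    int main(void)
    {
        double dram_row[64];   /* the whole 8x8 submatrix in one DRAM row */
        for (int r = 0; r < 8; r++)
            for (int c = 0; c < 8; c++)
                dram_row[r * 8 + c] = 10.0 * r + c;

        int c = 3;             /* gather column 3 with stride 8 */
        double column[8];
        for (int r = 0; r < 8; r++)
            column[r] = dram_row[c + 8 * r];  /* one value per submatrix row */

        for (int r = 0; r < 8; r++)
            printf("%g ", column[r]);         /* 3 13 23 ... 73 */
        printf("\n");
        return 0;
    }
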
Overall, I estimate that in exchange for the added compiler
complexity, one can reduce processor costs considerably, perhaps ~40%,
and still get comparable performance.