prefetching (was:Re: Future of architecture) (Mark Smotherman)
Fri, 10 Nov 1995 21:06:48 GMT

          From comp.compilers


Newsgroups: comp.arch,comp.compilers
From: (Mark Smotherman)
Keywords: architecture
Organization: Clemson University
References: <47kodf$> <4807v2$>
Date: Fri, 10 Nov 1995 21:06:48 GMT

Have these software prefetch techniques been investigated? If so, who has
published them and/or who is doing them in a production compiler/linker?
Are they wins, and if so, by how much?

1. instruction and data prefetch around subroutine calls?

      - upstream from a procedure call, issue an instruction prefetch for
              the procedure entry point

      - upstream from a procedure call, issue a data prefetch (or line
              allocate, which omits refill) for the procedure's stack frame

      - upstream from a procedure return, issue a data prefetch for
              the caller's stack frame

      - let the linker associate global data areas with the procedures that
              use these areas, and thus upstream from a procedure call, have the
              linker insert data prefetches for the associated global data areas
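
To make the last idea concrete, here is a rough hand-written sketch of what such an inserted data prefetch might look like at the source level. It assumes GCC or Clang, whose `__builtin_prefetch` is only a hint; the names `histogram`, `update_histogram`, and `process` are made up for illustration, and a linker-driven scheme would of course insert the prefetch without source changes. I know of no portable C form of an instruction prefetch, so only the data case is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC/Clang builtin; on other compilers the hint is dropped. */
#if defined(__GNUC__)
#define PREFETCH_FOR_WRITE(p) __builtin_prefetch((p), 1, 3)
#else
#define PREFETCH_FOR_WRITE(p) ((void)(p))
#endif

/* Hypothetical global data area used only by update_histogram(). */
static int histogram[256];

static void update_histogram(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        histogram[buf[i]]++;
}

unsigned process(const unsigned char *buf, size_t n)
{
    assert(buf != NULL);

    /* Upstream of the call: hint the callee's global data area.
       This touches only the first cache line of the table. */
    PREFETCH_FOR_WRITE(&histogram[0]);

    /* Independent work that gives the prefetch time to complete. */
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];

    update_histogram(buf, n);
    return sum;
}
```

Whether this wins depends on how much independent work sits between the hint and the call, and on whether the data would have been resident anyway.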

2. heap management tricks?

      - in-line a routine at each malloc call site that initially allocates
              a contiguous region of multiple blocks (each of the request size)
              and then doles these out as it is re-invoked. This is similar to
              the logical record/physical record handling in I/O and might help
              increase spatial locality in large-line-size caches. I know of
              malloc implementations that keep separate free lists based on
              fixed-size allocations, but call-site-specific allocation seems
              like it could increase locality further

      - some students and I tried adding a prefetch pointer to a linked-list
              structure so that we could process three list nodes per
              list-traversal iteration; we obtained a 24% improvement in
              per-node time on an Alpha 21164 (but we got a 162% improvement
              by using a circular buffer - see the second bullet in
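
One possible reading of the prefetch-pointer idea, sketched under assumptions: each node carries an extra pointer to the node a few links further on, and the traversal issues a prefetch through it. The field name `ahead`, the distance `AHEAD`, and the helper names are hypothetical, and the code is not meant to reproduce the measured experiment. It relies on the GCC/Clang `__builtin_prefetch` hint.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC/Clang builtin; on other compilers the hint is dropped. */
#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH(p) ((void)(p))
#endif

#define AHEAD 3  /* hypothetical prefetch distance, in nodes */

struct node {
    int value;
    struct node *next;
    struct node *ahead;  /* node AHEAD links further on, or NULL */
};

/* Fill in the ahead pointers of an already linked list. */
static void set_ahead_pointers(struct node *head)
{
    struct node *lead = head;
    for (int i = 0; i < AHEAD && lead != NULL; i++)
        lead = lead->next;

    for (struct node *n = head; n != NULL; n = n->next) {
        n->ahead = lead;
        if (lead != NULL)
            lead = lead->next;
    }
}

/* Traverse the list, hinting the node AHEAD links ahead on each step. */
static long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->ahead != NULL)
            PREFETCH(n->ahead);
        total += n->value;
    }
    return total;
}
```

The extra pointer costs space in every node and must be kept up to date on insertions and deletions, which is part of why a contiguous layout such as a circular buffer can do better.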

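The call-site allocation idea from the first heap bullet might be sketched as follows. This is a minimal illustration, not a production allocator: `site_pool`, `site_alloc`, `block_stride`, `BLOCKS_PER_CHUNK`, and `new_node` are hypothetical names, each call site is assumed to request a single fixed size, and blocks are never returned individually. It needs C11 for `max_align_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCKS_PER_CHUNK 64  /* hypothetical chunk size, in blocks */

/* One pool per call site: a contiguous chunk cut into equal blocks. */
struct site_pool {
    char  *next;       /* next unused block in the current chunk */
    size_t remaining;  /* blocks left in the current chunk */
    size_t stride;     /* block size in bytes, rounded up for alignment */
};

/* Round a request up to a multiple of the strictest alignment. */
static size_t block_stride(size_t size)
{
    size_t align = _Alignof(max_align_t);
    return (size + align - 1) / align * align;
}

static void *site_alloc(struct site_pool *pool, size_t size)
{
    assert(size > 0);
    size_t stride = block_stride(size);

    if (pool->remaining == 0) {
        /* Allocate a contiguous region of multiple blocks up front. */
        char *chunk = malloc(stride * BLOCKS_PER_CHUNK);
        if (chunk == NULL)
            return NULL;
        pool->next = chunk;
        pool->remaining = BLOCKS_PER_CHUNK;
        pool->stride = stride;
    }
    /* Assumption: every request from this site has the same size. */
    assert(pool->stride == stride);

    /* Dole out the next block of the chunk. */
    void *block = pool->next;
    pool->next += stride;
    pool->remaining--;
    return block;
}

struct node { int value; struct node *next; };

/* Example call site with its own private pool. */
struct node *new_node(int value)
{
    static struct site_pool pool;  /* zero-initialized, private to this site */
    struct node *n = site_alloc(&pool, sizeof *n);
    if (n != NULL) {
        n->value = value;
        n->next = NULL;
    }
    return n;
}
```

Successive nodes from `new_node` come out adjacent in memory, which is the spatial-locality effect hoped for; the price in this sketch is that individual blocks cannot be passed to `free`.
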
Mark Smotherman, Computer Science Dept., Clemson University, Clemson, SC
