Re: Memory allocator performance in C/C++ compilers



From: David Chase <chase@world.std.com>
Newsgroups: comp.compilers
Date: 25 Jan 1999 21:55:32 -0500
Organization: NaturalBridge LLC
References: 99-01-081
Keywords: storage, performance

ed mckenzie wrote:


> Am I missing something? Are there other things influencing allocator
> performance besides block size, number of blocks, and allocation
> pattern? Or is VC++'s memory allocator just slower than the other
> compilers?


Long, long ago, this paper was written comparing various memory
allocators:


AUTHOR = "Norman R. Nielsen",
TITLE = "Dynamic Memory Allocation in Computer Simulation",
JOURNAL = cacm,
YEAR = 1977,
VOLUME = 20,
NUMBER = 11,
MONTH = nov,
PAGES = {864--873}


and I have seen little better since then (Cartesian trees have been
invented since, there has been some more work on the buddy system,
and there has been more measurement).


The short answer is that you do quite well with multiple free lists,
one per range of sizes, for small objects. For large objects (>= 4k,
which typically take enough time to initialize that you can afford to
think a little harder about their allocation) you might try something
like a buddy-system allocator, or a first-fit address-ordered free
list with coalescing on free. Little objects can be allocated in a
Big Bag Of Pages (BIBOP), meaning that you know their size from the
page that they are in. The free list for big objects should NOT, NOT,
NOT be threaded through the objects themselves, because that leads
to atrocious VM behavior when you search the list, or when your
good-citizen program decides to free the 72Mb of storage that it
allocated 4k at a time.
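

As a rough sketch of the BIBOP idea (my illustration, with made-up
names, not code from any particular allocator): keep a side table,
indexed by page number, holding the size class of each page, so the
size of a small object follows from its address alone and nothing
ever has to be stored in, or read from, the object itself to free it:


    #include <stdint.h>

    #define PAGE_SHIFT 12                       /* 4096-byte pages */
    #define MAX_PAGES  (1u << 20)               /* covers a 4Gb address space */

    static unsigned char page_class[MAX_PAGES]; /* size class of each page */

    unsigned size_class_of(void * p) {
          return page_class[(uintptr_t) p >> PAGE_SHIFT];
    }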


When picking free-list sizes with multiple free lists, there is a
tradeoff in how you might waste memory. Given that you are allocating
a minimum of one page per list size, there is a risk of allocating an
entire page for a single object, for several different sizes. On the
other hand, if the request is for 12 bytes, you really ought not round
it up to 64. In practice, the sizes I have selected by default are
powers of two plus the in-between sizes, if any exist. Thus, you get


      8, 16, 24, 32, 48, 64, 96, 128, 192, 256, etc


except that you need to keep in mind that you are placing these in
batches from a fixed-size page. Thus, if the page size is 4096, then
the next size in the sequence is not 384, it is 408. The reason is
that if you can only store 10 objects on a page, it should be the 10
largest possible objects, and 408 is the largest object that you can
get 10 of on a 4096-byte page.
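

A sketch of that bucket-size calculation (illustrative code, not from
the original post): for each candidate size, see how many objects fit
on a page, then take the largest multiple of eight that still fits
that many.


    #include <stdio.h>

    #define PAGE_SIZE 4096u

    /* Largest multiple-of-8 size that fits as many objects on a
       page as the candidate size does. */
    static unsigned adjust(unsigned candidate) {
          unsigned per_page = PAGE_SIZE / candidate;
          return (PAGE_SIZE / per_page) & ~7u;
    }

    int main(void) {
          unsigned pow;
          printf("%u ", adjust(8));
          for (pow = 16; pow <= 512; pow *= 2)
                printf("%u %u ", adjust(pow), adjust(pow + pow / 2));
          printf("\n");  /* 8 16 24 32 48 64 96 128 192 256 408 512 816 */
          return 0;
    }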


You can also do reasonably well (based on benchmarks I did years ago
playing with this) with slightly adaptive algorithms that sample
free-list-empty events to see if perhaps you should customize a free
list; the adaptive algorithm performs a similar calculation. If you
find that you are allocating a great many 416-byte objects (9 per
4096-byte page), then you might consider creating a 448-byte bucket
(the largest multiple of eight that goes into 4096 nine times).
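

One way the adaptive part might look (a sketch with hypothetical
names and threshold; the original describes sampling these events,
which this simplifies to counting them all): count refill events per
request size, and once a size is hot, hand back a custom bucket size
computed by the same largest-multiple-of-eight rule.


    #define PAGE_SIZE  4096u
    #define MAX_DWORDS 512             /* largest small size, in doublewords */
    #define THRESHOLD  64              /* refills seen before specializing */

    static unsigned refill_count[MAX_DWORDS];

    /* Called on a free-list-empty event for a request of `bytes` bytes;
       returns a custom bucket size once the size is hot, else 0. */
    unsigned maybe_specialize(unsigned bytes) {
          unsigned dwords = (bytes + 7) >> 3;
          unsigned per_page;
          if (dwords >= MAX_DWORDS || ++refill_count[dwords] < THRESHOLD)
                return 0;
          per_page = PAGE_SIZE / bytes;          /* 9 for 416-byte objects */
          return (PAGE_SIZE / per_page) & ~7u;   /* 448, 9 per 4096-byte page */
    }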


Also, for small sizes, use direct table lookup to determine the free
list to use. Thus, a typical allocation looks like:


    void * alloc(unsigned int size) {
          int which_list;
          /* convert byte size to doubleword count */
          size = (size + 7) >> 3;
          if (size < BIG) {
                struct list ** ptr_to_head, * head;
                which_list = size_to_list[size];
                ptr_to_head = lists + which_list;
                head = * ptr_to_head;
                if (head != 0) {
                      * ptr_to_head = head -> next;  /* pop the head */
                      return (void *) head;
                } else {
                      /* REFILL LIST: carve a fresh page into objects
                         of this size, then retry */
                }
          } else {
                /* BIG: hand off to the large-object allocator */
          }
    }
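

The size_to_list table itself can be built once at startup; a sketch,
assuming a bucket_size[] array holding the byte sizes from the
sequence above in ascending order:


    static void init_size_to_list(void) {
          int d, b = 0;
          /* map each doubleword count below BIG to the smallest
             bucket whose byte size can hold it */
          for (d = 0; d < BIG; d++) {
                while (d * 8 > bucket_size[b])
                      b++;
                size_to_list[d] = b;
          }
    }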


So, for the fast path of alloc() above, the typical instruction count is


    prologue
    add, shift, compare, conditional-branch, indexed-load, add, load,
    compare, conditional-branch, load, store
    epilogue


However, this code is not thread-safe. To make it thread-safe, you
either add locking around loading "head" and storing back the new
pointer-to-head, with one lock per free list, or you do something
fancier. On a Pentium Pro, I am told that a bus-locking instruction
costs about 25 cycles, so add 50 cycles (one lock, one unlock) to the
cost of the sequence above. Other processors might be able to do it
more efficiently (with either load-linked/store-conditional or with
one of those combining-network busses to memory).
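

For the one-lock-per-free-list variant, a POSIX-threads sketch (the
lock placement follows the description above; N_LISTS is the number
of buckets, and each lock is set up with pthread_mutex_init at
startup):


    #include <pthread.h>

    static pthread_mutex_t list_lock[N_LISTS];   /* one lock per free list */

    void * alloc_locked(int which_list) {
          struct list ** ptr_to_head = lists + which_list, * head;
          pthread_mutex_lock(&list_lock[which_list]);
          head = * ptr_to_head;
          if (head != 0)
                * ptr_to_head = head -> next;
          pthread_mutex_unlock(&list_lock[which_list]);
          return (void *) head;    /* 0 means the list needs a refill */
    }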


Also, if you have a library-aware compiler, you can inline some or all
of that code, depending upon whether the underlying allocator is
adaptive-by-size or not and whether the (often constant) allocation
size is known at the call site. In that case, you might see an inline
allocation of the form:


                ptr_to_head = lists + Constant_determined_by_size;
                head = * ptr_to_head;
                if (head != 0) {
                      * ptr_to_head = head -> next;
                      result = (void *) head;
                } else {
                      result = refill(Constant_determined_by_size);
                      /* out-of-line call to the refill routine */
                }


Again, not thread safe.


David Chase
NaturalBridge LLC

