Shared libraries on systems without virtual memory

"Stefan Heinzmann" <stefan_heinzmann@yahoo.com>
7 Jun 2002 23:43:26 -0400


Newsgroups: comp.compilers
Organization: http://groups.google.com/
Keywords: linker, design

Hi all,


I am musing over a way to implement shared libraries / dynamic linking
for systems without virtual memory. This sort of thing is becoming
more interesting as people attempt to run "grown-up" operating systems
such as Linux on embedded systems with cheap processors that lack an
MMU (see for example µClinux). I have the outline of an idea for how
this could work, which I'll present at the end of this post.


I am interested in pointers to relevant literature as I haven't found
much on the net so far. I know about Apple's solution in the 68k era,
and of course I read John's book about Linkers & Loaders. What else is
out there?


Alternatively - or additionally - you may want to comment on my
current state of mind, which goes like this:




1. Motivation


Shared libraries and dynamic linking are standard practice on today's
desktop/workstation operating systems, but not in operating systems
for embedded systems. In pure embedded systems, where you can
construct the whole application including the operating system as a
single program image which you burn into ROM, it is much more sensible
to link everything statically, because that is most likely to result
in the smallest possible code size.


However, embedded systems are increasingly being built on the basis of
"grown-up" operating systems, because that permits reuse of a large
existing code base. An example area where this can be particularly
beneficial is network appliances, where code for network services that
was written for desktop or server machines can be lifted and used in
the appliance.


At the same time the hardware for these appliances needs to be as
cheap as possible. That leads to the desire to run services equivalent
to those found on larger computers on bare-bones hardware - including
processors without an MMU (hardware support for virtual memory). Hence
you want to have shared libraries and dynamic linking in small
embedded systems, in order to ease porting existing software while
minimizing memory consumption.


2. Differences between virtual memory and non-virtual memory systems


In a virtual memory system each process typically gets its own address
space. It is therefore possible to keep the static data of a shared
library separate for each process that uses it while sharing the code
- and still keep the data and the code at a fixed relative distance
from each other. Both code and static data can therefore be addressed
using PC-relative addressing. That allows placing the code/data
siamese twins at different addresses in each process, without needing
to relocate anything in the code.


In a non-virtual memory system, there's only one address space for the
whole system. If the code is supposed to be shared between processes,
but the static data is to be kept separate, you cannot have fixed
distances between them. Thus, it is customary to address code relative
to the PC and to address static data relative to a different register
(often called the static base register). In some processor
architectures with a small number of registers, this "loss" of a
register can have a noticeable impact on performance, but that's the
price you pay...
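
To make the split addressing concrete, here is a tiny C sketch - all
names are invented, and a global pointer stands in for the reserved
static base register (the OS would reload the real register on every
process switch):

```c
#include <assert.h>

/* Sketch of static-base addressing: the library code exists once,
 * but each process owns its own copy of the library's static data,
 * reached through a "static base" pointer that models the reserved
 * register. */

enum { OFF_COUNTER = 0 };            /* data offset, fixed at link time */

static _Alignas(int) char data_proc_a[64];  /* process A's data copy */
static _Alignas(int) char data_proc_b[64];  /* process B's data copy */

static char *static_base;            /* stands in for the base register */

/* Shared library code: a single copy, addressing its static data
 * relative to static_base instead of at a fixed absolute address. */
static void bump_counter(void)
{
    ++*(int *)(static_base + OFF_COUNTER);
}

static int read_counter(void)
{
    return *(int *)(static_base + OFF_COUNTER);
}
```

Switching `static_base` between `data_proc_a` and `data_proc_b` is the
moral equivalent of a context switch: the same code then operates on a
different process's counter.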


PC-relative addressing isn't really necessary in the second case
because the code of a shared library will of course be at the same
address for all processes. Instead of PC-relative addressing, absolute
addressing could therefore be used for procedure calls and jumps.


Another difference comes from repeatedly loading and unloading
applications and libraries: a non-virtual memory system will
potentially suffer more from memory fragmentation than a (paged)
virtual memory system, because the latter can map pages scattered
throughout physical memory such that a contiguous virtual memory image
results. The non-virtual memory system can't do this, unless it is
able to move regions of memory around (but imagine what that does to
pointers into them).


3. Imports and exports


If an application uses resources from a shared library, it imports
them. The shared library in turn exports these resources. Of course,
the shared library can itself import other shared libraries. As long
as the import graph is acyclic, the situation is fairly simple,
because you can arrange a loading order for the shared libraries that
guarantees that all imported symbols can be resolved at the time a
shared library is loaded. That is, a loaded library cannot import
anything from a library that gets loaded afterwards. In particular, a
shared library cannot import anything from an application.
Applications and shared libraries are both a sort of module. Let me
call imports that can be resolved when the module is loaded "direct
imports" (is there any standard nomenclature for this?). In contrast,
I'll call an import from a module that gets loaded afterwards a
"pending import". Pending imports have the property that they may have
to be resolved differently for each process.
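
The load-order argument can be illustrated with a toy topological sort
over a made-up dependency table (a sketch only - no real loader keeps
its graph in a fixed-size matrix like this):

```c
#include <assert.h>

/* Hypothetical module graph: deps[i][j] is 1 if module i directly
 * imports from module j.  If the graph is acyclic, repeatedly picking
 * any module whose imports are all already loaded yields an order in
 * which every direct import resolves at load time. */
#define NMOD 4
static const int deps[NMOD][NMOD] = {
    /* app (0) imports libs 1 and 2 */ {0, 1, 1, 0},
    /* lib 1 imports lib 3          */ {0, 0, 0, 1},
    /* lib 2 imports lib 3          */ {0, 0, 0, 1},
    /* lib 3 imports nothing        */ {0, 0, 0, 0},
};

/* Fill order[] with a valid load order; return 0 on success, or -1
 * if a cycle makes such an order impossible. */
static int load_order(int order[NMOD])
{
    int loaded[NMOD] = {0};
    for (int n = 0; n < NMOD; n++) {
        int picked = -1;
        for (int m = 0; m < NMOD && picked < 0; m++) {
            if (loaded[m])
                continue;
            int ok = 1;
            for (int d = 0; d < NMOD; d++)
                if (deps[m][d] && !loaded[d])
                    ok = 0;                /* m still has pending deps */
            if (ok)
                picked = m;
        }
        if (picked < 0)
            return -1;                     /* cycle: nothing loadable */
        loaded[picked] = 1;
        order[n] = picked;
    }
    return 0;
}
```

For the table above the only possible first module is lib 3, and the
application necessarily comes last - exactly the "leaves first" order
the text describes.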


Whilst it would be nice if you could do without pending imports
(they're certainly not the normal case), they will have to be provided
if you want to support the general case; otherwise people will get
into trouble when they try to port common software, since static
linking supports this kind of situation (although you may have to
specify a library twice on the linker command line to make it work).


4. Referencing code in another module


If you want to call a function in another module, you import the
function's entry point. If it is a direct import, the module is
already loaded by the time your code is loaded, and the address of the
function's entry point is known. The loader can therefore fix up the
call address (patch it into your code) - that's why the loader is
really a (dynamic) linker. On many architectures this leads to code
that is no slower than a call within a module. If it is a pending
import, the call has to go indirectly through a pointer stored in the
static data segment of the module, for two reasons: first, the call
address is not yet known when the calling module is loaded, and the
code cannot be patched after loading; second, the call address may
differ depending on the process we're in, and we want to share the
code between processes. On all architectures I know of, an indirect
call is indeed slower than a direct one, so pending function imports
suffer a speed penalty and (due to the additional pointer in the
static data segment) a space penalty.
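
A small C sketch of the two call flavours (every name here is
invented; a function pointer stands in for the call instruction the
loader would patch):

```c
#include <assert.h>

static int direct_fn(int x) { return x * 2; }   /* in an already-loaded lib */
static int late_fn_a(int x) { return x + 10; }  /* resolution in process A  */
static int late_fn_b(int x) { return x + 20; }  /* resolution in process B  */

/* Direct import: the loader patches the call target once at load
 * time; this pointer models the patched call instruction. */
static int (*patched_call)(int);

/* Pending import: the call goes through a slot in the importing
 * module's per-process static data, so it can differ per process. */
struct module_data {
    int (*late_slot)(int);
};

static int call_late(struct module_data *d, int x)
{
    return d->late_slot(x);    /* indirect call: the speed penalty */
}
```

The direct call costs one patched instruction; the pending one pays
for the slot in the data segment and the extra indirection on every
call.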


It is of course also possible to call a function through a function
pointer. It is easy to imagine a situation where you don't know which
module the function belongs to, because you don't know where the
function pointer originally came from. Still, you want to be able to
compare two function pointers for equality. That means that a function
shouldn't have more than one entry point, and the function pointer
should point to this entry point. You may think that's self-evident,
but in some schemes the compiler emits small code fragments
(trampolines, thunks) which eventually transfer control to the
function. In such an environment it is very easy to lose the ability
to compare function pointers.
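
The hazard can be demonstrated in plain C (names invented): if one
pointer is taken via a forwarding stub and another directly, the two
behave identically but compare unequal.

```c
#include <assert.h>

static int target(int x) { return x + 1; }

/* Models a compiler-generated trampoline/thunk that forwards to the
 * real function - a second entry point in disguise. */
static int trampoline(int x) { return target(x); }
```

Calling through either pointer gives the same result, yet an equality
test between them fails - which is exactly why the scheme should give
every function a single canonical entry point.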


5. Referencing data in another module


The static data memory for each module gets allocated and initialized
when the process starts up. That may be long after the module gets
loaded, and if more than one process uses the module, the data segment
exists in several copies. In theory, all the segments of all modules
belonging to a process could be allocated as a single chunk of memory.
I think that this might not be the best idea because of memory
fragmentation. I believe it would be better to allocate each module's
data segment separately. The reason is that the allocator is then more
likely to find memory blocks that it can reuse: first, the blocks are
smaller, and second, the same sizes occur more often. That's pure
speculation on my part, however.


Given this arrangement, how does access to imported data work? What
you need for each process is a table in memory with pointers to the
base of each module's static data segment. The index into this table
(I call it the module table) is a module number that is assigned by
the dynamic linker/loader. Each shared library gets its own unique
module number when it gets loaded, for example by searching upwards
from 1 for the next free number. The application always gets number 0
(so that number is not unique). A function accesses static data by
first fetching the base pointer from the module table and then using
it to access the data. The pointer to the module table resides in a
register which only gets changed when the operating system switches
between processes.
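
Here is how that lookup might look, again as a hedged C sketch: a
global pointer models the reserved register, module number 3 and all
other names are made up.

```c
#include <assert.h>

enum { LIB_MODNUM = 3, OFF_X = 0, NMOD = 8 };

static int lib_data_a[4];   /* the library's data, process A's copy */
static int lib_data_b[4];   /* the library's data, process B's copy */

/* Per-process module tables: slot LIB_MODNUM holds the base of that
 * process's copy of the library's static data segment. */
static void *table_a[NMOD] = { [LIB_MODNUM] = lib_data_a };
static void *table_b[NMOD] = { [LIB_MODNUM] = lib_data_b };

static void **module_table;  /* stands in for the reserved register */

/* Shared library code: fetch the base pointer from the table, then
 * use it to reach the datum - the one extra indirection. */
static void set_x(int v) { ((int *)module_table[LIB_MODNUM])[OFF_X] = v; }
static int  get_x(void)  { return ((int *)module_table[LIB_MODNUM])[OFF_X]; }
```

Swapping `module_table` between `table_a` and `table_b` plays the role
of a process switch; the library code itself never changes.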


The module table gets initialized when the process starts up and does
not get altered until the process terminates. Its size is determined
by the total number of shared libraries simultaneously loaded in the
system. A limit on the size of this table will limit the number of
shared libraries that can be loaded at any time. Since each process
needs such a table, this leads to a corresponding memory overhead per
process. However, at process creation time the OS could determine the
highest module number used by the process and size the table
accordingly, making it only as large as it needs to be for each
process - but that would make it difficult to load modules dynamically
once the process is already running (plug-ins).


Access to static data in this manner uses one additional indirection
as compared to systems with virtual memory, because the base pointer
has to be fetched from memory first. In most relevant architectures
that will be one machine instruction. For direct imports, the module
number and the data offset with respect to the segment base are
constants. The former is known when the importing module gets loaded,
so it can be fixed up by the loader. The latter is known when the
imported module gets linked, so the value can also be fixed up in the
importing module by the loader. For pending imports, neither module
number nor offset from base are known when the importing module gets
loaded so the access has to go through a pointer in the importing
module's static data segment (similar to what I described for function
calls). That adds yet another indirection level.
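
A pending data import might then be sketched like this (all names
invented); the slot plays the same role as a GOT entry, filled in at
process start-up:

```c
#include <assert.h>

static int exporter_var;      /* defined in a module loaded later */

/* The importing module's own static data contains a pointer slot,
 * because neither module number nor offset were known when the
 * importer was loaded. */
struct importer_data {
    int *pending_slot;        /* filled in at process start-up */
};

static struct importer_data imp;   /* importer's per-process data */

/* Access path: importer's base pointer (elided here) -> slot ->
 * datum; one indirection more than a direct data import. */
static int read_pending(void)
{
    return *imp.pending_slot;
}
```
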


6. Discussion


An advantage of the described method is that each function is itself
responsible for fetching the relevant base pointer from the module
table. That means that each function can call any other function
without regard to which module it resides in - it doesn't need to take
care of adjusting any base pointers before doing the call. That makes
function pointers work naturally even for callbacks.


You may object that the additional level of indirection hurts
performance too much. That may or may not be the case, depending on
your priorities. But given that static/global data is rather unpopular
these days anyway, particularly in libraries - for example because of
the reentrancy problems it causes in multithreaded code - I don't mind
penalizing it to some extent. Another reason why static data is bad
comes with shared libraries: each application that uses the library
gets all the static variables for the whole library, whether it needs
them or not. In the statically linked case the linker would eliminate
the unused variables.


It would be sensible to avoid the indirection for static variables
defined in the application (as opposed to those defined in a shared
library). As an extreme case consider an application that is
statically linked. It would bear the cost of the indirection in
accessing static data even though it doesn't use the shared library
system at all. This case can be optimized as follows: The module table
and the application's static data segment are concatenated into one
single chunk of memory. The application then could access its static
data using a constant offset from the base of the module table rather
than having to read its base pointer from the table.
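
The proposed layout could be sketched as a struct (purely
illustrative; the field names and sizes are made up):

```c
#include <assert.h>
#include <stddef.h>

enum { NMOD = 8 };

/* The per-process chunk: the module table first, the application's
 * static data immediately after it, in one allocation. */
struct process_chunk {
    void *module_table[NMOD];   /* the reserved register points here */
    int   app_counter;          /* application static data follows   */
};

static struct process_chunk proc;
static void **module_table = proc.module_table;

/* Application code: a constant offset from the table base replaces
 * the load of the base pointer from slot 0. */
static int *app_counter_addr(void)
{
    return (int *)((char *)module_table
                   + offsetof(struct process_chunk, app_counter));
}
```

A statically linked application compiled this way pays nothing for the
shared library machinery beyond the (possibly empty) table itself.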


The pointers used for handling pending imports are similar to the GOT
used in the ELF object format. They would need to be initialized on
application startup.


Since procedure calls and data accesses differ depending on whether
they are for direct or pending imports, the compiler and the rest of
the tools need to know which is which. That means you'd have to use
attributes similar to Microsoft's __declspec(...) to specify which one
you want. Of course, just to get something ported you could default to
"pending", accepting the space and time overhead in order not to have
to tweak the code. Or you could default to "direct" and see where it
fails, so that you know where you need the "pending" attribute.


The scheme requires that some real linking is done when a module gets
loaded. That involves symbol lookup which means that the files will
have a symbol table (which needs space) and the dynamic linker will
need some time for actually performing the link, which may delay
program startup noticeably. Hash tables may help, but essentially this
is a price you'll have to pay.
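
For flavour, here is a toy symbol table with chained hashing - a
sketch of the kind of lookup the dynamic linker would do at load time
(ELF's .hash section works along similar, more elaborate lines; the
names below are invented):

```c
#include <assert.h>
#include <string.h>

enum { NBUCKET = 8 };

struct sym {
    const char *name;
    void       *addr;
    struct sym *next;    /* chain for colliding names */
};

static struct sym *bucket[NBUCKET];

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % NBUCKET;
}

static void sym_add(struct sym *s)
{
    unsigned h = hash(s->name);
    s->next = bucket[h];
    bucket[h] = s;
}

static void *sym_lookup(const char *name)
{
    for (struct sym *s = bucket[hash(name)]; s; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s->addr;
    return 0;            /* undefined symbol */
}
```

The space cost is the symbol table itself in each file; the time cost
is one hash plus a short chain walk per imported symbol at load time.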


If you want to support XIP (eXecute In Place) - meaning a file system
implemented in (part of) a memory device that is mapped into the
normal address space, such that you can access a file as a char array
- you need to "preload" the modules so that no more fixups are needed
before an application uses them. That involves assigning module
numbers and doing all the linking up front.


7. Epilog


So that was it. How wide of the mark was it? Did I explain it all
right? Where has such a scheme been implemented, and what was the
experience with it? Any thoughts and comments are welcome.




Finally, thanks for your patience while reading this lengthy article.


Cheers
Stefan

