Related articles |
---|
Survey of intermediate program representations? babic@cs.ubc.ca (Domagoj Babic) (2005-09-02) |
Re: Survey of intermediate program representations? koodailar@gmail.com (koodailar@gmail.com) (2005-09-14) |
RE: Survey of intermediate program representations? naveens@noida.hcltech.com (Naveen Sharma, Noida) (2005-09-15) |
Re: Survey of intermediate program representations? domagoj@engineer.com (Domagoj Babic) (2005-09-17) |
Re: Survey of intermediate program representations? sgganesh@gmail.com (Ganny) (2005-09-22) |
Re: Survey of intermediate program representations? manish.verma79@gmail.com (manish.verma) (2005-09-22) |
Re: Survey of intermediate program representations? DrDiettrich@compuserve.de (Hans-Peter Diettrich) (2005-09-25) |
Re: Survey of intermediate program representations? DrDiettrich@compuserve.de (Hans-Peter Diettrich) (2005-09-27) |
From: | "Ganny" <sgganesh@gmail.com> |
Newsgroups: | comp.compilers |
Date: | 22 Sep 2005 23:46:55 -0400 |
Organization: | http://groups.google.com |
References: | 05-09-01405-09-062 05-09-076 |
Keywords: | analysis |
Posted-Date: | 22 Sep 2005 23:46:55 EDT |
Hi Domagoj,
I wrote an article on this subject sometime back, that I left
incomplete and unpublished. Pl check if this answers your questions.
Others, pl feel free to give me your feedback, corrections and
suggestions.
Thanks!
-Ganesh
Intermediate Languages
Intermediate languages are familiar to compiler writers for quite a
long time. Most of the compilers employ them in one-way or other. The
way of using intermediate languages have also sometimes contributed to
the success of the languages like Pascal, Java and recently C# (with
.NET initiative in general). Even some tools like MS-Word makes use of
an intermediate language (p-code). Intermediate languages have made a
big difference in achieving portability of code. This article explores
the concept of intermediate languages and their use in achieving
complete portability and language interoperability.
In 1950s itself, this idea of having an intermediate language was
experimented with the introduction of UNCOL [1]. Later Pascal became
widely available because of p-code. Not until recently did Java employ
the same technology with its promise of portability to become a big
success. Now Microsoft's .NET initiative is also based on this
approach, but the keyword is language interoperability. A closer look
reveals that all are based on same idea: use of intermediate code, and
that is the current trend in the language translation technology.
Approaches in Language Translation
Two major approaches towards language translation are compilation and
interpretation. In compilation, since the native code is generated,
different compilers are needed for different platforms. An interpreter
is also particular to the platform, but its specifics are abstracted
making only the behavior of the program transparent as the code is
translated and executed 'on-the-fly'.
Neither of them is 'the' best approach - both boasts of significant
advantages and also suffer from some serious disadvantages.
The advantages of compilation mostly overweigh its disadvantages and
that is why it is preferred in most of the languages. The translation
and optimization needs to be done only once, and native code ensures
efficiency of execution. However as native code is generated, it is
platform specific.
Interpretation has some very good advantages that compilation doesn't
have. In particular, they can achieve full portability. Also it can
provide better interactive features and diagnostic (error) messages.
But one problem that cripples interpretation is performance. Since
translation needs to be done every time (and for the same reason,
optimization is less attractive) execution becomes generally very slow
(10 to 100 times) compared to the equivalent compiled code.
You can think of an interpreter as the one that implements a virtual
machine. So, it can be inferred that the 'machine language' of that
virtual machine is the high-level language for which the interpreter is
written!
Interpreters provide many advantages including greater flexibility,
better diagnostics and interactive features like debugging. True power
of interpretation is generally underestimated. To illustrate, you can
think of printf library routine in C as a tiny interpreter for
formatted output - depending on the requirement it parses the format
string, gets the arguments, computes the format on the fly, prints it
to the output along with facilities like padding, aligning, specific
notations etc.
Interpretation allows the code to be generated or modify itself "on the
fly" and execute it. For example, interpreted languages like Lisp or
even partly interpreted languages like Java or C# provides such
facilities (although not recommended to use such facilities for
practical programming). Note that few language features like reflection
are only possible with interpretation. This is the reason why compilers
cannot be written for some languages and interpretation is the only
option.
So, instead of restricting to pure compilation or interpretation, you
can infer that a combination of the both is better to enjoy the
advantages of these approaches.
It seems to be very simple to use both compilation and interpretation.
For that itself, various languages use different approaches.
Few languages provides options for both interpretation and compilation
and can be used depending on the requirement. For example, for
development and debugging, Microsoft Visual Basic uses an interpreter
and for deployment, it employs a compiler.
As already stated, writing compilers for the languages that are meant
for interpretation is sometimes not possible. In such cases, it is
possible to bundle the source code with the whole interpreter as an
executable file and distribute it. Java Native Interface (JNI) follows
a similar approach by embedding a Java interpreter in an executable
code created by a C/C++ source code, and run the Java code through it.
This solution is feasible because the Java interpreter's size is not
very big.
A very interesting approach towards combining compilation and
interpretation through hardware support was widely used in as early as
1960s in IBM chips. IBM released chips that incorporated the
interpreter software burnt into chips (nothing but firmware) that
supported a general instruction set. The main advantage being that the
same instruction set (of the interpreter) shall be used across
different processor architectures. Since, the compilers would generate
the code just for that one instruction set, the effort is minimized.
Later, the advent of VLSI design with RISC architecture obviated this
microprogramming approach.
The very popular and widely used approach that combines the compilation
and interpretation is making use of intermediate code directly. Since
its use in Java this approach seems to have received widespread
acceptance, even though its idea was used long back in Pascal and in
many other languages. Languages following this approach are sometimes
referred to as p-code languages since p-code of Pascal is one of the
first intermediate codes to follow this approach successfully. Few of
important p-code languages are Python, Perl, Java and now, C# (.NET
framework). The idea is to use a compiler to generate the intermediate
language code. The intermediate code can be moved on to other platforms
or even across the network where an interpreter (also called as virtual
machine) for that platform will take over to interpret the code.
The usage of intermediate code in compilers is not a new idea. There
are various phases in a compiler and they can be grouped generally as
the front-end and the back-end. The analysis and translation of the
code (in general 'parsing') is done by the front end and the synthesis
of the code is taken care by the back-end. The information in the form
of intermediate languages is represented that is generally independent
of any particular architecture and thus serves as an interface to
transfer the information from front-end to the back-end. The advantage
with this approach is that, the front-end can be same and only the
back-end needs to be changed to support new architectures and hence the
effort is limited only in rewriting the back-end of the compiler.
Intermediate languages also help in general optimization of the code
that are not particular to any platform and provides many related
advantages. So most of the compilers employ intermediate code.
If intermediate languages have been in usage for a long time in
compilers, why is this renewed interest?
The difference is that previously most of the compilers hid the use of
intermediate languages as 'implementation details'. But now, to
make the best use of them, they are extended, exposed and used in
conjunction with interpretation.
An advantage with the intermediate language approach is 'binary code
portability', i.e., the next level of portability promised by
languages like C/C++ that provides source code portability. As long as
an interpreter is there to understand and execute the intermediate
code, no recompilation is necessary to move the code to a new platform.
It also provides the ability to verify code and thus type-safety can be
ensured and as a result, ability to build secure systems on that.
Intermediate languages can also provide the advantage of 'language
interoperability'. As more than one language can be converted to the
same intermediate language, the code compiled through two different
high-level languages can interact at intermediate code level agnostic
of the different possible source languages. A related benefit is that
the number of translators that needs to be written is also drastically
reduced. This is a significant benefit as the cost of retargeting the
translator is high and writing a new one is very tedious. As a result
only a few translators need to be developed.
To illustrate, in traditional compilation approach, for N high-level
languages and M target machines, N * M translators are required. With
intermediate language approach, where the IL is capable of supporting N
high-level languages and M target platforms, then the number of
translators required effectively reduces to just N + M.
Apart from portability and interoperability, there are many features
that make the use of intermediate languages attractive. One of the
advantages that seems to be insignificant at the first look is that the
file size tends to be small compared to the equivalent native code.
That is one of the reasons why intermediate code approach was popular
around 1980s when disk and memory space was precious. The small size of
the code is advantageous even now: in the networked environment, the
small size of applets (nothing but Java class files) makes their
transmission across the network faster.
Approaches in Representing Intermediate Code
A challenge for an intermediate language is to satisfy seemingly two
contradicting goals:
1) to represent the source code without significant loss of information
2) to keep only the necessary information so that the source code can
be translated directly and efficiently to the target machine code
later.
Depending on the level of abstraction of the intermediate language, we
can classify them as high-level, medium-level or low-level ILs.
Usually the high-level ILs are represented as trees or DAGs (Directed
Acyclic Graphs) and sometimes as stack based. The middle-level ILs
typically use approaches like triples and quadruples. The low-level ILs
generally use code closer to the assembly language of the target
machine.
High-level ILs take more effort to get the code converted to the native
code, but are sufficiently high-level that it can represent rich
features of various source languages directly. So it is possible to
retarget new languages to high-level ILs. However, more effort is
needed in converting the intermediate code to native code. On the other
extreme, low-level ILs make the translation to native code easier, but
it is generally hard to retarget other high-level languages than that
it is originally intended for.
Another design approach is to abstract the programming language (for
example, ANDF) or the machine (for example, RTL) and an IL has to
strike a balance between the two extremes.
There are many methods used to represent the information at
intermediate level. Some widely used approaches are discussed here.
Three-address Code
In most of the language compilers, three address codes are generated as
intermediate code and later code-generators generate code for that
particular target machine. They are mostly register based, and are
fairly low-level, so target-code generation is straightforward. A
similar approach is Register Transfer Language (RTL) widely used in GNU
C compilers. Following is an example for three address code:
R1 := R2 + R3
And the general syntax is.
X := Y op Z
Object Files
An interesting approach to represent information at intermediate level
is object files. With each platform, the format of object code file and
the information stored differs. This makes them platform dependent. An
effort has been made to combine the information that is required in
various formats. It resulted in creating a huge format that contains
all such details, and is referred to as 'fat binaries' effectively used
in NExtStep platforms. On the other hand, 'slim binaries' are also
available that generates the portable object code 'on-the-fly'.
Tree Representations
ANDF [2,3] stands for Architectural Neutral Distribution Format,
abstracts high-level language constructs. It aims at converting various
language constructs in a programming language to an intermediate form
that shall retain all the information available in the source. It takes
care to represent different language semantics, platform dependencies
etc. Languages including C, C++, Fortran, Ada has ANDF producers
(high-level compilers). This is achieved through an intermediate
representation of the language constructs (keeping its variants in
various programming languages). For example, expressions in the source
language are converted into EXPs, identifiers as TAGs, parameterization
as TOKENs etc. ANDF follows a tree representation of constructs called
CAPSULEs. It also has installers (low-level compilers) in wide variety
of platforms including Dec-Alpha, Sun-Sparc, PowerPC and Intel and
executed as native code.
Using another High Level Language!
Instead of creating a new intermediate language, this approach is
towards using a high-level language as a intermediate language that is
already available, highly portable, efficient and sufficiently
low-level. As you would have guessed, C is an interesting high-level
language with ability to write low-level code. Owing to this property,
it is sometimes referred to as a portable assembler or 'middle-level'
language. The beauty of C is that, in spite of its low-level abilities,
the code is highly portable. Due to this nature, C has effectively
served as a form of a portable intermediate language for other
high-level programming languages. In fact, in initial days of UNIX,
translators generating C code as target code was written so that
re-writing them for every platform were not required. Many languages
follow this approach, including Mercury, Scheme, APL, Haskell etc. Also
GNU Fortran and Pascal compilers (f2c and p2c) generate C code and
native C compilers take over to compile and execute them.
Stack-Based Machines
This representation assumes the presence of a run-time stack and
generates code for that. It uses the stack for evaluation of
expressions by making use of stack itself to store the intermediate
values. Thus the underlying concept is very simple and that is its main
strength. Also, least assumptions are made about the hardware and
support available in the target architecture. When a virtual machine
(runtime interpreter) simulates a stack-based machine and provides
necessary resources, this approach can effectively provide
platform-independence. In other words, the virtual machine becomes the
platform of its own that shall be simulated in possibly any hardware
platform (and thus needs to be written for those platforms). One
problem with this approach however is that the code optimization is
difficult. Java virtual machine is a very good example for such
stack-based representation.
Other major efforts include Oberon Module Interchange (OMI) [4], Juice
[5] that is based on OMI, Dynamic binding and VP code of TAO operating
system and U-Code.
Implementations Based on Stack Approach
Since stack based machines are one of the most successful ones and is
gaining widespread acceptance in recent times, let us discuss few
implementations based on that in detail here.
P-code:
P-system was popular in 1970-80s as an operating system that made use
of p-code [6]. Compilers would emit p-code, an architectural neutral
one that was executed by a virtual machine. UCSD Pascal was the most
popular compiler that generated p-code and also the whole p-system was
written in UCSD Pascal itself making it easier to port to new
architectures with much ease.
Pascal implementations contained a compiler generating p-code and later
an interpreter will take over to execute that p-code. The interpreter
was very simple and so it was easy to write an interpreter for a new
architecture. With that interpreter, a Pascal compiler in the form of
p-code can be used to compile Pascal programs that can in-turn be run
on the interpreter.
However, this steps need to be done only in initial steps. Later the
compiler can be modified to generate native code, and once, the Pascal
compiler itself is given as input for the Pascal compiler in p-code,
the compiler becomes available in native code. This process is referred
to as 'bootstrapping', that is generally used in porting compilers to
new environments. But the point here is that this 'bootstrapping'
process becomes easy and elegant with the p-code approach. This led to
the popularity and widespread use of Pascal compilers.
Java Bytecodes:
Java[7] was originally intended to be used in embedded systems and
set-top boxes and thus a heterogeneous environment. Internet was
becoming very popular at that time and there was no language that best
suited that heterogeneous environment. Java suited for that purpose
very well and thus became a sort of lingua franca of Internet.
The technology is as we already saw of intermediate languages. Java
compiler converts the source into intermediate code - bytecode and pack
with related information in class files. These class files (applets or
applications) can be distributed across the network and executed in any
platform provided that a Java Virtual Machine [8] is available there.
The advantage of this technology is, as everybody knows now, 'write
once, run anywhere'. Programmers need not worry about the target
platform. They just concentrate on the problem and develop the code.
To illustrate, for each and every class or interface defined, a class
file is generated. Every method is compiled into byte codes - the
instructions for the hypothetical microprocessor. For example:
int a = b + 2;
this may be converted to the bytecodes as,
iload_1 // push the variable 1 (int type) to evaluation stack
iconst_2 // push the constant value 2 (int type) on the stack
iadd // pop two integer values from stack, add them, push the
// result back.
istore_2 // pop the int value from the stack and store it in variable
2 // (int type)
All the information that you give inside the class definition is
converted and get stored to form an intermediate class file.
The intermediate class file format is a specification given by Sun. It
is a complex format and holds the information like the methods declared
along with the byte codes, the class variables used, initializing
values (if any), super class info, signatures of variables and methods,
etc. It has a constant pool - a per class runtime data structure, which
is similar to the symbol table created by the compiler.
Virtual machine plays a major role in Java technology. All the Java
virtual machines have the same behavior, as defined by SUN, but the
implementation differs. It forms a layer over the physical machine in
the memory.
Even though the Java technology provides the basis for achieving full
portability, Java's design aids to make it architectural neutral. To
illustrate, lets compare it with a C design consideration with that of
Java. C gives much importance to efficiency and for that, it leaves
size of its data types (except char) to be implementation-dependent.
But Java primitive types are of predetermined size, independent of the
implementation. In general, Java improves upon C by removing the
constructs having various behavioral types in C by having well defined
behavior instead.
JVMs are available in very wide variety of platforms. However, Java
class file format and bytecodes are closely tied with the Java language
semantics. So, it becomes tough to retarget other language compilers to
generate Java class files. Few of the languages including Eiffel, Ada
and Python have compilers generating class files to be executed by
JVMs. Since the class files and bytecodes are not versatile enough to
accommodate wide range of languages, class files (and bytecodes)
doesn't fare as a feasible universal intermediate language.
.NET Architecture
.NET architecture addresses an important need - language
interoperability, the concept that can change the way programming is
done and is significantly different from what Java offers. If Java came
as a revolution providing platform independence, .NET has language
interoperability to offer.
To illustrate, JVM is implemented for multiple platforms including
Windows, Linux, Solaris, OS/2 etc. But it has only few languages
targeting at JVM to generate class files. On the other hand, .NET
supports multiple languages generating code (MSIL - Microsoft
Intermediate Language) targeting Common Language Runtime (CLR). This
list includes, C#, VB, C, C++, COBOL, etc. and nevertheless this list
is an impressive one. CLR, the runtime interpreter of .NET platform
(equivalent of JVM of Java) is currently implemented only for Windows
platform and efforts are underway to implement it for Linux and other
platforms. A Java program can be compiled on a PC and can be executed
by a Mac or a Mainframe. Whereas with .NET, we can write a code in
COBOL, extend it in VB and deploy it as a component. Thus the Java and
.NET have fundamentally different design goals. However, you may have
noticed that they partially satisfy the two different requirements for
universal intermediate language: to support different source languages
and target architectures.
In .NET, the unit of deployment is the PE (Portable Executable) file -
a predefined binary standard (similar to class files of Java). It is
made up of collection of modules, exported types and resources and is
put together as an .exe file. It is very useful in versioning,
deploying, sharing, etc. The modules in PE file are known as
assemblies.
The IL is designed such that it can accommodate wide range of
languages. Also at the heart of the .NET is the type system where the
types are declared and the metadata of how to use the types is also
stored. One of the important differences between MSIL and bytecodes is
that MSIL is type-neutral. For example, iconst_0 is a bytecode that
tells to push integer value 0 on the stack, meaning that the type
information is kept in the bytecodes. But the similar IL code just
tells to push four bytes, meaning that no type information is passed
on. <<give the IL code and verify later>>
With .NET you can write a class in C#, extend it in VB and instantiate
it in managed C++. This gives you the ability to work in diversified
languages depending on your requirement and still benefit from the
unified platform to which the code for any .NET compliant language is
compiled into.
It should also be noted that each language is versatile and have unique
features and peculiarities of their own. ...
Summary
The benefits of using intermediate languages are well known, and the
current trend is towards achieving fully portable code and language
interoperability. To understand the concept of portable code
compilation, the understanding of the conventional translation
technology and their advantages/disadvantages needs to be known. There
are many issues involved in intermediate code design and understanding
them shall enable us to appreciate the various efforts to generate
portable code.
Way back in 1950s itself, the first effort towards achieving
portability was taken through UNCOL initiative. Later, during 1970-80s,
Pascal in the form of p-code picked up the idea. In 1995s Java came as
a revolution to capture the imagination of the language world. Now it
is .NET's turn with the age-old concept of language interoperability.
Thanks to such technological advances, achieving 100% portable code
with language interoperability is no more a dream. It is clear that the
concept of intermediate languages is here to stay for a long time and
we are not far from having a universal intermediate language.
References:
[1] see ftp://riftp.osf.org/pub/andf/andf_coll_papers/UNCOLtoANDF.ps
[2] see ANDF home page: http://www.andf.net/
[3] refer 'Architecture Neutral Distribution Format (XANDF)', Xopen Co.
Ltd., 1996
[4] see http://huxley.inf.ethz.ch/~marais/OMI.html
[5] see http://www.ics.uci.edu/~juice/
[6] see http://www.threedee.com/jcm/psystem/
[7] see http://java.sun.com/
[8] refer James Gosling, Bill Joy, Guy Steele, "The Java Language
Specification", Addison-Wesley, 1996
[9] refer Tim Lindholm, Frank Yellin, "The Java Virtual Machine
Specification", Addison-Wesley, 1997
[10] ECMA C# specification, Microsoft Corporation, 2001
[11] "The Development of the C language", Dennis M. Ritchie, Second
History of programming Languages conference, Cambridge, Mass. 1993
side bar 1:
Approaches in improving the performance of the intermediate code in
Java/.NET
One very obvious disadvantage of the combination of compilation and
interpretation is the speed factor. Although the code is compiled,
still an interpreter needs to be employed to execute the code and
typically the execution speed is reduced to the factor of 10 to 30
times to the equivalent of the native C/C++ code. There are many
suggested ways to improve this speed to achieve the speed of "native
code". The remaining part of the article explores the approaches that
are made to overcome this problem with Java and .NET in particular.
An interesting approach is to avoid the interpretation at the target
platform and do compilation again. This can avoid the slowness inherent
in the interpretation approach as translation is required each and
every time the code needs to be executed. So an alternative approach of
compiling the intermediate code again into native code to execute can
be thought of. This makes the life of the virtual machine tougher but
can improve the performance and it should be noted that the approach
should make sure that the platform independence is not lost for this
performance gain.
One such approach is Just-In-Time (JIT) compilation. The aim is to
compile the code when the code is called and 'boot-strap' that the
compiled native code be used the next time the code is called. Since,
in general, 90% of the time is spent in executing 10% of the code, this
approach can reap rich dividends. JIT compilers are also referred to as
"load and go" compilers as they don't write the target code into a
file. When a method is called for the first time, it is compiled into
native code and kept in memory that the compiled code may be used next
time the method is called again. Java uses this approach of compilation
on demand extensively and many of commercial JVMs are JITters. Note
that, JVMs are allowed to follow the interpretative approach or JIT and
hence it is an optional feature for performance improvement. For .NET,
the use of JIT is mandatory and thus all MSIL code is JITted to native
code before execution.
Aggressive optimizations are also possible to make that are not
possible with static optimization techniques, since:
i) precious runtime information is available for doing optimization
ii) depending on the situation and user requirement, only particular
parts of the code will be called and doing aggressive optimizations on
that part of the code and bootstrapping it shall improve the
performance considerably.
iii) since optimization is done the frequently called code, it shall
provide improved performance in general than the general optimization
without any profiler information.
However it should be noted that, the performance of JITters is still
not on par with the equivalent C/C++ code, but it is considerably
better than the equivalent interpreted code. Also, in environments like
embedded systems where RAM is precious, it may not be possible to
bootstrap the methods for later use.
Ahead of time recompilers is yet another approach for improving the
performance through native code. A compiler reads the information in
the Java class files and generates a file that contains the original
bytecodes and the native code equivalent for various platforms. The
resultant file is sometime referred to as "fat" class file (refer to
the "fat" binaries in the main text) and depending on the platform, the
native code shall be invoked. This can effectively avoid the necessity
to recompile every time the code needs to be executed and at the same
time not losing portability.
Another approach is Ahead of time compilation. When the file is loaded
into the memory itself, all the code gets compiled into native code.
The advantage with this approach is that the code shall be compiled
before any method is called, as opposed to JIT approach and hence is
sometimes referred to as PreJIT. So, it doesn't suffer the problem of
overhead with JIT compilation. This can thus effectively reduce the
time for start-up and execution, as the code is ready of execution in
form of native code.
So it seems that Ahead of time compiling is a very good alternative for
JIT, but that is not the end of the story. The virtual machine still
needs the meta-data about the classes, for example for supporting
features like reflection. If no meta-data is maintained, then
reflection cannot be supported which would be a serious disadvantage.
Moreover, if the environment or security settings change, then the code
needs to be compiled again. Thus PreJIT is not much promising approach.
.NET supports PreJIT with its support with its NGEN tool in the Visual
Studio.NET.
One of the interesting ideas to overcome the performance problem is
implementing the interpreter in hardware. That is what the JavaChip
intends to do - the bytecodes shall be directly executed by a
microprocessor designed specifically for that. Since Java bytecodes are
not too low level to be implemented at chip level, still a JVM runs
over that, with the difference that the bytecodes are now directly
executed by the hardware. This idea too, is reminiscent of the old idea
in IBM machines that is discussed in the main text.
= = = = = = = = = = = = = = = =
sidebar - 2
C - A retrospect on its evolution and design decision
C is known for its high performance through generation of native code.
Its interesting to note that its predecessor B didn't generate native
code, instead it generated a stack based intermediate code called as
threaded code. When Ritchie wanted to write his operating system with
his new language, he found that it was handicapped to use the
interpreted approach. To overcome this problem, Thomson tried by
offering a 'virtual B' compiler that still had an interpreter, but it
was too slow for practical purposes. So, to make C sufficiently low
level that it can be used to write operating systems and compilers,
Ritchie abandoned threaded code approach and instead generated native
code. So this drastic change in the translation approach itself was
influenced by two main problems with interpretation: the runtime
overhead of interpretation and performance. In retrospect, had C
continued the B's tradition of interpretive approach, would it have
become such a huge success?
Return to the
comp.compilers page.
Search the
comp.compilers archives again.