Re: Show line numbers in diagnostics for a scripting language - how can this be done?


From: George Neuner <gneuner2@comcast.net>
Newsgroups: comp.compilers
Date: Tue, 09 Nov 2010 17:20:15 -0500
Organization: A noiseless patient Spider
References: 10-10-038 10-11-005 10-11-013
Keywords: code, debug, interpreter
Posted-Date: 09 Nov 2010 18:10:42 EST

On Fri, 05 Nov 2010 16:33 +0100, "Johannes Schaub (litb)"
<schaub-johannes@web.de> wrote:


>George Neuner wrote:
>
>> Rearranged a bit for clarity,
>> On Fri, 29 Oct 2010 17:22:25 +0200, "Johannes Schaub (litb)"
>> <schaub-johannes@web.de> wrote:
>>
>>>The runtime library will check whether both passed values could be added
>>>(for example a string and an int can be added, by converting the int to a
>>>string first). If the values can't be added, it will emit a diagnostic.
>>>
>>> ; allocate three values. Two for the operands and one for the result.
>>> ; all three are of type "myvalue".
>>> %num1 = alloca %myvalue
>>> %num2 = alloca %myvalue
>>> %result = alloca %myvalue
>>> ; evaluates 42 + 33. first store the values into the stack
>>> store %myvalue { %type* @inttype, i32 42 }, %myvalue* %num1
>>> store %myvalue { %type* @inttype, i32 33 }, %myvalue* %num2
>>> ; then let the runtime lib calculate it
>>> call void @rtEmitAdd(%myvalue* %num1,
>>> %myvalue* %num2,
>>> %myvalue* %result)
>>>
>>>My problem is now - in the AST, I know what nodes correspond to what
>>>source lines and even what source columns. But I want to display that
>>>information (or at least the line-number information) in the diagnostic
>>>too.
>>
>> It isn't clear to me whether your intent is to provide a compiler
>> diagnostic or a program diagnostic. I'm asking because the function
>> in your example - @rtEmitAdd() - is named as if it is meant to be a
>> code generator.
>
>I'm sorry for the confusion. My primitive functions seem to be badly named.
>My intent is to provide program diagnostics - "rtEmitAdd" should really be
>called "evalAdd", I think. My implementation can be used for both native
>code generation and JITed execution (using LLVM).
>
>> I (think I) understand that your data is to be latently typed and so
>> operations must check types for compatibility ... but what isn't clear
>> is whether your function @rtEmitAdd() :
>> - validates arguments and performs the add,
>> (i.e. the function itself is a language primitive)
>> - generates a threaded call to a library subroutine primitive
>> - generates inline checked code which will both validate arguments
>> and perform the add
>> - something else I haven't thought of?
>
>The function should validate arguments and then perform the add, storing the
>result in the passed stack slot address (Some background: one "myvalue"
>object is approx. 24 bytes long - storing ints and floats in-place and
>keeping a pointer to data for arrays and strings. Arrays and strings are
>ref-counted. Later, a value can have a unit, so space is reserved for a unit
>already).
>
>I wasn't sure whether it's a good idea to have actual functions for solving
>stuff like a single multiplication. But it seems from your description other
>language implementations do that as well. So I think I'm not all wrong about
>my approach?
>
>I don't understand the "generates a threaded call to a library subroutine
>primitive". I probably misunderstand it - has it anything to do with the OS-
>level threads? (i.e parallel execution? Makes little sense to me).


Sorry. The term "threaded" is overloaded ... I should not have
expected you to know it. Historically "threading" referred to passing
data through a series of massaging functions - a process which was
likened to "stringing" beads. Although the "stringing" notion implies
an order of execution and thus a "control path", in the early days
almost all programming was serial (same as today 8-) and so the
control path meaning wasn't much discussed until CSP (Hoare 1978) made
people more generally aware of multitasking.


There are a number of variants of threaded code, but in the most
basic version there is a runtime library containing subroutine
"primitives" which implement the basic operations of the language.
User programs and higher level library functions are compiled into a
series of calls to the primitives.
see http://en.wikipedia.org/wiki/Threaded_code
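
Very roughly, in C, the subroutine-threaded flavor looks like this.  The
primitive names and the little value stack are made up just for
illustration - your runtime passes explicit operand/result slots instead -
but the point is that the "compiled" user program is nothing but calls
into the runtime's primitives:

      #include <stdio.h>

      /* A tiny runtime library of "primitives" (hypothetical names).
         In a real runtime, rt_add would type-check its operands first. */
      static long stack[64];
      static int  sp = 0;

      static void rt_push_int(long v) { stack[sp++] = v; }
      static void rt_add(void)  { long b = stack[--sp]; stack[sp-1] += b; }
      static void rt_print(void){ printf("%ld\n", stack[--sp]); }

      /* The threaded compilation of the statement  print 42 + 33;  */
      int main(void)
      {
          rt_push_int(42);
          rt_push_int(33);
          rt_add();
          rt_print();
          return 0;
      }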




>In my implementation, I use runtime library functions for doing additions
>and such a lot (like rtEvalAdd, rtEvalSub, rtEvalSubscript, etc..).


In most threaded code, a primitive call is akin to invoking an
operator and so, from the viewpoint of the user program, primitive
calls may be considered inline code rather than function calls. This
varies by language tradition: Lisp and Forth consider [nearly]
everything to be a function call whereas threaded Algol, COBOL, BASIC,
etc. implementations typically distinguished between primitive
"operator" calls and calls of compiled user or library functions.


What initially confused me was your function naming scheme in
combination with your comments about passing/embedding source
locations vs stack walking - the issue being whether the function
calls you were trying to locate actually were part of the compiler or
part of the user program.




>> In natively compiled Lisps, primitives generally have both checked
>> (slow) and unchecked (fast) entry points. Optimizing compilers
>> perform type tracking and code path analyses to determine where
>> unchecked primitives can safely be used.
>>
>
>Ah, in my compiler's compile-time semantic analysis, I already gain lots of
>type information. For example, it knows that "(1, 1) * 1.1" has type
>"(float, float)" (1 * 1.1 -> float, 1 * 1 would be int). It does not do any
>elaborated analysis yet, though (mostly because I've no clue how to write
>them xD). I decided to tackle on that later on, perhaps doing it in a way
>that makes use of LLVM's built-in analysis functionality.


I mentioned this because your example showed all the values as being
of type "%myvalue" and it looked to me as if you might be using latent
data typing in a manner similar to that in Lisp.


In Lisp, values have types but variables are untyped bindings which
assume the type of the currently bound value [not literally true but
close enough for discussion.] Because a variable may be bound to a
value of any type, functions generally must verify the types of their
arguments before using them and every type (including user defined)
has a predicate (test function) to identify it. But there are many
cases where a type has already been determined and can be relied upon
so long as the program remains on a particular path.


The following is adapted from examples in Olin Shivers's paper
"Data-flow Analysis and Type Recovery in Scheme".


1  (let* ( (a (car b))
2          (q (char->integer (string-ref a i))) )
3      :
4      (set! c (cdr b))
5      (if (pair? c)
6          (string-set! a i (car c))
7          (string-set! a i c))
8      :
9      (+ q 2))


In line 1 above, the CAR (first) function must test its argument 'b'
to ensure it is a list pair. Barring a rebinding of 'b', having
passed the test in CAR, the CDR (rest) function in line 4 can assume
'b' is a list pair and does not need to test it again.


Similarly, in line 2, STRING-REF requires a string and an integer
index and CHAR->INTEGER takes a character and returns an integer. So
if 'q' is successfully bound, 'a' can be assumed to be a string and
both 'q' and 'i' can be assumed to be integers.


The SET! in line 4 rebinds 'c', invalidating anything known about the
type of 'c' up to that point.


The STRING-SET! calls in lines 6 and 7 can assume 'a' and 'i' are
valid, but both must test if their 3rd argument is a character.
However, if the test in line 5 succeeds, then 'c' is known to be a
list pair in the "true" branch of the IF and so the CAR in line 6 does
not need to test 'c' again.


And finally, the + operator in line 9 does not need to test if 'q' is
a number.


These kinds of type tracking path analyses are performed by optimizing
Lisp compilers to try to remove as many redundant type checks as
possible while still creating "safe" code.




>> Optimizing compilers typically do generate a fair amount of inline
>> code - mainly for argument passing and for very common operations. For
>> example, a basic 2 operand arithmetic operator may be inline coded for
>> the assumption that the arguments will be (register sized) small
>> integers. If, at runtime, the inline type check reveals something
>> other than small integers, the code will call the general subroutine
>> that handles other data types and/or errors.
>
>Ah, I will take this into account. Though I wonder if I do this "check for
>simple same-types and do the operation inlined, or otherwise pass on to the
>general subroutine", whether it wouldn't yield to very large generated code?
>I know only very little about x86, so I'm not sure whether this "inline-or-
>call-generic-function" really pays off compared to "always-call-generic-
>function". Especially since I plan to later link the object file containing
>all the general functions (like rtEvalAdd) to the program's object file and
>doing link time optimizations when generating native code.


The answer is "it depends" ... on the CPU, and on whether and how
immediate values are tagged.


Small integer arithmetic is the thing most often inlined because most
program integer values are small and array indexing depends on them.


Lisp tags immediate values - small integers, characters, pointers,
etc. - by stealing a few bits. This reduces the range of pointers and
small integers slightly - not really an issue with 64-bit registers -
and Lisp will seamlessly switch to using multiword "big" integers if a
small integer overflows.


The details of tagging and tag checking vary by implementation.


One common scheme is to use 2 or 3 low bits for the tag and use a tag
of zero to represent integers. This shifts the value upward by the
number of tag bits but does not otherwise change the value. With this
scheme, addition, subtraction and modulus (remainder) just work;
multiplication and division each require one extra shift.


So assuming 3 bit tags:


          a = ? << 3
          b = ? << 3


          r = a + b
          r = a - b
          r = ((a >> 3) * b)  OR  (a * (b >> 3))
          r = (a / b) << 3
          r = a % b


The integer type check is ((a | b) & 0x7) == 0.
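
Putting the check and the fast path together, here is roughly what the
inline-or-call-generic pattern looks like in C, still assuming 3-bit
tags with 0 meaning small integer (doMult is the generic routine called
in the x86 version below; overflow handling is omitted):

      typedef long tagged;                /* low 3 bits = tag, 0 = small int */

      tagged doMult(tagged a, tagged b);  /* generic routine: other types,
                                             errors, bignum promotion, ...  */

      static tagged checked_mul(tagged a, tagged b)
      {
          if (((a | b) & 0x7) == 0)       /* both tagged as small integers? */
              return (a >> 3) * b;        /* untag one operand; the result
                                             is already correctly tagged    */
          return doMult(a, b);            /* everything else: slow path     */
      }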


Ignoring how operands get loaded and the result gets stored, the
inline multiplication sequence in x86 looks something like:


              ; assumes EAX = a EBX = b
              ; leaves result in EAX


              mov ECX,EAX ; copy of a


              or ECX,EBX ; check tags
              and ECX,0x7
              jnz generic


              sar EAX,3 ; untag one operand
              imul EBX ; signed multiply
              jmp done


generic:
              call doMult ; EAX = a EBX = b
done:




I'm not an x86 whiz, but if I've figured this correctly, it looks like
7 instructions to do a multiply ... that's what happens when you don't
have static types. The code would be longer if the type tags were
stored separately from the values.


The problem with x86 compared to other chips is that it has
instructions that require certain dedicated registers, and that it uses
2-address code, which forces most instructions to overwrite one of
their operands - forcing a copy to be made if the operand needs to be
preserved. This particular sequence isn't bad - only one copy is
necessary - but if the operands were in other registers, or the
operation were more complex, more contortions might be necessary.




>> In GC'd languages it is fairly common for functions to have an
>> associated map or special purpose GC subroutine that enable the
>> collector to find pointers while the function is in scope. One simple
>> method for doing this is to embed a pointer to the thing directly
>> after the first instruction (which is an unconditional jump over the
>> pointer). This always puts the thing at a known offset from the
>> starting address of its associated function. Another, slightly more
>> complicated, method is to construct a hash table which maps function
>> start addresses to their associated thingamies.
>>
>
>I don't understand this. What pointers will the GC map and find?


The garbage collector needs to be able to identify any and all
pointers into the heap so that it doesn't accidentally recycle memory
that still is in use. But as the program is running, some of those
pointers may be in local variables on the stack or even in registers.
One way to identify pointers is to tag them ... and handle them in
similar ways to the integers we discussed above, though obviously a
different tag value would be used so as to distinguish pointers from
other data.


A second way is to provide some kind of a map or descriptor indicating
where pointers are to be found embedded within a larger object, such
as a structure or a stack frame. In OO languages, objects routinely
have embedded in them a pointer to class information - which often
includes a description of the physical layout of the object.
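
As a sketch of that second approach in C (the types and field names
here are invented, purely for illustration), the class information can
simply record which fields of an object hold heap pointers:

      /* Hypothetical class descriptor: which fields are pointers. */
      struct class_info {
          int npointers;
          int pointer_field[4];         /* indices of pointer-valued fields */
      };

      struct object {
          struct class_info *klass;     /* every object carries a class ptr */
          void              *field[4];  /* payload, simplified              */
      };

      /* GC side: visit each pointer field of one object via its descriptor. */
      static void scan_object(struct object *obj, void (*visit)(void **slot))
      {
          for (int i = 0; i < obj->klass->npointers; i++)
              visit(&obj->field[obj->klass->pointer_field[i]]);
      }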


The third way to identify pointers is to provide a function that
simply knows where they are. You can do this whenever an object's
physical layout can be fixed at compile time - which can be done with
a function's stack frame. When the compiler lays out a function's
stack frame, it can generate a companion "finder" function which knows
precisely which slots in the frame contain pointers.
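
And a sketch of the third approach (names again invented): for each
compiled function the compiler emits a companion "finder" that hands
the collector every pointer slot in that function's frame:

      /* Compiler-generated companion for one particular function.  The
         frame layout was fixed at compile time, so the finder can
         hard-code which slots hold pointers (the slot numbers here are
         purely illustrative). */
      static void myFunction_find_pointers(void **frame_base,
                                           void (*visit)(void **slot))
      {
          visit(&frame_base[1]);        /* a local holding a heap pointer */
          visit(&frame_base[3]);        /* another pointer-valued slot    */
      }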


[Register based pointers are trickier to deal with. For CPUs with
lots of general registers, often a certain block of registers can be
dedicated to holding pointers that the GC needs to know about. On
register starved insane architectures like x86(-64), GC'd pointers
pretty much have to be made available on the stack.]


How the GC would use a pointer map or interact with a "finder"
function is beyond the scope of this discussion ... what is important
here is simply how the GC locates the ones to use given the particular
state of the stack.


In the context of a particular stack frame, the obvious way to
identify which function constructed it is to:
      - use the saved return address to locate the call site
      - retrieve the function's address from the call instruction


Given the address of a function, you now need a way to locate any
items associated with it. As I said previously, one good way to make
the association is to embed the items or pointers to them immediately
following the start of the function. Your compiled function then
looks like this:


MyFunction:
              jmp go

.DATA                 ; this is a block of data which could be
                      ; pointers to other functions or maps or
                      ; anything at all.

go:           ...
              :


If you arrange that every compiled function has the same layout - in
particular always using the same type of jump instruction so that you
know its size - then given any function's starting address you can
find its associated items by adding a constant offset.
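
In C-ish terms the lookup is then just pointer arithmetic (a sketch -
JMP_SIZE here is whatever the size of your fixed entry jump encoding
happens to be, it is not a universal constant):

      #define JMP_SIZE 5                /* assumed size of the entry jump */

      /* Given a function's starting address (recovered from the call
         site), return the data block embedded right after the jump. */
      static void *function_data(void *fn_start)
      {
          return (char *)fn_start + JMP_SIZE;
      }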


Another way to make the association is to create a table in which the
GC (or whatever) can look up items given the function's starting
address. This is more complicated than the direct embedding method
because the compiler must not only create the maps/finders/data blocks
etc., but then must also construct a lookup table - or code to create
one at runtime - which maps function addresses to the associated
items.
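
A minimal sketch of the table variant (again, the names are mine): the
compiler emits one entry per function, and the GC looks the frame's
function address up in it:

      /* Table built by the compiler, or at program startup: one entry
         per compiled function, mapping its start address to its GC data. */
      struct fn_entry {
          void *fn_start;               /* function's starting address        */
          void *gc_data;                /* associated map / finder / whatever */
      };

      /* Linear search for clarity; a real table would be hashed or kept
         sorted by address for binary search. */
      static void *lookup_gc_data(const struct fn_entry *tab, int n,
                                  void *fn_start)
      {
          for (int i = 0; i < n; i++)
              if (tab[i].fn_start == fn_start)
                  return tab[i].gc_data;
          return 0;                     /* not a compiled function we know */
      }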




>Thanks for the help so far! These points contain good insight.


I'm happy to help. I just hope I'm not confusing you more. Mixing
multiple trains of thought in a single post is sort of inevitable as a
good discussion gets rolling, but sometimes they get too hard to
follow and have to be broken out separately. Often it's hard to know if
you're reaching that point before you've already gone past it.


George


