Re: Show line numbers in diagnostics for a scripting language - how can this be done?

George Neuner <gneuner2@comcast.net>
Tue, 02 Nov 2010 17:18:11 -0400

          From comp.compilers

Related articles
Show line numbers in diagnostics for a scripting language - how can th schaub-johannes@web.de (Johannes Schaub \(litb\)) (2010-10-29)
Re: Show line numbers in diagnostics for a scripting language - how ca idbaxter@semdesigns.com (Ira Baxter) (2010-11-01)
Re: Show line numbers in diagnostics for a scripting language - how ca gneuner2@comcast.net (George Neuner) (2010-11-02)
Re: Show line numbers in diagnostics for a scripting language - how ca schaub-johannes@web.de (Johannes Schaub \(litb\)) (2010-11-15)
Re: Show line numbers in diagnostics for a scripting language - how ca bc@freeuk.com (BartC) (2010-11-06)
Re: Show line numbers in diagnostics for a scripting language - how ca gneuner2@comcast.net (George Neuner) (2010-11-09)
| List of all articles for this month |

From: George Neuner <gneuner2@comcast.net>
Newsgroups: comp.compilers
Date: Tue, 02 Nov 2010 17:18:11 -0400
Organization: A noiseless patient Spider
References: 10-10-038
Keywords: code, debug
Posted-Date: 03 Nov 2010 23:44:25 EDT

Rearranged a bit for clarity,
On Fri, 29 Oct 2010 17:22:25 +0200, "Johannes Schaub (litb)" <schaub-johannes@web.de> wrote:




>I'm using LLVM, and I'm writing a backend for a closed-source compiler of a
>language.
> :
>The runtime library will check whether both passed values could be added
>(for example a string and an int can be added, by converting the int to a
>string first). If the values can't be added, it will emit a diagnostic.
>
> ; allocate three values. Two for the operands and one for the result.
> ; all three are of type "myvalue".
> %num1 = alloca %myvalue
> %num2 = alloca %myvalue
> %result = alloca %myvalue
> ; evaluates 42 + 33. first store the values into the stack
> store %myvalue { %type* @inttype, i32 42 }, %myvalue* %num1
> store %myvalue { %type* @inttype, i32 33 }, %myvalue* %num2
> ; then let the runtime lib calculate it
> call void @rtEmitAdd(%myvalue* %num1,
> %myvalue* %num2,
> %myvalue* %result)
>
>My problem is now - in the AST, I know what nodes correspond to what source
>lines and even what source columns. But I want to display that information
>(or at least the line-number information) in the diagnostic too.
>
>What are the usual ways to solve this? I have thought abou this, and one way
>could be to pass the line number along to the runtime functions like the
>following
>
> call void @rtEmitAdd(i32 3, ; appeared in line 3
> %myvalue* %num1,
> %myvalue* %num2,
> %myvalue* %result)
>
>But this is not satisfactory, because I want to not pay the cost of pushin
>the integer along, just for the few cases where I would need to diagnose an
>error. Then I thought about letting the runtime library's direct entry point
>(i.e in the above case, it would be the function "rtEmitAdd") using the
>built-in (GCC has this one) __builtin_return_adress to get the PC of the
>call instruction, and then search debug-information for the corresponding
>line number. This seems like a bit of work to realize.
>
>I wonder now - how is this generally solved for such languages? Thanks in
>advance!
>[There's no generally satisfactory approach. One thing I've done is
>to embed the line numbers and routine names in no-op instructions
>after each call, so I can get the info via a stack traceback, but not
>affect runtime very much. Searching the debug symbols is not unreasonable;
>it's slow, but it doesn't matter since it only needs to be fast enough to
>display the results to a human user. -John]




It isn't clear to me whether your intent is to provide a compiler
diagnostic or a program diagnostic. I'm asking because the function
in your example - @rtEmitAdd() - is named as if it is meant to be a
code generator.


I (think I) understand that your data is to be latently typed and so
operations must check types for compatibility ... but what isn't clear
is whether your function @rtEmitAdd() :
    - validates arguments and performs the add,
          (i.e. the function itself is a language primitive)
    - generates an threaded call to a library subroutine primitive
    - generates inline checked code which will both validate arguments
          and perform the add
    - something else I haven't thought of?


As you can imagine, the differences among these options greatly affect
what is sensible to do to enable debugging.


From the compiler's perspective a function is being called ... but if
one assumes code generation, then from the perspective of the compiled
program there eventually could be anything from a single CPU
instruction to a small inline subroutine. In particular, the result
may represent only part of a complete language level expression.


It doesn't make a lot of sense to try instrument a program at the
instruction level ... even when the "instructions" really are threaded
calls to subroutine language primitives.




As John said, there is no good way - and certainly no portable way -
to figure out where you are in a program ... many CPUs don't provide
an easy way to get the value of the instruction pointer without
executing the moral equivalent of a subroutine call. As you
surmised, the simplest way to find where you are is to execute a
function call (or, alternatively, a software interrupt), use the saved
return address to locate the call site and go from there.


But this brings me back to my original question. If @rtEmitAdd() is
some kind of code generator function, then by locating its call site
you are debugging your compiler. If, OTOH, your intent is to provide
debugging for the user's program, then locating a call to a compiler
function won't help at all.




This is just my opinion, but I think embedding information at a
function call site really is only a viable option when function calls
are relatively rare as compared to inline code. In situations where
functions calls are very common - such as threaded code - or are
likely to be inlined, the routine instrumenting of call (or expansion)
sites can quickly overwhelm the code.


For threaded or inline code, it makes more sense to construct a table
mapping instruction address ranges to source level expressions. For
simplicity, it helps to consider an instruction's address to be the
address of the following instruction (as if each instruction were a
function call). For threaded code, which actually *is* a function
call, an error handler can be passed the return address from the
primitive. For inline code, a called error handler can directly use
its own return address (even if the handler never will actually return
to that spot).






Following are a few points you likely already are aware of, but just
bear with me for a moment.


For languages with latent types - e.g., Lisp - where in general it is
necessary to perform runtime type checks, a common implementation
technique is to compile the user program to (mostly) threaded code and
provide a library of primitive subroutines.


In natively compiled Lisps, primitives generally have both checked
(slow) and unchecked (fast) entry points. Optimizing compilers
perform type tracking and code path analyses to determine where
unchecked primitives can safely be used.


Optimizing compilers typically do generate a fair amount of inline
code - mainly for argument passing and for very common operations. For
example, a basic 2 operand arithmetic operator may be inline coded for
the assumption that the arguments will be (register sized) small
integers. If, at runtime, the inline type check reveals something
other than small integers, the code will call the general subroutine
that handles other data types and/or errors.


In GC'd languages it is fairly common for functions to have an
associated map or special purpose GC subroutine that enable the
collector to find pointers while the function is in scope. One simple
method for doing this is to embed a pointer to the thing directly
after the first instruction (which is an unconditional jump over the
pointer). This always puts the thing at a known offset from the
starting address of its associated function. Another, slightly more
complicated, method is to construct a hash table which maps function
start addresses to their associated thingamies.






If you do decide to try embedding data at call sites - particularly
data that is just for debugging - I would try as much as possible to
compact and minimize it. Debugging isn't time critical so it should
be acceptable that there is some decoding effort. For example, if the
compilation unit is a single source file, I would not bother with the
column (if known) and restrict the line number to 16-bits (almost
always more than adequate). Similarly, if a "module" is built from
multiple source files, I would collect the names into a table and pack
a table index and line number into a single value.




Right now I'm still a bit confused as to what you really are trying to
do. If you can better explain it, I might be of more help (or not 8-)
George


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.