speeding up C recompilation, was Re: Bison deterministic LALR

BGB <cr88192@hotmail.com>
Tue, 04 Sep 2012 13:45:51 -0500

          From comp.compilers

Related articles
=?UTF-8?Q?Bison_determinis=E2=80=8Btic_LALR=281=29_parser_for_Java=2FC hsad005@gmail.com (2012-08-17)
Re: Bison =?UTF-8?B?ZGV0ZXJtaW5pc+KAi3RpYyBMQUxSKDEpIHBhcnNlciBm?= =?U DrDiettrich1@aol.com (Hans-Peter Diettrich) (2012-08-18)
Re: Bison =?UTF-8?B?ZGV0ZXJtaW5pc+KAi3RpYyBMQUxSKDEpIHBhcnNlciBm?= =?U anton@mips.complang.tuwien.ac.at (2012-08-20)
Re: Bison =?UTF-8?B?ZGV0ZXJtaW5pc+KAi3RpYyBMQUxSKDEpIHBhcnNlciBm?= =?U cr88192@hotmail.com (BGB) (2012-08-21)
=?UTF-8?Q?Re:_Bison_determinis=E2=80=8Btic_LALR=281=29?= =?UTF-8?Q?_pa bc@freeuk.com (BartC) (2012-08-22)
Re: Bison =?UTF-8?B?ZGV0ZXJtaW5pc+KAi3RpYyBMQUxSKDEpIHBhcnNlciBm?= =?U cr88192@hotmail.com (BGB) (2012-08-26)
Bison deterministic LALR parser for Java/C++ bc@freeuk.com (BartC) (2012-08-29)
speeding up C recompilation, was Re: Bison deterministic LALR cr88192@hotmail.com (BGB) (2012-09-04)
| List of all articles for this month |

From: BGB <cr88192@hotmail.com>
Newsgroups: comp.compilers
Date: Tue, 04 Sep 2012 13:45:51 -0500
Organization: albasani.net
References: 12-08-005 12-08-006 12-08-009 12-08-014 12-08-015 12-08-018 12-08-020
Keywords: parse, C
Posted-Date: 04 Sep 2012 18:41:20 EDT

On 8/29/2012 4:03 PM, BartC wrote:
> "BGB" <cr88192@hotmail.com> wrote in message


>> On 8/22/2012 8:04 AM, BartC wrote:
>
>>> But even given all that, there are ways of dealing with huge header
>>> files so
>>> that it is not necessary to repeatedly tokenise and parse the same
>>> headers
>>> over and over again (for recompiling the same module, or compiling many
>>> modules all sharing the same headers).
>>>
>>> I've no idea whether many C compilers actually bother though; perhaps
>>> it's easier to just recommend a faster computer..
>>
>> the problem here is that, although it isn't too hard to figure out
>> possible optimizations, it is much harder to make them work in ways
>> which don't violate the C standard.
>>
>> another issue is that things like precompiled headers are non-standard,
>> and there is no real agreed-on convention for "hey, compiler, feel free
>> to use precompiled headers here".
>
> Why should how a compiler optimises its work violate the standard?
> Provided the end results are the same, the compiler can do what it
> likes.


basically, because the C compilation process is difficult to optimize
while not changing behavior in various ("minor") ways.


a major one is the preprocessor, where:
defines in one header may subsequently cause different expansions in
another;
pretty much anything that can be done via textual substitution is
allowed by the preprocessor;
there are macros and built-in defines which may have time-dependent or
invocation-dependent behaviors;
...


and, the standard requires that all this sort of stuff works as-expected.




> However the C language and the way the C compilers are typically
> invoked (for example just one-time to compile one module, so it's
> forgotten it's compiled the same sets of headers a moment before)
> doesn't make things easy. And it's possible that things such as
> __TIME__ have been used in such a way that you are obliged to
> recompile a header each time.


yeah.


or things as simple as:
#define WIN32_LEAN_AND_MEAN
#include <windows.h>


where basically the presence of this define does itself alter the
behavior and contents of the header.




a partial optimization though is to optimize for the case where all the
modules in a library are compiled in batches, and where all of them use
the same header-lines. then it is possible to preprocess/parse the
headers once, and then reuse them for subsequent modules.


I think I remember at least one compiler doing this (apparently Borland?).


MSVC basically uses the whole "stdafx.h" thing, where they precompile
this and have (sometimes "force") everything to include it.


GCC apparently only allows it for the case of including a single header.




my considered option was basically having a way (in the headers) to tell
the compiler that it is allowed to violate the C standard in certain
ways for sake of optimizing the compilation process.




> So it's easy to see why compilers may not bother!


yeah.


the lack of a good way to approach the problem, and the relative
simplicity of the old/slow route, leaves the old/slow strategy as the
main way most code is compiled.




> Nevertheless, I think there is plenty that can be done, although I'm
> not sure that creating intermediate files such as precompiled headers
> is the way to go. It's better when a compiler is properly integrated
> into an IDE, then the symbol tables built by a set of headers can be
> made persistent much more easily.
>


actually, caching all of the definitions/defines/... resulting from
processing a set of headers can be done (I have done so in the past,
partly as an extension intended to allow faster "eval" of C fragments,
and this is also partly how the FFI for my BGBScript language works).


however, compiling C code using these would itself likely violate the
standard in subtle ways, so likely couldn't be used as a general-purpose
optimization.


the most obvious way to use this in a "moderately standard" way would,
ironically, likely assume a form resembling that of the "stdafx.h" system.




> Alternatively, it might be possible to just have a very faster parser!
> And perhaps integrate the preprocessor so that it is not a separate
> pass (I haven't attempted a C compiler so not sure if that's feasible;
> my own source-level directives are handled by the lexer itself, or
> sometimes by the parser).


I am not entirely sure that this would buy much, but would likely be
more complicated (and using a single pass for preprocess+parse+lexing
could actually make the overall process slower).


a partial issue is that the preprocessor works line-by-line, but a C
parser does not.


combining them would likely lead to a case where there were mismatches
between what the preprocessor has currently expanded, and what the
parser needs to parse a complete statement.


so, it is generally simpler and more effective to expand everything out
prior to passing it off to the parser, but the preprocessor need not
necessarily serialize back to text:
it is possible that the preprocessor spits out basically a large array
of tokens, and the parser works off this (as a separate pass).


note that my parsers have not usually used a separate lexing and parsing
pass, but have usually broken the source-text into tokens "mid-flight",
often using a cache (covering a range of 256 bytes) to avoid reading the
same tokens multiple times.




>>> [I've seen C compilers that keep preparsed versions of headers. Dunno
>>> what they do with #if. Also see Microsoft's C# and other .NET
>>> languages,
>>> that put all of the type info in the objects, so you can use the object
>>> as a compiled include file. -John]
>>
>> AFAIK, the preparsed/precompiled headers for C generally handle #if and
>> #ifdef and similar during the preprocessor as usual. this seems to be a
>> large part of why there are many restrictions on the use of precompiled
>> headers in those compilers which support them.
>>
>> AFAICT, languages like C# delay commands like #if or #ifdef until later
>> (and impose restrictions on how they may be used). IIRC, they are
>> generally handled at linking or at JIT.
>
> With a new language then it's much easier to arrange matters so that
> it's faster and simpler to compile. It might not even have conditional
> directives, or any preprocessor at all; (C needs them because it is a
> cruder, older language; I used to have conditional code, but no longer
> and in any case it seemed an unsatisfactory approach).




well, there is probably a reason why most newer languages have not been
designed this way, but this does not eliminate the problem of most
existing code being written in C or C++ (or the general ineffectiveness
of newer languages to really do well the things that C and C++ do well).




for example, my BGBScript language does not currently use a preprocessor
(I would not entirely oppose the idea, as there are possible uses, but
currently I don't use one, and although the language has an
"ifdef"/"ifndef" construction, it is handled in the threaded-code
translation stage).


however, the language is not itself likely a good C substitute (it
wasn't really designed to fill the same roles as C, and has much more in
common with ECMAScript, although it does borrow some more aspects of C
syntax and semantics).




but, it does have a merit:
even for "relatively large" modules (10-20 kB), it seems to have compile
times in the low-millisecond range, and with a little optimization could
(probably) be pushed into the microsecond range.


(the compiler has generally been fast enough that thus far I have mostly
always been loading modules directly from source).




I have considered ideas for C-like languages could compile much faster
(and potentially also allow running in a VM), but even if they looked
like C and were 99% code-compatible, if they don't follow the C
standards, they aren't really C.


a major idea here being for a language which would look-like and be
"mostly" code-compatible with C99, but would largely change those parts
of the language which impede fast compilation and generating
portable/verifiable bytecode (actually, a lot more modest/subtle than it
may seem, it would be more like a "fake" version of C).


basic ideas: traditional includes are replaced ("#include" would
actually function more like an "import" directive), the preprocessor is
merged into the syntax and no longer works via textual substitution,
declaration parsing would work more like in C#, ...


semantics would involve "reinterpreting" the meaning of a lot of things
(for example, pointer syntax and operations would be retained, but what
they "mean" would actually be replaced with something else, ...). in
some cases, it would likely incorporate many semantics and compilation
strategies used in ECMAScript-like languages as well.


(something like this would likely be an ugly beast from a design POV
though, and one can debate the value of a language which would look like
C but be functionally and semantically somewhat different).




sadly, I don't have a justification for the time/effort such a project
would require, as my main effort is trying to get a marketable 3D game
made, such that I can hopefully have *some* form of personal income...



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.