Re: Parsing partial sentences

Hans-Peter Diettrich <>
Wed, 12 Apr 2017 13:45:41 +0200

          From comp.compilers


From: Hans-Peter Diettrich <>
Newsgroups: comp.compilers
Date: Wed, 12 Apr 2017 13:45:41 +0200
Organization: Compilers Central
References: 17-04-001 17-04-002 17-04-003 17-04-004 17-04-006 17-04-007 17-04-008 17-04-012
Keywords: C, parse
Posted-Date: 14 Apr 2017 18:27:37 EDT

Am 12.04.2017 um 05:05 schrieb George Neuner:
> On Tue, 11 Apr 2017 10:31:30 +0200, Hans-Peter Diettrich
> <> wrote:

> In terms of implementation, bottom-up [state machine] tends to be much
> more efficient at handling languages that have less structure.

Provided that type names are recognized as distinct from other identifiers,
the C language is almost LL(1). I found only one or two exceptions that
require LL(2). In that respect my parser is very efficient.
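As a minimal sketch of what "recognizing type names as distinct" can mean, the lexer may consult a table of typedef names before handing an identifier to the parser. The names below (register_typedef, classify_identifier, the token kinds) are hypothetical and not taken from any real parser, and scoping is ignored for brevity:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, for illustration only. */
enum token_kind { TOK_IDENTIFIER, TOK_TYPE_NAME };

#define MAX_TYPEDEFS 64

static const char *typedef_names[MAX_TYPEDEFS];
static size_t typedef_count = 0;

/* Record a name introduced by a typedef. Returns 1 on success,
   0 if the (fixed-size, scope-less) table is full. */
int register_typedef(const char *name)
{
    if (typedef_count >= MAX_TYPEDEFS)
        return 0;
    typedef_names[typedef_count++] = name;
    return 1;
}

/* Decide which token kind the parser gets to see for an identifier. */
enum token_kind classify_identifier(const char *name)
{
    for (size_t i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TOK_TYPE_NAME;
    return TOK_IDENTIFIER;
}
```

With such a table, a statement like "T * x;" starts with TOK_TYPE_NAME (a declaration) or TOK_IDENTIFIER (an expression), so one token of lookahead is enough to choose the production.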

C++ would be much harder to parse, but its class model is so incompatible
with other languages that a source-level translation is almost impossible.

> C's syntax is (structured but) far more permissive than Pascal's:
> e.g., C allows declarations to be intermixed with other statements,
> the only real requirements being that the declaration be in scope of
> and precede any use of the declared object.

I've already solved this language difference. Some other constructs
require the use of objects and properties, to retain most of the C
"wording" in OPL code.

>>> Building on John's example, consider what you'd do with
>>> # define FOO +
>>> # define BAR + 42
>>> # define BAZ + c /* note 'c' is undefined */
>> None of these can be reduced to constants or functions. A problem
>> may possibly arise for "+ 42", where the '+' could be interpreted as the
>> sign of a constant value.
> The point is that the preprocessor doesn't care about the meanings of
> tokens in the #define body. Modulo parameter substitution, the
> preprocessor simply inlines the #define body at the point of use.
>> In this case the following code
>>> int a, b;
>>> :
>>> a = a FOO b BAR BAZ;
>> would raise a parser error "expecting operator between BAR and BAZ".
> After preprocessing, the C compiler will see
> a = a + b + 42 + c;
> The compiler will complain about 'c' being undeclared, not about a
> missing operator. Try it.

I already mentioned that "+ 42" is ambiguous, and which error a compiler
would report if it tried to interpret it as the value "+42". At the
latest, an error would occur when a use of that macro is parsed, so
that BAR could then be identified as not being a constant.
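The purely textual expansion that George describes can be checked with a small sketch. The function name expand_demo and the local declaration of 'c' are assumptions added here so that the expansion compiles; the three macros are those from the quoted example:

```c
#define FOO +
#define BAR + 42
#define BAZ + c   /* 'c' is looked up only where BAZ is used */

/* Hypothetical wrapper around the statement from the example. */
int expand_demo(int a, int b)
{
    int c = 1;              /* assumption: declared so the code compiles */
    a = a FOO b BAR BAZ;    /* after preprocessing: a = a + b + 42 + c; */
    return a;
}
```

Here "+ 42" ends up as a binary operator followed by an operand, not as the signed constant "+42", which is why BAR cannot be classified as a constant from its body alone.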

At the moment I have no specific ideas on how unbound identifiers could
be handled when transforming macros into functions. Even bound
identifiers (macro arguments) cannot be handled easily, not even in a
textual macro expansion.

>> A more practical example would be windows.h. I expect much more than 50%
>> named constants in it, which could be detected and converted easily.
>> Then this automated handling could reduce the manual inspection and
>> classification of many hundreds (thousands?) of #defines to a few
>> unhandled or not easily translatable macros.
> In the case of Windows: 10s of thousands, counting all the nested
> includes.

This was the point at which I gave up improving ToPas, many years ago. Now
I'm considering another attempt to add more automation to it, for
handling at least literal constants and, possibly, constant expressions.
I'd be happy if this helped to reduce the number of unclassified #defines
in windows.h to an amount that could be handled manually in reasonable
time. The same goes for the GNU (gcc) headers. I have to catch up with
how the headers have evolved since then, and find out how
translation-friendly they are nowadays.
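A first step of such a classification could look like the following sketch, which accepts a #define body only if it is exactly one unsigned integer literal. This is not how ToPas works; the function name is_integer_literal is hypothetical, and literal suffixes such as "42L" or "10UL" are deliberately not handled:

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Returns 1 and stores the value if 'body' consists of exactly one
   integer literal (decimal, octal or hex), optionally surrounded by
   whitespace. Returns 0 for anything else, including "+ 42". */
int is_integer_literal(const char *body, long *value)
{
    char *end;
    long v;

    while (isspace((unsigned char)*body))
        body++;
    /* A leading sign or letter means "expression", not "literal". */
    if (!isdigit((unsigned char)*body))
        return 0;

    errno = 0;
    v = strtol(body, &end, 0);      /* base 0: accepts 42, 052, 0x2A */
    if (end == body || errno == ERANGE)
        return 0;

    while (isspace((unsigned char)*end))
        end++;
    if (*end != '\0')               /* trailing tokens: not a plain literal */
        return 0;

    *value = v;
    return 1;
}
```

Bodies rejected by this test would fall through to the next stage, for example a constant-expression evaluator, or end up in the list for manual inspection.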

> Macro substitution variables, if any, are local to the #define body -
> they are not free references to C variables. So if you see something
> like
> # define dumb(x) (x << y)
> then you have a bound parameter 'x', a free variable 'y', and an
> ambiguous expression of unknown types which must *not* cause an error
> during parsing of the macro, but *must* cause an error if the fully
> expanded and substituted expression is invalid AT THE POINT OF USE.
> You rarely will have any clue regarding the types of variables that
> appear in a macro. In the example above, 'x' is a purely mathematical
> variable, and the type of 'y' is unknown. In context at the point of
> substitution, when the types finally become known, the '<<' operator
> may not even be valid.

You are perfectly right. I don't think that code snippets with unbound
identifiers should ever be subject to a transformation into functions.
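A short sketch shows why: the free identifier 'y' in George's macro is resolved separately at every point of use, so no single function signature captures it. The two function names below are hypothetical:

```c
#define dumb(x) (x << y)   /* 'x' is bound, 'y' is free */

/* Here 'y' refers to a local variable with the value 3. */
int use_with_y3(int v)
{
    int y = 3;
    return dumb(v);        /* expands to (v << y) */
}

/* Same macro, but a different 'y' is in scope. */
int use_with_y1(int v)
{
    int y = 1;
    return dumb(v);
}
```

A function dumb(int x) would have to receive 'y' from somewhere, and neither its type nor its location is known when the #define is parsed.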

Possibly a special kind of function inlining could be implemented
that makes it possible to prove the correctness of assumptions about the
location of all identifiers used in later macro expansions and in calls
of the constructed functions. But that lies in the far future...

>> Our esteemed moderator wrote:
>> #define FOO a + b
>> d = FOO * c;
>> be sure that your preparsed FOO doesn't expand into (a+b)*c. That
>> may well be what the programmer meant, but it's not what she said.

Well, this might be a case where automatic source code analysis could
reveal possible coding bugs. Remember how many parentheses are (or are
not) added around macro arguments, just to prevent such unintended
interpretations. This is another reason why I don't like C, to put it
mildly <BG>. My dream is an automated source code translation system that
makes it possible to safely translate and get rid of legacy C code, so
that further software development could be done in some better (safer,
faster...) programming language...
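The difference the moderator points out can be made visible with a small sketch. FOO is the macro from his example; FOO_PAREN and the two function names are hypothetical additions for comparison:

```c
#define FOO a + b
#define FOO_PAREN (a + b)   /* hypothetical parenthesized variant */

/* What the programmer said: FOO * c expands to a + b * c. */
int as_written(int a, int b, int c)
{
    return FOO * c;
}

/* What the programmer may have meant: (a + b) * c. */
int as_meant(int a, int b, int c)
{
    return FOO_PAREN * c;
}
```

An analysis tool could flag macro bodies whose top-level operator binds more loosely than an operator adjacent to a use of the macro, but a translator still has to reproduce the result of as_written.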

