Re: language twiddling, was Infinite look ahead required by C++?

Chris F Clark <cfc@shell01.TheWorld.com>
Mon, 01 Mar 2010 02:43:06 -0500

From comp.compilers

Related articles
Infinite look ahead required by C++? ng2010@att.invalid (ng2010) (2010-02-05)
Re: Infinite look ahead required by C++? sh006d3592@blueyonder.co.uk (Stephen Horne) (2010-02-09)
Re: Infinite look ahead required by C++? ng2010@att.net (ng2010) (2010-02-23)
Re: Infinite look ahead required by C++? cfc@shell01.TheWorld.com (Chris F Clark) (2010-02-27)
Re: Infinite look ahead required by C++? bartc@freeuk.com (bartc) (2010-02-28)
*Re: language twiddling, was Infinite look ahead required by C++? cfc@shell01.TheWorld.com (Chris F Clark)* (2010-03-01)**
Re: language twiddling, was Infinite look ahead required by C++? gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-03-03)
Re: language twiddling, was Infinite look ahead required by C++? bobduff@shell01.TheWorld.com (Robert A Duff) (2010-03-05)
Re: language twiddling, was Infinite look ahead required by C++? bobduff@shell01.TheWorld.com (Robert A Duff) (2010-03-05)
Re: language twiddling, was Infinite look ahead required by C++? cfc@shell01.TheWorld.com (Chris F Clark) (2010-03-07)
Re: language twiddling, was Infinite look ahead required by C++? bartc@freeuk.com (bartc) (2010-03-08)
Re: language twiddling, was Infinite look ahead required by C++? cfc@shell01.TheWorld.com (Chris F Clark) (2010-03-10)
[6 later articles]

| List of all articles for this month |

From:	Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups:	comp.compilers
Date:	Mon, 01 Mar 2010 02:43:06 -0500
Organization:	The World Public Access UNIX, Brookline, MA
References:	10-02-024 10-02-039 10-02-086 10-02-088 10-03-003
Keywords:	parse, design
Posted-Date:	03 Mar 2010 01:15:56 EST

"bartc" <bartc@freeuk.com> writes:

> "Chris F Clark" <cfc@shell01.TheWorld.com> wrote in message
>>
>>> // var declaration
>>> var mystruct x
>>
>> If you are going to go that route, follow the footsteps of a languqage
>> that was designed to be (relatively} easily parseable, Pascal.
>> Seriously, Pascal was designed to be almost entirely LL(1).
>> Syntactically, there are relatively few goofs in Pascal. Playing
>> around trying to re-invent that on your own is a recipe for failure.
>
> Adding one keyword to solve a specific problem sounds more reasonable
> than redesigning a language and compiler...

That would potentially be fine if a keyword was likely to solve the
problems, but in this case, they won't, and the other person won't
realize they won't until much further down the line.

The basic idea of reserved word (i.e. a keyword you can't use for
anything else) specifying a declaration followed by a list of
identifiers being declared is a sound one and is used in the languages
leading upto C. Using list of specifiers in declarations works ok
also, as in "static int x". However, if you allow user defined
identifiers in those locations, then you need to follow C's method of
declaring those identifiers before their use and making the lexer turn
them into special tokens, or you need some other special syntax to key
off of. Adding a new keyword the introduces variable declarations
will not be sufficient, not unless you are going to severely restrict
the syntax of the declarations.

The reason I recommended the person consider Pascal is that it has all
those problems solved and solved simply. C is not a particularly easy
language to parse. If you go trying to extend C, it is very easy to
make the language completely unparseable. A few tiny bandaids on C
don't resolve those issues.

Moreover, if you look at C, you will see that they had the struct
keyword before the type name in the struct case to try to avoid that
problem. A typedef name was a different story, and to make that work
they needed the declaration-before-use-have-the-lexer-change-the-
token-type solution. C++ came along and munged the two together and
made everything messier (from a parsing point of view).

Unfortunately, so many people these days know only C derived languages
and they don't understand other options, nor the issues involved.

> I had pretty much the same problem (where I didn't know that
> 'mystruct' for example was a type name and therefore the start of a
> type-spec), I had to resort to some immediate name resolution (and
> always have the usertypes declared in advance) instead of leaving that
> until a later pass.

If you are following the C paradigm, that is one of the few solutions
available to you.

> A related problem was with the ambiguity between:
>
> [10]int x #an array declaration, and :
> [10] #a set constructor in an expression
>
> Sometimes, array declarations can occur in expressions (type conversions for
> example), so I introduced an 'array' keyword:
>
> a := array[]int(y)

That use of a keyword works, at least as long as you don't have an
arbitrary list of things after it, and it doesn't sound like you do,
just a [set of?] subscript clauses. That's different from the
preceding case, because you don't have arbitrary identifiers
representing subscripts, you have [optional? sets of?] subscript
expressions enclosed in brackets. The brackets give you nice
boundaries and only there use as something else makes them ambiguous.
I assume that the subscript brackets are always followed by a type
name (even if that type name is a user defined identifier) and never
by a set constructor. As long as that's true, you will probably be
ok.

> which however is optional and only needed in rare circumstances.

Making it optional is making it risky. It also makes it harder to
extend later, because some case that is unambiguous today (and thus
doesn't require an "array") may become ambiguous later when you add
some other feature.

>> Sticking a var keyword on the beginning of C declarations, that
>> declare a variable is only a poor bandage that doesn't solve many
>> (much less most) of the ambiguities in C.
>
> When C lets you write:
>
> static const long long unsigned int x
>
> then a 'var' keyword would get lost in the noise.. And I think C requires a
> 'struct' keyword when declaring a struct variable, so that is little
> different from requiring 'var' when declaring a variable of a user type
> (assuming 'var' would be optional otherwise).

That would be fine if the var option actually cleared up any important
ambiguities. However, it doesn't. The key issues tend to be between
parenthesized declarators, function parameters lists, and constructor
calls. Those three things all have parenthesized (lists of) things
whose meanings all often depend on which names are type names and
which are not. The person is attacking the wrong problem and adding a
solution that is unneeded and unhelpful and simply makes things more
complex.

Moreover, if you make the var optional when it "isn't needed", to be
more compatible with normal C programs, you will find that you haven't
even solved that problem. If you don't make the var optional, you
lose the beauty of user defined types being symmetric with built-in
types, although that symmetry isn't perfect to begin with. If you
require the var in all cases, user-defined or built-in type, are you
sure you have reduced the ambiguity and there aren't still cases after
the var when you need to know whetheran identifier is a user-defined
type or not to process the text after the var properly? Moreover, if
you always require the var, why not change the declarations to
something known to be unambiguous, your variable declarations are
strictly C compatible anyway?

>> You will need a func keyword to introduce function declarations
>> also. Perhaps a type keyword to introduce type declarations.
>
> Like 'typedef'? Type and function declarations aren't as ubiquitous as
> variables, and are more special, so can do with the keyword to help them
> stand out.

Yes, typedef works for type declarations. The point of those
sentences was to illustrate other cases where unambiguous reserved
words are helpful and to illustrate how Pascal uses such words to make
parsing simpler.

>>> I'm trying to avoid lex/parse generators. I want to do it by hand
>>> and bootstrap from C/C++ until my language can compile itself.
>>
>> Again, wrong approach. The purpose of generators is to make your life
>> easier and to prevent you from making foolish mistakes. The language
>> of most generators is a variation on BNF and is relatively simple to
>> learn.

> It seems like that just offloads the work of a writing parser, to that
> of writing the BNF (which in my projects would never work as they are
> full of ambiguities). And there are still bits of code to fill in, if
> my memory serves me correctly.

And that's the point, if the parser generator is telling you that your
grammar is full of ambiguities, it is telling you that you are making
mistakes. You need to listen to it. You need to learn how to write
grammars that are unambiguous. Otherwise, you will continue to write
grammars where you need to insert "array" before some subscripts, but
not all, and you will keep trying to patch things up.

In the end, if you learn how to do that, you will find a new kind of
clarity and simplicity. Then, you can decide where you want to bend
the rules to make certain things less verbose, but you will be making
that decision from a place of understanding what the implications are.

> As BGB has said a few times, parsing is nothing compared with some
> of the other bits of a compiler.

Having built all the pieces of a compiler several times, I can say
that it is generally true, especially if you are working with a
well-defined syntax. However, syntax design of your own language is
actually quite hard to do well, easily on par with the complexity of
any part of a compiler. People make the mistake of the ease of
writing a parser for a well-defined syntax with the difficulty of
coming up with a well-defined syntax in the first place. This is
partially why most languages are syntactically similar to other
existing languages. People can add an operator or two and a new
keyword or two, but actually changing the shape of a language (and
making it work) is exceptionally difficult to do.

Even with a language as simple as Yacc that we wanted to change in
some specific ways in Yacc++, there were limits of how much change we
could do. Writing the parser was the easy part. Changing the syntax
and not making things irregular or ambiguous was far more challenging.

Just my opinions,
-Chris

******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: language twiddling, was Infinite look ahead required by C++?

Chris F Clark <cfc@shell01.TheWorld.com>Mon, 01 Mar 2010 02:43:06 -0500

Chris F Clark <cfc@shell01.TheWorld.com>
Mon, 01 Mar 2010 02:43:06 -0500