RE: language design for parsing, was C++ intermediate representation.

Quinn Tyler Jackson <quinn-j@shaw.ca>
18 May 2005 12:04:20 -0400

From comp.compilers

Related articles
Re: language design for parsing, was C++ intermediate representation. mefrill@yandex.ru (2005-05-18)
*RE: language design for parsing, was C++ intermediate representation. quinn-j@shaw.ca (Quinn Tyler Jackson)* (2005-05-18)**
RE: Separation of Syntax and Semantics (was language design for parsin quinn-j@shaw.ca (Quinn Tyler Jackson) (2005-05-19)
Re: language design for parsing, was C++ intermediate representation. mefrill@yandex.ru (2005-05-19)

| List of all articles for this month |

From:	Quinn Tyler Jackson <quinn-j@shaw.ca>
Newsgroups:	comp.compilers
Date:	18 May 2005 12:04:20 -0400
Organization:	Compilers Central
References:	05-05-162
Keywords:	C++, parse, design, comment
Posted-Date:	18 May 2005 12:04:20 EDT

Vladamir said:

> I am a little bit confused... As I correctly understood!? It seems
> that we are talking about two diferrent things.
>
> I am talking about programming language parsing, not about formal
> grammar manipulating, not about correctness proving, only about
> parsing. What are talking about?

Exactly the same thing.

Suppose we have a programming language L that has this construct:

#@re@a@

Where # is some symbol (such as "s"), re is a regular expression as commonly
understood (except that it cannot contain an unescaped #, @ is anything
other than #, and a is some string of letters other than @.

For instance:

s/a.*e/orange/i
s|/c.t/|/mouse/|i

Now, without code, how can that be written in fewer than a few hundred
or more context-free productions?

That's a real programming language construction that occurs in the
wild, but it cannot be handled by lex, yacc, predicated parsers. So,
it is handled in the code (maybe the lexer code). The moment it is
handed over to the code, all bets are off in terms of detecting
problems with the formal grammar.

The parser *may* be broken by such ad hockery. It may not. Whether it is or
is not is a matter of trust. That is to say, the amount of trust you have in
the implementation. You have to believe that the person responsible for
coding the lexer didn't write it in such a way that this isn't accepted:

s|up/down/i

Sure, in a simple case like this, one can probably trust the implementation.
But part (not all) of the point of formal grammars is that they can be
automatically checked for stupid mistakes. In grammars with hundreds of
productions (since we are talking about programming languages), there can
easily be enough of those -- introducing them by hacking in Turing powerful
fixes to such context-sensitive constructions.

I realize that many believe religiously in the separation of syntax
and semantics, but I haven't been convinced it is such a good
thing. If such constructions can be handled in a formal way -- they
should be -- unless the more formal way is such a monstrosity that it
cannot be maintained, or unless no parser can be constructed to parse
such languages from formal specifications in reasonable time.

Consider the following rule:

"In C++, no variable shall be referenced before it is declared, unless
the declaration is a member variable in a class, and unless the
reference is in an inline member function declared and defined before
the member variable in that same class."

class Foo
{
public:
      Foo : m_x(_x) (int _x) {}

      void Bar() {m_i = 10 * m_x;}

private:

      int m_i;
      int m_x;
};

m_i is declared after it appears in use in the code, but the above is
legal C++, whereas the following is not:

class Foo
{
public:
      Foo : m_x(_x) (int _x) {}

      void Bar() {m_i = 10 * m_x;}

private:

      //int m_i; -- note that this is commented out
      int m_x;
};

The moment the parser sees };, an error should be generated. Because
of the technologies available, this construction will be universally
dealt with outside of the formal specification of the C++
grammar. However, with a sufficient set of tools, the above rule can
be expressed formally, and can be checked in decent time in a real
parse. The grammar thereby reflects the formal specification of this
particular set of semantics.

The above scenario appears in a paper currently under
consideration. It's just one of the many things that can be cleanly
and efficiently represented with a non-context-free grammar formalism.

Some won't like it. Others will. But it's still about programming
languages and compilers.

Whether or not such constructions as these *should* exist in
programming languages or not is a religious issue. The fact is -- they
do, and they must be parsed. Whether or not they are handled in the
formal specification or the "code" is another issue: a technological
one.

--
Chev. Quinn Tyler Jackson
http://members.shaw.ca/qjackson/

[You can't separate syntax and semantics. Back when I was doing a
logic and computability seminar in the math department in college, one
of the things we proved was that anything you can define semantically
you can in principle put into the syntax (perhaps with severe bloat,
which mathematicians don't care about) and vice-versa. So here in
computerland the question is really what's easier to understand and to
maintain when you're drawing the line. -John]

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

RE: language design for parsing, was C++ intermediate representation.

Quinn Tyler Jackson <quinn-j@shaw.ca>18 May 2005 12:04:20 -0400

Quinn Tyler Jackson <quinn-j@shaw.ca>
18 May 2005 12:04:20 -0400