Re: no reserved words

Jerry Leichter <leichter@smarts.com>
13 Mar 1998 00:02:26 -0500

          From comp.compilers


From: Jerry Leichter <leichter@smarts.com>
Newsgroups: comp.compilers
Date: 13 Mar 1998 00:02:26 -0500
Organization: System Management ARTS
References: 98-03-091
Keywords: parse, yacc

| I would like to know if common tools like lex and yacc are capable of
| handling languages that have no reserved words.
| [It's nearly impossible using yacc. The 1972 yacc tech report
| basically says "don't do that". Unless you need to parse an existing
| language like PL/I without reserved words, I don't see the point.
| -John]


The point is exactly what the point was when PL/I was designed: A
language with reserved words is a language in which you have to *know*
all the reserved words in order to write programs. This means you
have to know, to some degree, *all* of the language, even if you only
want to use part of it. (Recall that PL/I was intended to serve the
then-disjoint communities that did scientific programming - mainly
FORTRAN in the US and Algol in Europe - and "data processing" - mainly
COBOL. This led to a language with features useful in one domain but
basically unused in the other. The PL/I designers didn't want users
on either side of the fence to have to learn about stuff on the other
side of the fence. Also, COBOL provides a great example of what
happens when you have a large number of keywords: As I recall, ZERO,
ZEROS, and ZEROES were all keywords. COBOL programmers had to
memorize a rather long list of "forbidden" words.)


Languages that use reserved words are hard to extend. You're stuck
with two choices: Overload an existing keyword, often to the point
that it's meaningless; or introduce new keywords and break existing
code. C has (mainly) taken the first route, C++ (recently) the
second. I recently spent quite some time helping a programmer - who'd
already spent considerably more time - figure out that the
incomprehensible syntax error messages from a C++ compiler were due to
the use of 'typename' - now a reserved word - as an identifier. C++'s
list of reserved words is probably now comparable in length to the
COBOL list that influenced the PL/I designers.


When all you have is a hammer, everything begins to look like a nail.
Because C used reserved words, and the yacc designers were
concentrating on C and C-like languages, yacc doesn't handle
non-reserved-word languages well. So new languages reserve their
keywords in order to be compilable with yacc. A vicious cycle.


For the record: There is a general technique for "unreserving" a
keyword. Define everything as you've always done; change the name of
your Id token to, say, SimpleId; then define a non-terminal Id which
is SimpleId | <the keyword>. Your semantic routines will obviously
have to be prepared for this, but if you put all your keywords in the
symbol table this should be pretty straightforward.
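
To make the technique concrete, here is a minimal sketch in plain C of
the same idea as it might look in a hand-written scanner and parser.
It is not yacc input, and every name in it (the token kinds, classify,
is_id) is hypothetical, invented for illustration. The scanner still
reports keywords as keywords; only the parser's notion of "Id" widens:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; a real scanner would have many more. */
enum token_kind { TOK_SIMPLE_ID, TOK_IF, TOK_WHILE };

/* Keywords live in a table, standing in for symbol-table entries. */
struct keyword {
    const char *spelling;
    enum token_kind kind;
};

static const struct keyword keywords[] = {
    { "if", TOK_IF },
    { "while", TOK_WHILE },
};

/* Scanner side: unchanged from the reserved-word design. */
enum token_kind classify(const char *word)
{
    assert(word != NULL);
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strcmp(word, keywords[i].spelling) == 0)
            return keywords[i].kind;
    }
    return TOK_SIMPLE_ID;
}

/* Parser side: the non-terminal Id, written in yacc style as
 *     Id : SimpleId | IF | WHILE ;
 * Anywhere the grammar used to demand the old Id token, it now
 * accepts any token for which this returns nonzero. */
int is_id(enum token_kind kind)
{
    switch (kind) {
    case TOK_SIMPLE_ID:
    case TOK_IF:
    case TOK_WHILE:
        return 1;
    }
    return 0;
}
```

With this arrangement, classify("if") still yields TOK_IF, so statement
rules that start with the keyword keep working, while is_id() lets the
same token stand where an identifier is expected.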


Of course, after you do this, your grammar may be ambiguous. In fact,
the underlying *language* may be ambiguous if you "unreserve" its
keywords. Most languages designed with reserved keywords in mind will
meet this fate. Thus, in C, if 'if' might be a function call,


if (x)
-1;


*might* be a call to "if" with argument x, with 1 then subtracted from
the result (and the value discarded). If your goal is to design a language
without reserved words, you have to change things at this point. The
'if' ambiguity would go away if C, like Perl, required that the
conditional be followed by a block, not a statement.
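
The two readings can be seen side by side in real C. Since C will not
let a function be named 'if', this sketch uses a hypothetical function
called If (capital I) to stand in for the unreserved keyword; the
function bodies and the assignments to value are assumptions added only
so that each reading produces an observable result:

```c
/* Hypothetical function standing in for a user-defined "if". */
static int If(int x)
{
    return x + 10;
}

/* Reading 1: a call to If with argument x, then subtract 1. */
int read_as_call(int x)
{
    int value =
        If (x)
        -1;
    return value;
}

/* Reading 2: the ordinary conditional guarding the expression -1. */
int read_as_conditional(int x)
{
    int value = 0;
    if (x)
        value = -1;
    return value;
}
```

The token sequences differ only in the spelling of the first word, which
is exactly the distinction a parser loses once 'if' is unreserved.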


Some keywords are "identifier-like". For example, no new ambiguities
would be introduced in C if 'int' could be used as an identifier,
since syntactically it's just about indistinguishable from Int after:


typedef int Int;


(In fact, this is one of the rougher areas in parsing C/C++, since
typedef makes the grammar context-sensitive.) There is, however, one
difference between 'int' and 'Int': In a nested block, 'Int' might be
re-declared as an identifier. We could allow 'int' to be so
re-declared as well, but then there would be no way to get at the
underlying 'int' type in that context. This is mainly caused by C's
declaration syntax - and made even worse in C++. Both languages have
ambiguous constructs, which could syntactically be either declarations
or statements; both make the rule that if a declaration is possible,
that parse is taken. (I've seen C++ compilers get this wrong - and
*subtly* wrong, where two successive statements with identical syntax
are parsed differently. Weird!) PL/I didn't have this problem
because declarations always start with DECLARE (or DCL);
Pascal-descended languages are similar. (Modula-2 doesn't reserve the
names of its built-in types, though it does provide them with special
handling with respect to modules.)
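
As a small illustration of the re-declaration point, the following C
fragment is a sketch (the function name shadow_demo and the values are
made up) in which 'Int' is a type name in the outer block and an
ordinary variable in the inner one. The parser has to consult the
current scope to decide which it is looking at:

```c
typedef int Int;

int shadow_demo(void)
{
    Int a = 6;            /* Int names a type: this is a declaration */
    int b = 0;
    {
        int Int = 7;      /* inner block re-declares Int as a variable */
        b = Int * a;      /* Int is now an operand: multiplication */
    }
    return b;
}
```

Inside the inner block there is no way left to say "the type Int", which
is the same loss that would occur for 'int' if it could be re-declared.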


Interestingly, C++ has finally had to deal with this ambiguity, since
otherwise some declarations based on templates become unparseable.
The 'typename' keyword I mentioned above, placed before an identifier,
tells the compiler that the identifier *must* be construed as a type
name.


With suitable care, it's quite possible to come up with a rather
natural language that requires no reserved words. After all, PL/I was
like that - and while there are certainly criticisms to be made of
PL/I, I've never heard any that really centered on its syntax.


It's certainly true that a language with no reserved words is open to
abuse. Well, with power comes responsibility.


There's a tendency to start with a small language with relatively few
reserved words and, whenever any extensions are added, just reserve
the keywords used with those extensions, even if reserving them isn't
necessary to make the language/grammar unambiguous. It may be
reasonable to take a middle road: Choose a small core of keywords to
be reserved, and leave the rest as unreserved. Again, Modula-2 did
this: The basic syntactic elements are defined using reserved words,
but the built-in types have names that are just identifiers. The
language is small enough that this probably cuts the list of reserved
words in half.


-- Jerry

