Re: syntax spec. for C ?
19 Jan 2006 01:08:44 -0500

          From comp.compilers

Related articles
syntax spec. for C ? (JR) (2006-01-07)
Re: syntax spec. for C ? (David R Tribble) (2006-01-12)
Re: syntax spec. for C ? (2006-01-12)
Re: syntax spec. for C ? (2006-01-19)
| List of all articles for this month |

Newsgroups: comp.compilers
Date: 19 Jan 2006 01:08:44 -0500
References: 06-01-00806-01-042
Keywords: C, syntax
Posted-Date: 19 Jan 2006 01:08:43 EST wrote:
> C (and its successors and relatives) on the Wikipedia:

(The syntax listed under the "C Syntax" page, which is directly
accessible from there, has a few errors in it, in fact)

As I said, time permitting, I'd show you a more concise C syntax. As
far as I know, the only ANSI-C specific features are the const and
volatile qualifiers and function prototype syntax.

This is an adaptation and extension of the K & R and ANSI-C grammars.

Phrase structure rules are listed as LHS -> RHS. The LHS has a
non-terminal, the RHS is a regular expression composed from terminals
and non-terminals. This is actually a text version of a HTML file I may
put up in the near future, so the terminals which were displayed in
boldface, will have to be gleaned from context.

Regular expression operators include A* to denote 0,1,2 or more of A;
A+ to denote 1,2 or more of A; [A] to denote 0 or 1 of A.

In some cases, terminals fall naturally into classes, so in place of a
phrase structure rule is a rule of the form LHS: list of terminals; the
"non-terminal" on the LHS being regarded as a terminal as far as any
other rule where it occurs in on the RHS is concerned.

Extra commentary is added to the syntax.

From the lexical level
X: Identifiers
C: Character strings; integer, character, floating point and
enumeration constants.
qual: Qualifiers (volatile, const)
store: Storage specifiers (auto, register, static, extern, typedef)
scalar: Scalar type specifiers (void, char, short, int, long, float,
double, signed, unsigned).

From the syntactic level
Top Level
Phase Structure Rule
Unit --> (FunctDef | Dec)+

Declarations, Definitions and Types
Phrase Structure Rule Comment
FunctDef --> Sp* Dc Dec* St Function definitions (only compound
statements allowed)
DecF --> Sp+ DcF (, DcF)* ; Fields
Dec --> Sp+ DcI (, DcI)* ; Top-level and block-level
Type --> Sp+ DcT Types
Type --> typeof Ex (Possible extension; not part of C)
DecP --> Sp+ DcP Parameters
DecE --> DcE (, DcE)* Enumerations

Types declarations are split, in C, between a prefix and postfix part.
The prefix is called a speciifer and contains all the substantial
information about the base where the type is drawn from. The subscript
-- in the original conception of K&R -- is an exponent or Kleene star.
Hence, as per the original thought of K&R, A* was to be equivalent to
the union of A^0, A^1, A^2, A^3, ...; and the type A[n] = A^n -- the
nth power product of the type A (i.e., an "array" with n copies of A in

The suffices along with the variable name(s) together comprise the

In modern terms, one may think of variables as being of a reference
type. So a variable X within the context of a declaration Sp X Dc, with
Dc being the suffix part of the declaration and Sp the prefix part, is
essentially being declared as X :: &(Sp Dc), where & denotes the
"reference of" type. This syntax, unfortunately, is only available in
C++ (and maybe C-99, I didn't check).

The best way to understand the K&R "theory" then is to think of the
operators & and * as being polymorphic operators of generic type:
                                                                  & :: T -> (1 -> T)
                                                                    * :: (1 -> T) -> T
and "1" as being a kind of special type whose only resident is the
0-tuple, (). Hence &A would mean (lambda ().A) and *A would mean (A()).

A variable X declared of type T then, itself, has type &T, as mentioned
above. There is an automatic dereferencing operator deref: &T -> T that
"demotes" a variable to a value in those contexts where only values are
permitted. In the expression syntax below, the contexts where variables
are permitted or required are denoted by L in the following:
                                                L = E; L op= E; L inc; inc L (where inc is ++
or --); &L
                                                L.F (variable allowed in this context but not
The expressions that make variables are L.F, P->F, *P, E[P], P[E] and
X, where P stands for a pointer value and X an actual name (i.e.,
identifier) of a variable.

The pointer type may then be defined by (T *) = (1 -> &T); an array
type T[n] as T^n. The interpretation mentioned above is then enforced
by the use of automatic conversions,
                                                                                    T^n -> T*,
hence making T* as a Kleene star T* = T^0 + T^1 + T^2 + ...

This could be further interpreted within the context of the typed
Lambda calculus (or an extension thereof) if one were to extend the
"lambda" into a full-fledged operator with type signature: lambda &T.U
:: T=>U (where T=>U denotes the type corresponding to functions that
take a value of type T and return a value of type U). One then needs an
interpretation for an expression like lambda *p.e, such that in context
(p = c? &x: &y) it would take on the value (c? lambda x.e: lambda y.e).
Nobody's ever extended the lambda calculus this way to make it a
calculus sufficiently expressive to directly embody a C-like language.

But all the foregoing, in turn, will help explain where the somewhat
unusual (but, in retrospect, highly intuitive) syntax for types and
type declarations comes from.

Most of the difference between C and C++ resides in the declarative
part of the syntax. And it's a rather unfortunate feature of C++ that
instead of consolidating disparate elements of the syntax, it actually
further fragments it (almost as bad as FORTRAN fragments its syntax)
with the introduction of a high degree of non-orthogonality. So, trying
to write out a syntax for C++ in the same vein as this syntax would be
a major project (but one I actually did, and may post in the near
future and put up on the Web).

Type Specifiers
Phrase Structure Rule Comment
Sp --> qual Only in Dec, DecP, FunctDef,
Sp --> store Only in Dec, DecP, FunctDef.
Sp --> scalar
Sp --> [struct | union | enum] X
Sp --> struct [X] { DecF+ }
Sp --> union [X] { DecF+ }
Sp --> enum [X] { DecE }
Sp --> ( Type ) (Possible extension; not part
of C)

Type Declarator Contexts
Phrase Structure Rule Comment
DcE --> X [= ExC] Enumerations
DcI --> Dc0 [= ExS] Top-level and block-level declarations
DcF --> Dc0 Fields
DcF --> [Dc0] : ExC Bit-Fields
DcP --> Dc0 Parameters
DcT --> Dc0 Types

Type Declarators
Phrase Structure Rule Comment
Dc0 --> Dc1 (2 precedence levels)
Dc0 --> * qual* Dc0 Pointers

Dc1 --> X Variable name
(not allowed in DcT)
Dc1 --> Only in DcP, DcT
Dc1 --> ( Dc0 ) (but ( ) is not
Dc1 --> Dc1 [ [ExC] ] Arrays
Dc1 --> Dc1 ( [DecP (, DecP)* [, ...]] ) Functions prototypes
Dc1 --> Dc1 ( X (, X)* ) K&R prototypes (not
allowed in DcT)

The syntax minus the precedence rules and constraints is:
Dc --> * qual* Dc
Dc --> [X]
Dc --> ( Dc )
Dc --> Dc [ ExC ]
Dc --> Dc ( [DecP (, DecP)* [, ...]] )
Dc --> Dc ( X (, X)* )

Statements may be thought of as "contexts" or "holes" in which
expressions may be fitted to obtain a larger expression as a result.
Thus, if S is a statement, and S[] its corresponding context, then S[Q]
would be naturally interpreted as the value of Q "after S" is run.

This manner of interpretation automatically implies that contexts may
be representative of infinitely large expressions. Hence, the statement
                                  while (E) S
as a context may be regarded as an infinitary context as follows
                                  while (E) S [Q] = E? S[E? S[E? S[...]: Q]: Q]: Q
The comma statement is the composition of contexts, (S,T)[Q] = S[T[Q]].
The branch statements are contexts pertaining to conditionals:
                            if (E) S [Q] = E? S[Q]: Q
                            if (E) S else T [Q] = E? S[Q]: T[Q]
(the switch is harder to represent in this way, so I'll forego the
description of switch, case and default).

A labelling may be thought of as marking a subexpression in the
infinite expression tree; e.g. x: S is interpreted as the context (x:
S)[Q] = x: S[Q], with x marking S[Q]. A goto is then an explicit
indication of a subexpression; hence (goto x;)[Q] = x.

The continue and break statements are likewise interpreted, with them
corresponding to the "c" and "Q" in the following interpretations of
loop statements, with each continue and break paired off to the nearest
enclosing loop (or, in the case of "break", switch) statement:
                          while (E) S [Q] <--> c: E? S[c]: Q
                          do S while (E); [Q] <-> x: S[c: E? x: Q]
                            for ([A]; B; [C]) S [Q] <-> [A;] x: B? S[c: [C;] [x]]: Q

An expression statement A; is interpeted in the context Q as (A;)[Q] =

A description of return statements and procedure calls will be avoided
here, since this require extending the notion of infinitary expressions
yet further.

Theoretically, C can be thought of as a concrete syntax for providing
finitary descriptions of infinitary expressions in an extended version
of the typed Lambda calculus that provides also for a reference type
with the above conversions involving pointers, variables and arrays.

Phrase Structure Rule Comment
St --> X : St Labelled statements
St --> case ExC : St
St --> default : St

St --> { Dec* St* } Compound statement

St --> if ( Ex ) St [else St] Branch statements
St --> switch ( Ex ) St

St --> while ( Ex ) St Loop statements
St --> do St while ( Ex ) ;
St --> for ( [Ex] ; [Ex] ; [Ex] ) St

St --> [Ex] ; Expression statement

St --> goto X ; Continuation statements
St --> continue ;
St --> break ;
St --> return [Ex] ;

Expression Contexts
Phrase Structure Rule Comment
Ex --> Ex0 General expressions
ExS --> Ex1 Structured expressions (in initializers)
ExC --> Ex2 Constant expressions

Constant expressions are required in enumerator initializers,
bit-fields, array declarators and case labels and are characterized by
the absence of variables within them.

Expressions Phrase Structure Rule Comment
Exi --> Exi+1 (i = 0,...,15) (17
Precedence Levels)
Ex2 --> Ex3 ? Ex0 : Ex2 Conditional

Infix operators:
Ex0 --> Ex0 , Ex1 Sequence
Ex1 --> Ex14 assOp Ex1 Assignment
Ex3 --> Ex3 || Ex4 Logical OR
Ex4 --> Ex4 && Ex5 Logical AND
Ex5 --> Ex5 | Ex6 Bit-wise OR
Ex6 --> Ex6 ^ Ex7 Bit-wise XOR
Ex7 --> Ex7 & Ex8 Bit-wise AND
Ex8 --> Ex8 eqOp Ex9 Equality
Ex9 --> Ex9 relOp Ex10 Comparison
Ex10 --> Ex10 shOp Ex11 Bit-shift
Ex11 --> Ex11 addOp Ex12 Additive
Ex12 --> Ex12 mulOp Ex13 Multiplicative

Prefix operators:
Ex13 --> ( Type ) Ex13 Type-casting
Ex14 --> unOp Ex13 Prefix operators
Ex14 --> incOp Ex14 Prefix increment
Ex14 --> sizeof Ex14 Expression/type size
Ex14 --> sizeof ( Type )

Postfix operators:
Ex15 --> Ex15 incOp Postfix increment
Ex15 --> Ex15 accOp X Structure access
Ex15 --> Ex15 [ Ex0 ] Array access
Ex15 --> Ex15 ( Ex1 (, Ex1)* ) Function call
Ex1 --> { Ex1 (, Ex1)* [,] } Arrays/Structures (Allowed only in

Ex16 --> ( Ex0 ) Sub-expressions
Ex16 --> C Constants
Ex16 --> X Variables

assOp: =,*=, /=, %=, +=, -=, <<=, >>=, &=, ^=, |=
eqOp: ==, !=
relOp: <, >, <=, >=
shOp: <<, >>
addOp: +, -
mulOp: *, /, %
unOp: &, *, +, -, ~, !
incOp: ++, --
accOp: ., ->

The syntax with the constraints suppressed:
Ex --> Ex ? Ex : Ex
Ex --> Ex op Ex
Ex --> ( Type ) Ex
Ex --> op Ex
Ex --> sizeof Ex
Ex --> sizeof ( Type )
Ex --> Ex op
Ex --> Ex accOp X
Ex --> Ex [ Ex ]
Ex --> Ex ( Ex (, Ex)* )
Ex --> { Ex (, Ex)* [,] }
Ex --> ( Ex )
Ex --> C
Ex -> X

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.