Re: Semantics make a grammar ambiguous (RKRayhawk)
7 May 1999 01:11:27 -0400

          From comp.compilers

Related articles
Semantics make a grammar ambiguous (Bill A) (1999-04-26)
Re: Semantics make a grammar ambiguous (Robert Sherry) (1999-04-29)
Re: Semantics make a grammar ambiguous (1999-04-30)
Re: Semantics make a grammar ambiguous (Bill A.) (1999-04-30)
Re: Semantics make a grammar ambiguous (Tzvetan Mikov) (1999-05-03)
Re: Semantics make a grammar ambiguous (1999-05-07)
| List of all articles for this month |

From: (RKRayhawk)
Newsgroups: comp.compilers
Date: 7 May 1999 01:11:27 -0400
Organization: AOL
References: 99-04-087
Keywords: C, parse

"Bill A" <>

In the following C fragment

typedef struct { int i; } str;

void func (void)
str *str;

I'm having difficulty parsing this with an LALR parser (generated from
a parser generator) because the second str should be an id but matches
a typedef definition so a typedef terminal is returned from the lexer.
What is the traditional way to handle this ambiguity? Is this a good
case of why most production compilers are recursive decent?


Another idea that relates to this discussion is lexer start states.

But you do not have to have a lexer that supports this feature or even
use the feature available to you in your lexer of choice. You really
just need to think of modalities in the lexer.

The notion would be that once you encounter a typedef invocation (not
a declaration of a typedef nor a redeclaration of a typedef) enter a
typedef lookup supression mode until you hit the generic statement
terminator, block close, or error recovery signal from the parser.

When in this mode either suspend all typedef lookups, or suspend that
specific typedef lookup. If you have multiple symbol tables, then the
suspension prevents lookups on the typedef symbol table altogether (or
mediates it by preventing such a lookup if the text has a value
identical to current typedef). If you have a composite symbol table,
you can modally reject hits on typedefs entirely, or reject specific
hits on current typdef - possibly necessitating a design that allows
the search to restart after the false hit.

In other words this can be viewed as simply a lexical issue, and does
not involve the question of whether the grammar component is recursive
descent. Your question posits a programming paradigm in which the
typedef tag, a typedef name, and a variable name are separate text
domains. That is essentially a lexical concept. (It can also be a very
confusing concept to the next programmer down the line! Indeed, C
beginners sometimes have difficulty comprehending the difference
between a tag and a typedef name.)

Although, as mentioned in another post, you do have to have a clear
idea of the error recovery machinery in the parser. You will probably
need to disengage any typedef modal code in the lexer (implemented
with start states or simple switches), upon error detection in the
parser. For example, technically, it might be worth considering
flipping the typedef suppression switch back to normal (off) only upon
the equivalent of "error okay" (yyerrok in yacc).

But this whole road has a bumpy feel to it. If you do get *str
established as a variable. Is your intent then to have this block all
further uses of the str typedef in that scope? You see there is a
bigger issue here. What is your programming paradigm?

Let's say that you have parsed

    typedef struct { int i; } str;

somewhere upstream, and now you successfully parse

    str *str;

What happens when you see code that starts with

    str ...

I know that parser technology can look ahead, but what of the lexer?
Is str now an ID (a dataname), can I not declare anything else with
the str typedef until variable (ID) str goes out of scope?

My vary next statement might be

    str *str2;

That will work if typedef lookup is fully re-engaged, but then if my next
statement is


it will surely not parse.

So the paradigm of these distinct name spaces seems only relevant to
an application where we can have mutually exclusive domains for the
name spaces along the sequence of tokens. That implies a grammar that
has leading tokens (like COBOL verbs or data item level numbers) to
harness alternating domains. This does not seem progressive.

Further, any time a lexeme can genuinely be two distinct kinds of
things, erroneous code could parse successfully! Silent errors are
the worst. Two things which are different in kind perhaps should have
two distinct names.

For example, some shops impose the standard that typedefs are all
caps. Thus, as long as your projected language distinguishes case,
STR could not be mistaken for str.

At any rate, it is possible to view this problem as not really
directed by the syntax. Instead the discovery by the lexer that it has
a current invocation of the typedef facility can suspend further
recognition of that string as a typedef until a defined next
event. That type of cheating has been built into lexers with start
state features, or can be implement by hand. Multiple name spaces
requires some kind of modality in the lexer if it is to support the
parser's distinct handling of look-a-like lexemes. The problem really
is that you may not want to do this because a lexically confused
expression could easily have a semantically valid interpretation which
could be a killer in an important application.

Best Wishes,

Robert Rayhawk

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.