Re: A big easy subset of the CFGs

Chris F Clark <>
19 Jun 1998 23:45:21 -0400

          From comp.compilers

Related articles
A big easy subset of the CFGs (Matt Timmermans) (1998-06-18)
Re: A big easy subset of the CFGs (1998-06-19)
Re: A big easy subset of the CFGs (Chris F Clark) (1998-06-19)
Re: A big easy subset of the CFGs (Matt Timmermans) (1998-06-24)
Re: A big easy subset of the CFGs (Salvador Valerio Cavadini) (1998-06-24)
Re: A big easy subset of the CFGs (Chris Clark USG) (1998-07-01)
| List of all articles for this month |

From: Chris F Clark <>
Newsgroups: comp.compilers
Date: 19 Jun 1998 23:45:21 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 98-06-104
Keywords: parse, errors

Matt Timmermans asked for,
> A subset of CFGs bigger than LR(1). In particular, it should
> include most of what you can do with LR(1)+lex, using characters as
> terminals.

If I understand it, your desire actually has 3 parts.

1) A large "non-bizarre" grammar set
2) Using characters rather than tokens in the parser
3) regular expression operators in the grammar

For 1, the response from Dennis Mickunas should provide plenty of
starting points. In particular, I believe Tai's non-canonical parsing
may be an exact match for the set you are looking for. Another
alternative is the Lang-Tomita parsing approach. Horspool has a nice
paper which follows that line of thought. From what I gather, Visual
Parse++ implements LR-regular parsing--and their natural language
parsing sounds like the Lang-Tomita approach.

For 2, Tai had a Sigplan article entitled something-like "Scannerless
Non-Canonical SLR Parsing" which addressed using a parser without a
lexer. The Annagram parser generator merges parsing and
lexing. (Yacc++, our tool, uses roughly the same notation for both,
but keeps them distinct--partially because the lexer parser boundary
provides a good place for user customizations (i.e. hacks).)

For 3, there are two approaches. Most parser generators follow
Cellentano, which is to translate the regular expression operators
into hidden rules. The other alternative is to implement the regular
expressions by directly expanding them as part of the table generation
process. Nakata and Sassa were the first to publish an approach to
doing that.

Given what your desires appear to be, my guess is that your goal is to
try to have a parser which recovers gracefully from single-character
errors (e.g. missing closing quotes).

Now, if I may immodestly offer some advice having implemented several
parser generators headed roughly in that direction . . . .

The task is harder than it looks.

The main reason is that you have to combine several extensions to LR
parsing to get to your goal, and the complexity increases non-linearly
as the extensions have interactions.

For example, the original Yacc++ beta version supported the LALR
version of non-canonical SLR parsing (with regular expressions). In
the shipped version we switched to minimal state LR parsing (with
regular expressions). The non-terminal lookahead version is still not
incorporated into a shipped version. That is partly because the goals
have grown and now we would like the lookahead to be "infinite", plus
there are other already released complicating features like syntactic
predicates to consider.

However, a second reason why we haven't pursued non-canonical parsing
more actively is that it actually complicates the conflict reporting
process. The types of conflicts which can occur in a non-canonical
parser are more subtle than in a simple LR parser. Our experiments
suggest that writting and debugging a non-canonical grammar may
actually be more difficult (once you get past trivial grammars).

The third complicating issue is table size. No one implements LR
parsing tables using Knuth's canonical LR parsing algorithm--the
resulting tables grow exponentially with grammar size and are simply
too big. Most parser generators either restrict themselves to the
LALR subset or use a "minimal state" method, splitting the states only
when the context proves useful for resolving conflicts. Multiple
token lookahead is a similar problem. The generator needs to carefully
extend the lookahead only where it is useful to resolve conflicts.

The last point is where the subtleties all come together. We took our
C++ grammar, which is ambiguous because the language is, removed all
the precedence directives that normally guide the generator and fed it
to the non-canonical version of Yacc++ to determine exactly which of
the ambiguities in the grammar it would spot and whether it would
resolve any of them. The results were not pretty. After several
hours the generator had created many thousands of states (to resolve
conflicts that it could resolve) and numerous conflict messages for
ones that it couldn't and was not finished, but we were.

Part of the goal of implementing non-canonical parsing for us was to
simplify grammar writing by having the parser generator resolve more
conflicts automatically. However, by defering the conflicts until
later in the rules, non-canonical parsing made the conflicts which it
found more difficult to relate to the cause of the problem. As such,
it was a failure for us. Too many of our users take grammars of
questionable heritage and feed them to our tool expecting our tool to
resolve their problems magically (and sometimes it does, but when it
doesn't we can't have it making their life worse). We will continue
to work on non-canonical parsing, but until we think we have it
useable, it will remain a prototype.

Hope this helps, (and wasn't *too* discouraging)

Chris Clark Internet :
Compiler Resources, Inc. CompuServe : 74252,1375
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
Web Site in Progress: Web Site :

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.