Re: Error reporting in LR parsers

heirich@cs.ucsd.edu (Alan Heirich)
8 Aug 89 03:56:55 GMT

          From comp.compilers

Related articles
Error reporting in LR parsers worley@compass.com (1989-08-02)
Re: Error reporting in LR parsers djones@megatest.uucp (1989-08-04)
Re: Error reporting in LR parsers lai@mips.com (1989-08-07)
Re: Error reporting in LR parsers heirich@cs.ucsd.edu (1989-08-08)
Re: Error reporting in LR parsers heirich@cs.ucsd.edu (1989-08-08)
Re: Error reporting in LR parsers rusty@garnet.Berkeley.EDU (1989-08-10)
Re: Error reporting in LR parsers djones@megatest.com (1989-08-10)
Re: Error reporting in LR parsers djones@megatest.com (1989-08-10)
Re: Error reporting in LR parsers djones@megatest.com (1989-08-10)
Re: Error reporting in LR parsers markg@well.sf.ca.us (1989-08-15)
Re: Error reporting in LR parsers eachus@mbunix.mitre.org (1989-08-14)
| List of all articles for this month |

From: heirich@cs.ucsd.edu (Alan Heirich)
Newsgroups: comp.compilers
Date: 8 Aug 89 03:56:55 GMT
References: <1989Aug4.032053.753@esegue.uucp> <1989Aug6.024931.10014@esegue.uucp>
Organization: EE/CS Dept. U.C. San Diego

This is the second of two postings. The first posting dealt with objections
to automatic error recovery in yacc. This posting describes modifications to
DECUS yacc to permit automatic diagnostic generation.


In an earlier message I mentioned that I have modified yacc to provide useful
diagnostic messages to describe the parse (for debugging a grammar) and to
describe the expected tokens when a syntax error occurs (to debug a user).
I received many requests for the source code. I have been advised that the
version of DECUS yacc which I modified is in fact pirate AT&T code, so I don't
think I can redistribute it. But I will describe the mods, which are easy to
make. [If you have AT&T source, these changes should be straightforward to
apply. -John]


The general goal is to encode the information from the parser description
file, describing states, goto sets, and lookahead sets, in a compact form
that can be accessed at run time. Once you have done this you can use any
number of schemes of varying complexity to generate messages describing
the parse process and the expectations of the parser. I will leave most
of this to the ingenuity of the reader, and simply describe how to get the
essential information out of the yacc data structures.


The changes are nearly all in the routine "output". This routine writes out
the parser description to the description file. You will want to modify it
to write out five pieces of information:


    (1) a set of strings containing token names
    (2) a set of strings containing nonterminal names
    (3) a set of states containing items sets
    (4) a set of states containing lookahead sets
    (5) a set of states containing goto sets


(1) and (2) are easy. The token names are in an array of records.
tokset[i].name is the string containing the name of token #i. Use the
TLOOP macro on i to print out all the token names, i.e.
      TLOOP(i) { printf ("\"%s\",\n", tokset [i].name); }
Since TLOOP starts at 1, print a null string before you start this loop to
account for token #0 (which apparently is nonexistent as far as yacc is
concerned).
      The nonterminal names are in another array of records, nontrst. Use the
NTLOOP macro, which starts at 0, as follows:
      NTLOOP(i) { printf ("\"%s\",\n", nontrst [i].name); }
    The item sets are useful so that the run time parser can show you a
description of the current state, rather than just emit a message "in state x".
You guessed it, the SLOOP macro steps through all of the states:
      SLOOP(i) { writeState (i); }
the writeState function (which you must create) is a bit trickier than simply
printing out a string. Copy the code from the beginning of output() which
writes out the state description to the description file, then selectively
delete to get what you want. You should end up with approximately the
following:
SLOOP(i)
      {
      nolook = !(tystate [i] == MUSTLOOKAHEAD);
      closure(i);
      nolook = 1;
      ITMLOOP(i,pp,qq) { writeItem (pp-> pitem); printf ("-99999,\n"); }
      if (tystate [i] == MUSTLOOKAHEAD)
                  WSLOOP (wsets + (pstate[i+1]-pstate[i]), u)
                        if (*(u->pitem) < 0)
                              {
                              writeItem (u->pitem); printf ("-99999,\n");
                              } /* if */
    What is happening here? Well, we first recompute closure for the state.
That sets important flags in the tystate array. We then loop through all
the normal items in the state, thanks to ITMLOOP, and write them out (more
on this below). Then, if the state requires lookahead we write out another
group of items (exactly what they are escapes me at the moment. I think
they are items that reduce on the empty string, or something like that).
      What must writeItem(pp) do? Approximately this (again, adapted from
the code that writes out the state descriptions to the description file):


writeItem (pp) int *pp;
      {
      int *p;
      for (p = pp; *p > 0; ++p)
            { } /* (skip ahead until *p > 0) */
      p = prdptr [-*p];
      printf ("%d,", *p);
      while (1)
      {
      p++;
      if (p == pp) printf ("-88888,");
      if (*p <= 0) break; else printf ("%d,", *p);
      } /* while */
      if (*pp < 0) { printf ("%d,", *pp); }
      } /* writeItem */


This writes out the elements of a single item, encoded as integers.
-88888 represents the location of the . (or underscore) in the item, i.e. the
current position of the parse. A positive number is the number of a token
or nonterminal name; if the number is > NTBASE (a yacc constant) then it
represents a nonterminal (and NTBASE must be subtracted to get the number
of the nonterminal); if the number is < NTBASE then it can be used directly
as an index into the token name table (described above). A negative number
indicates a rule that is the be reduced; it always appears at the end of
an item, and is negated (to get the actual rule number, just negate the
value printed out).
      Notice that after writeItem is called, the value -99999 is printed out.
This is simply an end-of-item marker.
      For (4), the lookahead sets: use SLOOP again, and for each state i,
compute closure(i). Well, here is the code, including some important details:


      SLOOP(i)
      {
      nolook = !(tystate [i] == MUSTLOOKAHEAD);
      closure (i);
      nolook = 1;
      aryfil (temp1, ntokens + nnonter + 1, 0);
      WSLOOP (wsets,u)
            {
            c = *(u -> pitem);
            if (c > 1 && c < NTBASE && temp1 [c] == 0)
                  {
                  printf ("%d,", c);
                  temp1 [c] = 1;
                  } /* if */
            } /* WSLOOP */
      } /* SLOOP */


      Finally, (5), the goto sets:


      SLOOP(i)
      {
      nolook = !(tystate [i] == NOLOOKAHEAD);
      closure (i);
      nolook = 1;
      aryfil (temp1, ntokens + nnonter + 1, 0);
      WSLOOP (wsets, u)
            {
            c = *(u -> pitem);
            if (c > NTBASE && temp1 [(c -= NTBASE) + ntokens] == 0)
                  {
                  temp1 [c + ntokens] = 1;
                  printf ("%d,", c);
                  } /* if */
            } /* WSLOOP */
      } /* SLOOP */


This basically does it. What you get from all this is a lot of printed
information. Write this to a text file and use it to initialize a set
of arrays which you compile into your parser. You can then access the
information while your parser runs. This allows it to describe states
in the same form as in the description file; describe the expected lookahead
tokens or goto items (i.e. the expected nonterminals, which are often much
more explanatory than the lookahead tokens); and describe the tokens as
they are obtained from yylex (very helpful).


This method certainly isn't perfect -- the biggest problem is that an end
user can be overwhelmed with information, since a programming language like
C or Pascal can generate states with 30 valid lookahead tokens. (Using
the nonterminal names from the goto sets can help in this situation,
although only partially). Still, it's a useful start, and does provide
helpful information. Also, be aware that program errors tend to be static
semantic errors rather than syntax errors, so the gravity of the situation
is somewhat less than one might think...


But it is true, what we all *really* want is a son-of-yacc (yayacc?)
which allows us to specify textual error messages from within the grammar...
and which doesn't generate stupid reduce/reduce conflicts when we put
lots of error productions into the grammar!


-------------------------
Alan Heirich Comp. Sci. & Eng., Cognitive Science
C-014 University of California, San Diego 92093


heirich@cs.ucsd.edu
aheirich@ucsd.bitnet





Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.