Re: Parsing Cobol with yacc and lex

ejp@bohra.cpg.oz.au (Esmond Pitt)
Fri, 17 May 1991 03:02:19 GMT

          From comp.compilers

Related articles
Parsing Cobol with yacc and lex ejp@bohra.cpg.oz.au (1991-05-06)
Re: Parsing Cobol with yacc and lex kgs@dvncnms.cnms.dev.unisys.com (1991-05-15)
Re: Parsing Cobol with yacc and lex ejp@bohra.cpg.oz.au (1991-05-17)

Newsgroups: comp.compilers
From: ejp@bohra.cpg.oz.au (Esmond Pitt)
Keywords: Cobol, parse, yacc
Organization: Software Division, Computer Power Group
References: <9105060154.AA04463@bohra.cpg.oz.au> <91-05-082@iecc.cambridge.ma.us>
Date: Fri, 17 May 1991 03:02:19 GMT

In article <91-05-082@iecc.cambridge.ma.us> kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter) writes:
[following-up my posting on COBOL]


> Lex should categorize the words it scans with enough granularity that
> yacc is not confused. Candidate tokens are: DN_only, PIC_only,
> DN_or_PIC, INT_or_PIC, etc. Lex should recognize these tokens
> regardless of context. Context is a parser function. Yacc will
> need productions of the form:
> DataName : DN_only | DN_or_PIC ;
> PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;


If I have understood correctly, this would mean that the DN_or_PIC rule
has to unpick the single token returned by yylex() into the dataname part,
the left parenthesis, the index, and the right parenthesis, and then parse
the result, all without benefit of lex, yacc, clergy, ... You can do it all
right, but the result is not exactly lex+yacc parsing.


It's more like keeping a dog and barking yourself. I prefer the scanner to
scan and the parser to parse. My own solution to this was to have the
parser switch Lex's start states whenever a picture-string is expected.
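
As a rough illustration of the start-state idea, here is a hand-written C
stand-in for such a scanner. It is only a sketch under assumptions: the token
names, scan_init(), expect_picture() and next_token() are all hypothetical,
and a real lex specification would use start conditions rather than a mode
variable. The point it shows is that the same text, A(5), comes back as four
tokens in the normal state and as one picture-string when the parser has
announced that a picture-string is expected.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token codes; in a real parser yacc would generate these. */
enum token { T_EOF, T_WORD, T_INT, T_LPAREN, T_RPAREN, T_PIC_STRING };

/* Scanner mode: a hand-rolled stand-in for a lex start state. */
enum mode { MODE_NORMAL, MODE_PICTURE };

static enum mode scan_mode = MODE_NORMAL;
static const char *cursor = "";   /* next unread input character */
static char lexeme[64];           /* text of the most recent token */

void scan_init(const char *src)
{
    assert(src != NULL);
    cursor = src;
    scan_mode = MODE_NORMAL;
}

/* Assumed to be called from the parser action that follows PIC or PICTURE. */
void expect_picture(void) { scan_mode = MODE_PICTURE; }

/* Copy len characters into lexeme, advance the cursor, return token t. */
static enum token take(size_t len, enum token t)
{
    if (len >= sizeof lexeme)
        len = sizeof lexeme - 1;  /* sketch only: over-long tokens get split */
    memcpy(lexeme, cursor, len);
    lexeme[len] = '\0';
    cursor += len;
    return t;
}

enum token next_token(void)
{
    size_t len = 0;
    int all_digits = 1;

    while (*cursor == ' ')
        cursor++;
    if (*cursor == '\0')
        return T_EOF;

    if (scan_mode == MODE_PICTURE) {
        /* Picture mode: the whole blank-delimited string is one token. */
        scan_mode = MODE_NORMAL;
        len = strcspn(cursor, " ");
        if (len > 1 && cursor[len - 1] == '.')
            len--;                /* leave a terminating period unread */
        return take(len, T_PIC_STRING);
    }

    if (*cursor == '(')
        return take(1, T_LPAREN);
    if (*cursor == ')')
        return take(1, T_RPAREN);

    /* Normal mode: a run of letters, digits and hyphens is a word or integer. */
    while (isalnum((unsigned char)cursor[len]) || cursor[len] == '-') {
        if (!isdigit((unsigned char)cursor[len]))
            all_digits = 0;
        len++;
    }
    if (len == 0)
        return take(1, T_WORD);   /* any other character is passed through */
    return take(len, all_digits ? T_INT : T_WORD);
}
```

With this arrangement the grammar never sees an ambiguous DN_or_PIC token:
the scanner still only scans, and the only context it needs is the one bit
the parser hands it.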


>> Yacc: Cobol is not LR(k) for any fixed k because of:
>> (a) the WITH DATA phrase
>> (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
>
> These phrases are no more complex than IF ... THEN ... ELSE, and not
> much more complicated than parenthesized expressions.


They are not _complicated_ at all, but they all contain optional
noisewords, and they all require lookahead of > 1 token beyond the RHS of
a production. You get reduce/reduce conflicts.
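
To make the lookahead problem concrete, consider (as an illustration of mine,
not an example from the standard) a parser that has just finished ADD A TO B
inside the AT END branch of a READ and now sees NOT. Whether that NOT opens
NOT ON SIZE ERROR for the ADD or NOT AT END for the READ cannot be decided
until the optional noiseword has been passed. One workaround sometimes used,
sketched below in C under the assumption that the input is already split into
words, is to do that lookahead outside yacc and hand the parser a single fused
token. The enum names and classify_not() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fused tokens a scanner could hand to yacc instead of bare NOT. */
enum not_token { NOT_PLAIN, NOT_END, NOT_INVALID, NOT_OVERFLOW, NOT_SIZE_ERROR };

struct phrase {
    const char *lead;        /* optional noiseword before the key, or NULL */
    const char *key;         /* word that identifies the phrase */
    const char *tail;        /* word after the key, or NULL */
    int tail_required;       /* ERROR is required; KEY is optional */
    enum not_token token;
};

static const struct phrase phrases[] = {
    { "AT", "END",      NULL,    0, NOT_END },
    { NULL, "INVALID",  "KEY",   0, NOT_INVALID },
    { "ON", "OVERFLOW", NULL,    0, NOT_OVERFLOW },
    { "ON", "SIZE",     "ERROR", 1, NOT_SIZE_ERROR },
};

/* True if word i exists and equals kw. */
static int is_word(const char *const *w, size_t n, size_t i, const char *kw)
{
    return i < n && strcmp(w[i], kw) == 0;
}

/*
 * w[i] must be "NOT". Looks at up to three following words and reports which
 * kind of NOT this is; *used receives the number of words in the phrase
 * (1 for a plain NOT, as in a condition).
 */
enum not_token classify_not(const char *const *w, size_t n, size_t i, size_t *used)
{
    assert(used != NULL && is_word(w, n, i, "NOT"));
    *used = 1;

    for (size_t p = 0; p < sizeof phrases / sizeof phrases[0]; p++) {
        const struct phrase *ph = &phrases[p];
        size_t j = i + 1;

        if (ph->lead != NULL && is_word(w, n, j, ph->lead))
            j++;                         /* skip the optional noiseword */
        if (!is_word(w, n, j, ph->key))
            continue;
        j++;
        if (ph->tail != NULL) {
            if (is_word(w, n, j, ph->tail))
                j++;
            else if (ph->tail_required)
                continue;
        }
        *used = j - i;
        return ph->token;
    }
    return NOT_PLAIN;
}
```

Once the parser receives NOT_SIZE_ERROR or NOT_END as single tokens, one token
of lookahead is enough to decide whether the inner statement is complete.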


The statement that COBOL is not LR(1) because of these elements was made
to me by a member of the ANSI X3.23-1985 committee. I don't have details
on-line; maybe I'll be able to follow up with these later.


There are solutions to these problems; the task has been accomplished
several times. I've done it myself.


My point was that the task is quite a bit more complex than just sitting
down and hacking out a lex & yacc script as you might expect, and as the
comp.compilers monthly message used to say. Most of this is because the
basic structures of the language date from before 1960, i.e. before
compiler theory really got going, and the subsequent revisions to the
standard have not really addressed compiler-theoretic issues.


--
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz

