Re: Parsing Cobol with yacc and lex

kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter)
Wed, 15 May 91 00:07:51 GMT

          From comp.compilers

Related articles
Parsing Cobol with yacc and lex ejp@bohra.cpg.oz.au (1991-05-06)
Re: Parsing Cobol with yacc and lex kgs@dvncnms.cnms.dev.unisys.com (1991-05-15)
Re: Parsing Cobol with yacc and lex ejp@bohra.cpg.oz.au (1991-05-17)
| List of all articles for this month |
Newsgroups: comp.compilers
From: kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter)
Summary: Parsing COBOL with yacc seems feasible.
Keywords: Cobol, parse, yacc
Organization: Compilers Central
References: <9105060154.AA04463@bohra.cpg.oz.au>
Date: Wed, 15 May 91 00:07:51 GMT

This is my first post. I'd like to comment on a response by Esmond Pitt
to a post by Carlos E. Galarce. While building a COBOL compiler is a
large task, prospects are not as bleak as one might think.


In article <9105060154.AA04463@bohra.cpg.oz.au>, ejp@bohra.cpg.oz.au (Esmond Pitt) writes:
> Cobol is neither regular, context-free, nor LR(k) for any k. This makes
> use of lex and yacc highly problematic. Discussion follows.


Ignoring semantics and focusing just on parsing, I think that the
Identification, Environment, and Data divisions are regular, and the
Procedure division is LR(1).


> Lex: At least four scanning modes are required.
> In addition to the default (normal) mode you need:


> (a) Comment-entry mode. Comment-entries in the Identification Division
> (AUTHOR etc) have their own lexical rules. ...


> (b) PICTURE mode. X(120) is either a single PICTURE-string token or
> 4 tokens representing an indexed identifier, depending on context.
> PICTURE mode is triggered by a preceding PIC(TURE)?(IS)?).


Lex should categorize the words it scans with enough granularity that
yacc is not confused. Candidate tokens are: DN_only, PIC_only,
DN_or_PIC, INT_or_PIC, etc. Lex should recognize these tokens
regardless of context. Context is a parser function. Yacc will
need productions of the form:
DataName : DN_only | DN_or_PIC ;
PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;


> (c) DECIMAL-POINT IS COMMA mode. This phrase changes the format of
> numeric literals.


Rather than try to hard-code the definition of a number with a regular
expression, a word that looks like a number can trigger a lex function
that scans carefully, being sensitive to whether DECIMAL-POINT IS COMMA
has appeared.


> The whole lexical process is greatly complicated by the rules for
> continued identifiers, numeric literals and alpha literals. Also, you
> have to lexically ignore sequence-number area and the area to the right of
> margin R (which, incidentally, is undefined except by universal
> agreement).


You could front-end lex with a filter that erases columns 1-6 and 73-80,
and that replaces comment lines by blank lines. Or, you might write a
input() function for lex to do the above and to merge continued lines into
a single line so that that the parser need not deal with continuation.
This input() might also solve the column sensitivity problem by prefixing
non-blank text appearing in Areas A and C with distinguished characters.
All of these has to be done in a way that does not result in loss of the
actual source line numbers.


> The rules about Area A (indentation) are formally unnecessary except when
> looking for the end of a comment-entry. You still need to enforce these
> rules, however ...


See above.


> COPY REPLACING and REPLACE have their own significant peculiarities.


> Yacc: Cobol is not LR(k) for any fixed k because of:
> (a) the WITH DATA phrase
> (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
> both of which are scope terminators requiring arbitrary lookahead, ...


These phrases are no more complex than IF ... THEN ... ELSE, and not
much more complicated than parenthesized expressions.


> (c) the syntax of abbreviated combined relational conditions.
> In one form of these the noise-word IS becomes syntactically
> significant, contrary to one of the stated objectives of Cobol-85.


> Other problems:
>
> (a) Yes, the grammar is enormous. Cobol-85 has over 400 reserved words.


> (b) The syntax for the I/O statements (READ, REWRITE, WRITE,
> DELETE, ...) is dependent on the ACCESS MODE of the file named.
> Depending on the access mode, either an INVALID KEY or an AT END
> phrase is the legal syntactic continuation. This gets important in
> COBOL-85 with the arbitrary nesting allowed; otherwise your parser
> will tie e.g. an INVALID KEY phrase to the closest READ statement
> instead of the one it really belongs to, and completely mess up the
> syntactic scope.


Good point. I'd have missed this. For initial efforts, I'd ignore it,
reasoning that the COBOL programmer can avoid the problem by delimiting
each READ with an END-READ. Long term, you'd need yacc actions to
straighten out the parse tree as soon as yacc reduces INVALID KEY, etc.


I haven't taken the time to address all the issues, just the easier ones.
Hope this helps.
--
Kenneth G. Salter, Unisys Corporation


--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.