Related articles:
Parsing Cobol with yacc and lex ejp@bohra.cpg.oz.au (1991-05-06)
Re: Parsing Cobol with yacc and lex kgs@dvncnms.cnms.dev.unisys.com (1991-05-15)
Re: Parsing Cobol with yacc and lex ejp@bohra.cpg.oz.au (1991-05-17)
Newsgroups: comp.compilers
From: kgs@dvncnms.cnms.dev.unisys.com (Kenneth G. Salter)
Summary: Parsing COBOL with yacc seems feasible.
Keywords: Cobol, parse, yacc
Organization: Compilers Central
References: <9105060154.AA04463@bohra.cpg.oz.au>
Date: Wed, 15 May 91 00:07:51 GMT
This is my first post. I'd like to comment on a response by Esmond Pitt
to a post by Carlos E. Galarce. While building a COBOL compiler is a
large task, prospects are not as bleak as one might think.
In article <9105060154.AA04463@bohra.cpg.oz.au>, ejp@bohra.cpg.oz.au (Esmond Pitt) writes:
> Cobol is neither regular, context-free, nor LR(k) for any k. This makes
> use of lex and yacc highly problematic. Discussion follows.
Ignoring semantics and focusing just on parsing, I think that the
Identification, Environment, and Data divisions are regular, and the
Procedure division is LR(1).
> Lex: At least four scanning modes are required.
> In addition to the default (normal) mode you need:
> (a) Comment-entry mode. Comment-entries in the Identification Division
> (AUTHOR etc) have their own lexical rules. ...
> (b) PICTURE mode. X(120) is either a single PICTURE-string token or
> 4 tokens representing an indexed identifier, depending on context.
> PICTURE mode is triggered by a preceding PIC(TURE)?(IS)?).
Lex should categorize the words it scans with enough granularity that
yacc is not confused. Candidate tokens are: DN_only, PIC_only,
DN_or_PIC, INT_or_PIC, etc. Lex should recognize these tokens
regardless of context. Context is a parser function. Yacc will
need productions of the form:
DataName : DN_only | DN_or_PIC ;
PIC_string : PIC_only | DN_or_PIC | INT_or_PIC ;
> (c) DECIMAL-POINT IS COMMA mode. This phrase changes the format of
> numeric literals.
Rather than try to hard-code the definition of a number with a regular
expression, a word that looks like a number can trigger a lex function
that scans carefully, being sensitive to whether DECIMAL-POINT IS COMMA
has appeared.
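A minimal sketch of such a scanning function, in Python rather than as
a lex action. The name scan_number and the decimal_comma flag are
hypothetical, and the literal syntax is simplified to an optional sign,
digits, and at most one decimal point that must be followed by a digit.

```python
import re


def scan_number(text, decimal_comma=False):
    """Scan a numeric literal from the start of text.

    Returns (literal, rest_of_text), or None if text does not start
    with a number. decimal_comma stands for "DECIMAL-POINT IS COMMA
    has appeared".
    """
    point = re.escape("," if decimal_comma else ".")
    # Digits with an optional fraction, or a fraction alone. The decimal
    # point is only accepted when a digit follows it, so a trailing
    # period or comma is left for the caller as a separator.
    pattern = r"[+-]?(?:[0-9]+(?:%s[0-9]+)?|%s[0-9]+)" % (point, point)
    m = re.match(pattern, text)
    if m is None:
        return None
    return m.group(0), text[m.end():]
```

The same input then scans differently depending on the flag: with the
flag set, "3,14" is one literal; without it, the scan stops after "3"
and the comma is left in the remaining text.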
> The whole lexical process is greatly complicated by the rules for
> continued identifiers, numeric literals and alpha literals. Also, you
> have to lexically ignore sequence-number area and the area to the right of
> margin R (which, incidentally, is undefined except by universal
> agreement).
You could front-end lex with a filter that erases columns 1-6 and 73-80,
and that replaces comment lines by blank lines. Or, you might write an
input() function for lex to do the above and to merge continued lines into
a single line so that the parser need not deal with continuation.
This input() might also solve the column sensitivity problem by prefixing
non-blank text appearing in Areas A and C with distinguished characters.
All of this has to be done in a way that does not result in loss of the
actual source line numbers.
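The simpler of the two options, the stand-alone filter, might look
roughly like this in Python. The function name filter_fixed_form is
made up for the example, and treating "*" and "/" in column 7 as the
comment indicators is an assumption about the dialect. Continuation
merging and the Area A prefixing are left out. Columns 1-6 are
overwritten with blanks rather than deleted so that the remaining text
keeps its column positions.

```python
def filter_fixed_form(lines):
    """Blank out columns 1-6 and 73-80 and empty out comment lines.

    Exactly one output line is produced per input line, so the
    original source line numbers are preserved.
    """
    out = []
    for line in lines:
        text = line.rstrip("\r\n")[:72]      # drop columns 73 and beyond
        indicator = text[6:7]                # column 7; "" on short lines
        if indicator in ("*", "/"):          # comment line (assumed markers)
            out.append("")
            continue
        # Replace the sequence-number area with blanks, keep the rest.
        out.append((" " * 6 + text[6:]).rstrip())
    return out
```

Because the output list has the same length as the input, an error
reported against line n of the filtered text refers to line n of the
original source.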
> The rules about Area A (indentation) are formally unnecessary except when
> looking for the end of a comment-entry. You still need to enforce these
> rules, however ...
See above.
> COPY REPLACING and REPLACE have their own significant peculiarities.
> Yacc: Cobol is not LR(k) for any fixed k because of:
> (a) the WITH DATA phrase
> (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,
> both of which are scope terminators requiring arbitrary lookahead, ...
These phrases are no more complex than IF ... THEN ... ELSE, and not
much more complicated than parenthesized expressions.
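To make the IF ... ELSE analogy concrete, here is a small
recursive-descent sketch in Python. It is not yacc and not a real COBOL
grammar: the token names AT_END and NOT_AT_END assume the scanner has
already folded those multi-word phrases into single tokens, and READ
and DISPLAY are the only statements. Each optional phrase is claimed by
the innermost READ still open, the same rule that binds an ELSE to the
nearest IF.

```python
STATEMENT_STARTS = ("READ", "DISPLAY")


def parse_statements(toks, i):
    """Parse zero or more statements starting at toks[i]."""
    stmts = []
    while i < len(toks) and toks[i] in STATEMENT_STARTS:
        node, i = parse_statement(toks, i)
        stmts.append(node)
    return stmts, i


def parse_statement(toks, i):
    """Parse one statement; returns (tree, next_index)."""
    if toks[i] == "DISPLAY":
        return ("DISPLAY", toks[i + 1]), i + 2
    # READ file [AT_END stmts] [NOT_AT_END stmts] [END-READ]
    name = toks[i + 1]
    i += 2
    at_end = not_at_end = None
    if i < len(toks) and toks[i] == "AT_END":
        at_end, i = parse_statements(toks, i + 1)
    if i < len(toks) and toks[i] == "NOT_AT_END":
        not_at_end, i = parse_statements(toks, i + 1)
    if i < len(toks) and toks[i] == "END-READ":
        i += 1                      # explicit scope terminator
    return ("READ", name, at_end, not_at_end), i
```

Given READ F1 AT_END READ F2 AT_END DISPLAY A NOT_AT_END DISPLAY B,
the NOT_AT_END phrase ends up on the inner READ F2. Putting END-READ
after DISPLAY A closes the inner READ, and the phrase then attaches to
READ F1.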
> (c) the syntax of abbreviated combined relational conditions.
> In one form of these the noise-word IS becomes syntactically
> significant, contrary to one of the stated objectives of Cobol-85.
> Other problems:
>
> (a) Yes, the grammar is enormous. Cobol-85 has over 400 reserved words.
> (b) The syntax for the I/O statements (READ, REWRITE, WRITE,
> DELETE, ...) is dependent on the ACCESS MODE of the file named.
> Depending on the access mode, either an INVALID KEY or an AT END
> phrase is the legal syntactic continuation. This gets important in
> COBOL-85 with the arbitrary nesting allowed; otherwise your parser
> will tie e.g. an INVALID KEY phrase to the closest READ statement
> instead of the one it really belongs to, and completely mess up the
> syntactic scope.
Good point. I'd have missed this. For initial efforts, I'd ignore it,
reasoning that the COBOL programmer can avoid the problem by delimiting
each READ with an END-READ. Long term, you'd need yacc actions to
straighten out the parse tree as soon as yacc reduces INVALID KEY, etc.
I haven't taken the time to address all the issues, just the easier ones.
Hope this helps.
--
Kenneth G. Salter, Unisys Corporation