Parsing Cobol with yacc and lex

ejp@bohra.cpg.oz.au (Esmond Pitt)
Mon, 6 May 91 11:53:45 EST

          From comp.compilers

Newsgroups: comp.compilers
From: ejp@bohra.cpg.oz.au (Esmond Pitt)
Keywords: Cobol, parse, yacc
Organization: Compilers Central
Date: Mon, 6 May 91 11:53:45 EST

In article <1009@ima.ISC.COM> ceg@edsdrd.uucp (Carlos E. Galarce) writes:


> I am looking for a computer readable Cobol grammar. It would
> be nice if there is lex and/or yacc code to handle the grammar.


> [Is Cobol amenable to lex scanning? Seems to me that with some hackery for
> the indentation rules, it should be. Yacc parsing should be straightforward,
> though the grammar with all of the different keywords and reserved words
> would be enormous. -John]


Cobol is neither regular, context-free, nor LR(k) for any k. This makes
the use of lex and yacc highly problematic. Discussion follows.


Lex: At least four scanning modes are required.
In addition to the default (normal) mode you need:


        (a) Comment-entry mode. Comment-entries in the Identification Division
        (AUTHOR etc) have their own lexical rules. (What they are is left as an
        exercise for the reader.)


        (b) PICTURE mode. X(120) is either a single PICTURE-string token or
        four tokens representing an indexed identifier, depending on context.
        PICTURE mode is triggered by a preceding PIC(TURE)? (IS)?.


        (c) DECIMAL-POINT IS COMMA mode. This phrase changes the format of
        numeric literals.
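
To make modes (b) and (c) concrete, here is a minimal sketch of a
mode-switched scanner in Python rather than lex. It covers only a tiny
subset of the lexical rules; the token names, the tokenize function and
its decimal_comma flag are all invented for illustration, and a real
scanner would set the flag itself on seeing DECIMAL-POINT IS COMMA.

```python
import re

# Default-mode patterns (a deliberately tiny, illustrative subset).
WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")
PUNCT = re.compile(r"[().,;]")
# In PICTURE mode the whole character-string up to whitespace is one token.
PICTURE_STRING = re.compile(r"\S+")


def number_pattern(decimal_comma):
    """Numeric literal; the decimal mark depends on DECIMAL-POINT IS COMMA."""
    mark = "," if decimal_comma else r"\."
    return re.compile(r"[0-9]+(?:" + mark + r"[0-9]+)?")


def tokenize(text, decimal_comma=False):
    rules = (("NUMBER", number_pattern(decimal_comma)),
             ("WORD", WORD),
             ("PUNCT", PUNCT))
    tokens = []
    mode = "normal"  # "normal", "after_pic" (optional IS allowed) or "picture"
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        if mode == "after_pic":
            m = WORD.match(text, pos)
            mode = "picture"
            if m and m.group().upper() == "IS":
                tokens.append(("WORD", m.group()))
                pos = m.end()
                continue
        if mode == "picture":
            s = PICTURE_STRING.match(text, pos).group()
            # A trailing separator ends the entry; it is not part of the string.
            if len(s) > 1 and s[-1] in ".,;":
                s = s[:-1]
            tokens.append(("PICTURE", s))
            pos += len(s)
            mode = "normal"
            continue
        # Normal mode: longest match wins, earlier rule wins ties (as in lex).
        best = None
        for kind, pattern in rules:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (kind, m)
        if best is None:
            raise ValueError("cannot scan at position %d" % pos)
        kind, m = best
        tokens.append((kind, m.group()))
        pos = m.end()
        if kind == "WORD" and m.group().upper() in ("PIC", "PICTURE"):
            mode = "after_pic"
    return tokens
```

With this sketch, tokenize("MOVE X(120) TO Y") scans X(120) as four
tokens, while tokenize("05 X PIC IS X(120).") returns it as one PICTURE
token. In lex itself the same effect would normally be had with start
conditions.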


The whole lexical process is greatly complicated by the rules for
continued identifiers, numeric literals and alpha literals. Also, you
have to lexically ignore the sequence-number area and the area to the right of
margin R (which, incidentally, is undefined except by universal
agreement).
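
As a rough illustration of the column handling, the following Python
sketch classifies one source line. It assumes the conventional card
layout (sequence-number area in columns 1-6, indicator in column 7,
Area A in columns 8-11, margin R after column 72); the function name and
the returned labels are made up for this example, and the continuation
rules themselves are not attempted.

```python
def classify(line):
    """Split one fixed-format source line into (kind, body).

    Assumed layout (conventional, not taken from any one compiler):
    columns 1-6 sequence number, column 7 indicator, columns 8-72 text,
    anything after column 72 ignored.
    """
    padded = line.rstrip("\r\n").ljust(72)
    indicator = padded[6]            # column 7
    body = padded[7:72].rstrip()     # columns 8-72, leading blanks kept
    if indicator in "*/":
        return ("comment", body)
    if indicator == "-":
        # The scanner still has to splice this onto the previous line.
        return ("continuation", body)
    # Text starting in columns 8-11 begins in Area A.
    if body[:4].strip():
        return ("area_a", body)
    return ("area_b", body)
```

A pre-pass like this only removes the ignorable areas; joining a
continued identifier or literal across lines still has to happen inside
the scanner proper.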


The rules about Area A (indentation) are formally unnecessary except when
looking for the end of a comment-entry. You still need to enforce these
rules, however ...


COPY REPLACING and REPLACE have their own significant peculiarities.


Yacc: Cobol is not LR(k) for any fixed k because of:


        (a) the WITH DATA phrase,


        (b) the NOT {AT END/INVALID KEY/ON OVERFLOW/SIZE ERROR} phrases,


both of which are scope terminators requiring arbitrary lookahead, and


        (c) the syntax of abbreviated combined relational conditions.
        In one form of these the noise-word IS becomes syntactically
        significant, contrary to one of the stated objectives of Cobol-85.


Other problems:


        (a) Yes, the grammar is enormous. Cobol-85 has over 400 reserved words.


        (b) The syntax for the I/O statements (READ, REWRITE, WRITE,
        DELETE, ...) is dependent on the ACCESS MODE of the file named.
        Depending on the access mode, either an INVALID KEY or an AT END
        phrase is the legal syntactic continuation. This becomes important
        in Cobol-85, which allows arbitrary nesting: unless the parser takes
        the access mode into account, it will tie e.g. an INVALID KEY phrase
        to the closest READ statement instead of the one it really belongs
        to, and completely mess up the syntactic scope.
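
The binding problem in (b) can be sketched as follows in Python. This
is a simplification under stated assumptions: the file names and the
ACCESS_MODE table are hypothetical, only READ is considered, and each
access mode is assumed to admit exactly one phrase (INVALID KEY for
random access, AT END for sequential access).

```python
# Hypothetical symbol table, as it might be built from the SELECT clauses.
ACCESS_MODE = {"IDX-FILE": "RANDOM", "SEQ-FILE": "SEQUENTIAL"}

# Simplifying assumption: one legal exception phrase per access mode.
LEGAL_PHRASE = {"RANDOM": "INVALID KEY", "SEQUENTIAL": "AT END"}


def owner_of_phrase(open_reads, phrase):
    """Return the index in open_reads of the READ that `phrase` belongs to.

    open_reads lists the file names of the READ statements not yet
    terminated, outermost first. A leading NOT is ignored when matching.
    """
    kind = phrase[4:] if phrase.startswith("NOT ") else phrase
    # Search from the innermost READ outwards for one that accepts the phrase.
    for index in range(len(open_reads) - 1, -1, -1):
        mode = ACCESS_MODE[open_reads[index]]
        if LEGAL_PHRASE[mode] == kind:
            return index
    raise ValueError("no open READ accepts " + phrase)
```

Given a READ SEQ-FILE nested inside a READ IDX-FILE, a following NOT
INVALID KEY phrase is assigned to the outer statement (index 0) rather
than to the closest one, which is the behaviour a purely syntactic
grammar cannot get right without the symbol-table lookup.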


        (c) Some forms of the PERFORM statement lead to ambiguities.


Although none of these are blocking difficulties, it should be clear
that lex/yacc-ing Cobol is a highly non-trivial task.


Best regards,


--
Esmond Pitt, Computer Power Group
ejp@bohra.cpg.oz
--

