SGML to YACC

j_mcarthur@BIX.com (Jeffrey McArthur)
Thu, 26 May 1994 02:45:59 GMT

          From comp.compilers

Related articles
SGML to YACC j_mcarthur@BIX.com (1994-05-26)
Re: SGML to YACC vosse@ruls41.LeidenUniv.nl (1994-05-30)
Re: SGML to YACC glyph@netcom.com (1994-05-31)
Anouncement: ETO, an object oriented universal syntax checker china@bernina.ethz.ch (1994-06-03)
| List of all articles for this month |
Newsgroups: comp.compilers
From: j_mcarthur@BIX.com (Jeffrey McArthur)
Keywords: parse, question
Organization: Galahad
Date: Thu, 26 May 1994 02:45:59 GMT

I have been working on a strange project involving Yacc and SGML. What I
want to be able to do is compile a SGML Document Tag Definition into a
YACC grammar.


I have done some of this. A lot of SGML is straightforward. But then I
come to what SGML fondly refers to as exceptions. For any tag/token there
is a set of context dependant exceptions to the grammar. I get a bit lost
on how to easilly reduce this to Yacc's grammar.


For example:


<!ELEMENT foo - - (a, b*, c) >


Is the declaration for the tag/token foo. It says that foo is defined as
the sequence of tags/tokens: exactly one <a> token, any number of <b>
tokens, followed by one <c> token. SGML uses the same regular expression
syntax as Lex.


I convert the above declaration into this:


sy_foo : sy_a b_ast sy_c
              ;


b_plus : sy_b
              | b_plus sy_b
              ;


b_ast : /* empty */
              | b_plus
              ;


The problem is exceptions make the whole thing a lot more complicated.
For example:


<!ELEMENT foo - - (a, b*, c) +(d | e) >


Means that the tags/tokens <d> and <e> can occur anywhere. My first pass
was to do something like this:




sy_foo : foo_inc_ast sy_a foo_inc_ast b_ast foo_inc_ast sy_c
foo_inc_ast
              ;


b_plus : sy_b
              | b_plus sy_b
              ;


b_ast : /* empty */
              | b_plus
              ;


sy_foo_inc : sy_d
                | sy_e
                ;


foo_inc_plus : sy_foo_inc
                          | foo_inc_plus sy_foo_inc
                          ;


foo_inc_ast : /* empty */
                          | foo_inc_plus
                          ;


Unfortunately that does not work. Since the sequence:


    <a><b><d><b><b><c>


should also be allowed inside of <foo>. This means the grammar actually
has to be something like this:


sy_foo : foo_inc_ast sy_a foo_inc_b_ast sy_c foo_inc_ast
              ;


sy_foo_inc_b : sy_foo_inc
                          | sy_b
                          ;


foo_inc_b_plus : sy_foo_inc_b
              | foo_inc_b_plus sy_foo_inc_b
              ;


foo_inc_b_ast : /* empty */
              | foo_inc_b_plus
              ;


sy_foo_inc : sy_d
                | sy_e
                ;


foo_inc_plus : sy_foo_inc
                          | foo_inc_plus sy_foo_inc
                          ;


foo_inc_ast : /* empty */
                          | foo_inc_plus
                          ;


Then of course there are the problem of the exclusion exceptions. But I
think I have presented the idea.


The problem I am having is the grammar grows almost exponentially when
exceptions are fully implemented. Or am I missing some simple construct?


----
        Jeffrey McArthur ATLIS Publishing
        phone: (301) 210-6655 12001 Indian Creek Court
        fax: (301) 210-4999 Beltsville, MD 20705
        home: (410) 290-6935 email: j_mcarthur@bix.com
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.