Re: How to construct a grammar?


From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: Sat, 05 Jul 2008 11:38:18 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 08-07-011
Keywords: parse, design
Posted-Date: 05 Jul 2008 13:03:50 EDT

Michael Schuerig <michael@schuerig.de> writes:


> So, I'm wondering, is there a systematic way to get from a suitable
> list of language samples, i.e, without a formal definition to start
> with, to a grammar?


There are two answers to that question: a theoretical, automated
answer and a practical, hand-done answer. I'm going to start with the
answer I suspect you care about less, the automated theoretical one,
and dispense quickly with it. Taking a set of examples and generating
a grammar from it using an algorithm is a field of research, which I
think is called roughly "machine learning". There are papers in that
field giving both positive and negative results, saying what you can
and cannot make a machine learn from a set of examples. The results
are mostly the kinds of things one would expect. As I recall: you can
learn finite regular-expression grammars with no closure operators
(a* or a+) from positive examples only (i.e. "this is legal"). Adding
negative examples (i.e. "these are illegal"), you can also learn
regular expressions that include closures. I think it's been shown
that there are some CFGs you cannot learn from positive and negative
examples alone; you can only learn a regular expression that matches
a superset of the CFG. I cannot recall where the deterministic CFGs
fit in the learnability spectrum. But, again, I don't think this is
your real question.


In a practical sense, it is possible, and usually even easy, to
construct a grammar by hand from examples, as long as one has some
concept of the language. The easiest (and actually most common) case
is that you have a grammar for a similar language accessible. If so,
you are only trying to cover the differences. However, since that
case is covered by the general case, I will skip it and assume you
are starting from "first principles".


The first thing one needs is a "vocabulary", a list of things one is
talking about. This naturally breaks down into two parts. You can
do either part first, so I'm picking an order that is convenient for
me to talk about.


The first part of the vocabulary is the tokens or terminals. These
are the atomic parts of your grammar. You probably have identifiers
and keywords and numbers and operators and whitespace and similar
kinds of things in your grammar. So, make a list of all these things
and find examples of each of them. Don't worry about exactly how to
define them yet; just assume that you will be able to write a lexer
that can find them. When you are done with the whole process of
defining your language grammar, you can roughly repeat the process
for the tokens and get a lexical grammar for them. More importantly,
it is even more likely that you will be able to find an existing
lexer grammar for your set of tokens than it is to find a grammar
that describes your whole language. In fact, as I define new
grammars for the new small tools that I design from time to time, I
essentially use one lexer grammar, with some small permutations, for
all the tools. It is so trivial that I never even think about the
minor variations that I introduce. They are just "obvious", e.g. the
list of keywords changes.
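
For instance, a first cut at that token list might look like the
sketch below. The notation here is lex-like and purely illustrative;
the exact syntax depends on which lexer tool you use.

                identifier: [a-zA-Z_][a-zA-Z0-9_]*
                number:     [0-9]+
                operators:  ":=" "=" "+" ";"
                keywords:   "if" "then"
                whitespace: [ \t\n]+   // usually skipped, not passed to the parser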


(Shameless plug here: I believe that is partially true because the
lexer notation we defined in Yacc++ is particularly natural for doing
that. We actually spent time thinking about that. However, we
aren't the only ones who have noticed it, and there are some tools
that essentially have pre-canned lexers built into them, which you
just customize to get the desired result. A lot of the scannerless
tools go this way, because you can then sweep the ugly parts into the
predefined sections and paper over them so users don't notice what
one has done. We don't do that, but we do hide some of the ugliness
in default behaviors one can override; e.g. certain error checks are
turned off by default at the lexical level (for example, shift-reduce
conflicts between tokens that end in a star, because the longest-match
rule implies always shifting), but one can turn them back on.)


In any case, you now have your list of atomic tokens. You then make
a list of the non-terminals, the structures your language is composed
of. You obviously have a "grammar", "program", or "file" non-
terminal, which represents the largest chunk you are parsing at a
time. That is your start symbol. Your language probably has
sub-groupings within that, like blocks, statements, and expressions.
Again, make a list of what they are and group the examples
accordingly.


Once you have this done, the process is simply a case of taking each
non-terminal, looking at the examples, and writing down the rule(s)
that derive those examples from the non-terminal. In the process,
you may find you need to introduce additional non-terminals or tokens
for things you missed, but that isn't a major issue. Let's look at
an example of doing that.


Non-terminal: statement
Examples:
                a := b + c;
                if x = y then a := 1;


First guess at rules:
                statement: variable ":=" variable "+" variable ";"
                         | if variable "=" variable then variable ":=" constant ";"
                         ;


Now, we probably quickly realize that we want to use an expression
non-terminal to capture the various forms that part can take, so we
change the grammar to look like this:


                statement: variable ":=" expression ";"
                         | if boolean_expression then variable ":=" expression ";"
                           // see the section after debugging for my comment
                           // on using boolean_expression here
                         ;


We may also decide that we want to distinguish our types of statements
(i.e. new non-terminals), resulting in the following grammar.


                statement: assignment_statement
                         | if_statement
                         ;
                assignment_statement: variable ":=" expression ";"
                                    ;
                if_statement: if boolean_expression then variable ":=" expression ";"
                            ;


Next, we may realize from the examples that if_statements take a
statement after the then clause, resulting in:


                if_statement: if boolean_expression then statement
                            ;


When you have finished that exercise, you have a grammar for your
language. Worth noting is that there are often certain things one
wants in a language (like lists separated by semicolons, or certain
phrases which are optional). Those things often have an idiomatic
expression in the grammar notation of choice. In Yacc++, we like
regular expressions for many of the idioms and have a tutorial
chapter in our documentation that is simply a cookbook of some of the
more common idioms and the corresponding regular expressions. In
normal yacc grammars, you will find the same idioms described by sets
of (often recursive) productions where some helper non-terminals are
added with names like expression_list; the _list says it is part of
an idiom describing lists of expressions. In either case, one learns
the idiomatic expressions and uses them, because they often have
"good properties". (BTW, I've never seen the corresponding yacc
cookbook (and have no interest in writing it), but if you read yacc
grammars, you will slowly notice the conventions.)
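
To make the list idiom concrete, here is a sketch of both styles.
The first form is EBNF-like (not any particular tool's exact syntax);
the second is the plain yacc form with a helper non-terminal:

                // regular-expression style
                expression_list: expression (";" expression)*
                               ;

                // plain yacc style, recursive helper non-terminal
                expression_list: expression
                               | expression_list ";" expression
                               ;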


This gets us to the next point. Above I said you have a grammar, but
the grammar may or may not be suitable for your tool. If you've
written lots of grammars and have developed an intuition for how to
avoid problems, it may be acceptable as is. However, you need to
test it by running the grammar through your tool and seeing if the
tool generates errors. If so, your grammar isn't acceptable to your
tool, and may even have flaws where it doesn't describe what you want
at all. Even if your grammar is acceptable to your tool, you need to
test that it is actually what you want, by building a small copy of
your application that just lexes and parses input and running
relevant examples through it to see if the examples are turned into
the right kind of parse tree (AST).
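
As a concrete illustration (the file names here are hypothetical),
with bison as the tool, that testing loop might look like:

                $ bison -v mylang.y    # -v writes mylang.output, listing any conflicts
                $ cc -o mylang mylang.tab.c lexer.c
                $ ./mylang < examples/statements.txt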


If your grammar isn't acceptable (or parses something incorrectly),
one has come to the hard part: debugging.


If the grammar parses something incorrectly, one needs to trace the
grammar, see where the parser has categorized something incorrectly,
and fix that rule. Sometimes the problem is that the non-terminal
you captured is context dependent, and in one context some phrase is
legal but in another context it is not. In that case, you may need
to split the non-terminal into two distinct non-terminals (one for
each context). Alternately, you can have the reverse problem, where
you have two different non-terminals that both cover some case, and
if the parser picks the wrong one to apply, it misses the possibility
of the other non-terminal and either mis-recognizes the result or
refuses to recognize a valid case. Note that if there are two
non-terminals that are "similar" (the rules for similarity differ for
the different grammar classes, e.g. LL versus LR) and can both apply
at a particular point in a parse, the tool will (should) catch this
problem and give you a "conflict" message. This message is to warn
you that the resulting parser may not do what you expect for some
inputs. (This is one of my arguments for using a tool over
hand-written recursive descent, even if you want recursive-descent
output. Tools can perform correctness checks to help you not make
mistakes. By hand, you have to find those problems "some other
way".)


Of course, there are more problems and fixes than just splitting and
combining rules--that's where the idiomatic expressions help. They
are often ways of writing a particular set of rules to describe a
particular language problem such that the tool will likely find it
acceptable. For example, the rules about left recursion (avoid it if
you want LL) and right recursion (it can be "expensive" at run time
in LR) are rules for avoiding common problems.
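
As a sketch of that particular pair of rules of thumb:

                // left recursive: natural for LR tools, not legal for LL tools
                expression: expression "+" term
                          | term
                          ;

                // the LL-friendly rewrite, using a helper non-terminal
                // (right recursion, which an LR parser would pile on its stack)
                expression: term expression_rest
                          ;
                expression_rest: "+" term expression_rest
                               | /* empty */
                               ;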


Debugging and changing your grammar to accept exactly the language
you want is hard. Grammars can do some things well and easily.
Other things can take bizarre, non-obvious sets of rules to describe.
And some things cannot be described in practical grammars at all.


In fact, the use of boolean_expression above was exactly such a case.
Grammars are not good at enforcing type systems. Sometimes one can
do it, but it takes lots of extra rules to do so. Other times it
cannot be done at all. Moreover, even when one can do it and is
willing to have the extra rules, the resulting grammar is fragile and
hard to modify. When one encounters such problem things in a
grammar, the rule of thumb is to make the grammar accept invalid
programs (generalize what you accept) and then add semantic checks to
your compiler that catch the invalid cases. Thus, in the above case
a better grammar would have been:


                if_statement: if expression then statement
                            ; // type check that expression returns a boolean
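
In a yacc-style tool, that semantic check typically lives in the
rule's action. Here is a sketch, where the uppercase token names,
the idea that expression's semantic value carries a type, and the
is_boolean_type helper are all assumptions about how your compiler is
organized:

                if_statement: IF expression THEN statement
                        {
                            /* $2 is expression's semantic value;
                               is_boolean_type is a hypothetical helper */
                            if (!is_boolean_type($2))
                                yyerror("if condition must be boolean");
                        }
                            ;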


Second shameless plug: a few years back I wrote for SIGPLAN Notices a
series of columns on resolving some of the grammar problems and
tricky cases. The columns try to capture some of the intuition and
rules of thumb that aren't always described in textbooks. (The
column died because I ran out of time to write it, but someday I may
get back to it, because I didn't run out of material.) If you are
seriously interested in writing grammars, I recommend reading those
columns; there are about 10 of them, and they are only about 5 pages
each. (They all read something like this post, because that's my
writing style, for better or worse.)


In any case,


Hope this helps,
-Chris


******************************************************************************
Chris Clark               Internet: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.  or:       compres@world.std.com
23 Bailey Rd              Web Site: http://world.std.com/~compres
Berlin, MA 01503          voice:    (508) 435-5016
USA                       fax:      (978) 838-0263 (24 hours)

