Re: Parsing techniques for generating sentences from grammar

adrian@dcs.rhbnc.ac.uk (A Johnstone)
19 Jul 1999 01:18:41 -0400

          From comp.compilers

Related articles
Parsing techniques for generating sentences from grammar amitr@abinfosys.com (Amit Rekhi) (1999-07-10)
Re: Parsing techniques for generating sentences from grammar jejones@microware.com (James Jones) (1999-07-14)
Re: Parsing techniques for generating sentences from grammar adrian@dcs.rhbnc.ac.uk (1999-07-19)
Re: Parsing techniques for generating sentences from grammar peter.r.wilson@boeing.com (Peter Wilson) (1999-07-28)
| List of all articles for this month |

From: adrian@dcs.rhbnc.ac.uk (A Johnstone)
Newsgroups: comp.compilers
Date: 19 Jul 1999 01:18:41 -0400
Organization: Royal Holloway, University of London
References: 99-07-036 99-07-050
Keywords: parse, testing

Funnily enough I am just playing around with some of these ideas for
the grammar analysis and maniplulation tool that we are currently
developing. The problem with depth-first (a la recursive descent)
generation of sentences is that you get sample strings that are very
much oriented to the left side of the grammar rules (assuming that you
are doing a left-most expansion). What may be less clear is that even
a breadth first search comes out with strings that look left-biased.
Consider the grammar


A -> BA | <epsilon>


B -> 'a' | 'b' | 'c' | 'd' | 'e' .


When I feed this in to my breadth-first generater I get


  <epsilon>
  a
  b
  c
  d
  a a
  a b
  a c
  a d
  b a
  b b


and so on. The number of sentential forms grows _very_ rapidly
(exponentially in the number of symbols in each production as shown by
the simple example above) and so any real machine rapidly runs out of
storage and you only ever see a very small sample of strings from near
the beginning of the sequence. Since the leftmost terminals are
cycling most rapidly, and since you don't get to see many strings, the
effect is that you don't get to see what happens as the rightmost ones
cycle.


To give you some concrete examples, I fed both the above grammar and
the ANSI-C grammar into my breadth-first generator and ran it until it
collapsed through lack of memory (on a poorly endowed PC with only
32MByte RAM running Windows-95, admittedly).


The simple grammar (above) generated 196,349 sentential forms of which
190,345 were terminal strings, none of which was more than nine
terminals long.


The ANSI-C grammar generated 145,006 sentential forms of
which only 4544 were terminal strings, none of which are more than six
terminals long.


The message here is that even simple languages are huge in the sense
of there being large numbers of short strings. (Someone with more
formal training than I have would probably cringe at this: after all
both of these languages are huge because they are infinite...)


As a result, any attempt to enumerate a language exhastively is futile
if what you want is to get out some of the long strings (say, ones of
a few thousand tokens in length: my empirical studies show that real C
programs have around 4-6 tokens per line on average depending on the
programmer).


This is where the probabilities come in. Imagine a depth first
traversal of the language that randomly selected one of the
productions for a nonterminal at each generation step. Here you have
got a decent chance of generating a long string before your memory
fills up with sentential forms. If you are interested in producing
strings that are similar to real-world programs then you could parse a
whole lot of code and generate frequencies to attach to each
production. These frequencies could then be used to guide the
expansion process. These kinds of grammar is called a probabilistic or
stochastic grammar. They were studied by some formal language people
in the seventies and eighties, often with a view to improving the
performance of error correcting compilers. Wetherell did a PhD on this
in the mid-1970's: the paper mentioned in the previous post is a very
readable review of the field in 1980. Natural Language Parsing people
also use stochastic grammars although I haven't studied any of that
stuff yet. I am presently adding a probabilistic depth-first
generator so that we can produce `random' lonmg strings from real
grammars, such as C. One fine day I'll add in a parser that collects
frequencies so that we can produce `human-like' strings.


The full reference for Wetherell's paper is


    C. S. Wetherell 'Probalistic languages: a review and some open questions'
    ACM Computing Surveys, Vol 12, Num 4 december 1980 pp 361-380.


If anybody wants to discuss these kinds of issues, or get access to our tools
then email me at the address below.


                                            Adrian


James Jones (jejones@microware.com) wrote:
: Seems to me that it would be easy to turn a recursive descent parser
: into a sentence generator of the sort desired; where the parser
: expects to see a token, it should instead print it. Wherever there's
: a * or +, you'd need some way to keep it from going on forever (flip a
: coin or extend the grammar syntax to let the user attach probabilities
: to alternatives, the latter being of wider applicability than just
: preventing endless recursion). Not that this is new stuff; check out
: Charles Wetherell's paper on the subject in a _Computing Surveys_
: issue back in the early 1980s.
--
Dr Adrian Johnstone, Senior Lecturer in Computing, Computer Science Dep,
Royal Holloway, University of London, Egham, Surrey, TW20 0EX, England.
Email a.johnstone@rhbnc.ac.uk Tel:+44(0)1784 443425 Fax:+44(0)1784 439786


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.