Re: How to extract grammar from a program?

hunk@alpha1.csd.uwm.edu (Mark William Hopkins)
15 Feb 1999 23:30:46 -0500

          From comp.compilers

Related articles
How to extract grammar from a program? rahulj@iitk.ac.in (Rahul Jain) (1999-02-05)
Re: How to extract grammar from a program? cfc@world.std.com (Chris F Clark) (1999-02-10)
Re: How to extract grammar from a program? derekross@fisheracre.freeserve.co.uk (Derek Ross) (1999-02-12)
Re: How to extract grammar from a program? eodell@pobox.com (1999-02-15)
Re: How to extract grammar from a program? dib@dera.gov.uk (David Bruce) (1999-02-15)
Re: How to extract grammar from a program? spencer@cc.gatech.edu (1999-02-15)
Re: How to extract grammar from a program? hunk@alpha1.csd.uwm.edu (1999-02-15)
Re: How to extract grammar from a program? hunk@alpha1.csd.uwm.edu (1999-02-15)
Re: How to extract grammar from a program? x@wins.uva.nl (1999-12-09)
| List of all articles for this month |

From: hunk@alpha1.csd.uwm.edu (Mark William Hopkins)
Newsgroups: comp.compilers
Date: 15 Feb 1999 23:30:46 -0500
Organization: University of Wisconsin - Milwaukee, Computing Services Division
References: 99-02-025 99-02-049
Keywords: parse

>There are formal theories that address this. However, their results
>are far from encouraging. The essential problem is that given a finite
>set of programs, there is a trivial regular expression which recognizes
>exactly those set of programs and no others.


Actually, this isn't the problem. The problem is that the approach
taken doesn't treat the problem as a statistical inference problem but
instead as a logical inference problem. It's the latter approach that
is characterized by the inability to avoid degenerating into the
result you cited:


>Namely, regexp: program-1 | program-2 | ... | program-n;


that is, thus, characterized by the fact that this even presents a problem.


In practice, it should not be much more difficult to statistically
model a set of structured inputs by a system of productions (Chomsky
Normal form A -> B C would be ideal) and in constructing any other
kind of statistical model.


For example: the best split of a given A into B C would be given by a
simple criterion, such as that of maximizing the product of
frequencies #(B) #(C) over all combinations for which A = BC. A more
general approach might be based on minimizing some
information-theoretic measure based on the sample set.


Actually, the method would serve as a compression method too. It's
really nothing more than a generalization of the Lempel-Ziv algorithm.


The hard part isn't the approach. It's the Machine Bottlenecks: its
Lack of Parallelism and its Lack of Memory, which humans have in
abundance. The approach for inducting the rules A -> B C is going to
take up a huge amount of memory because of all the table-keeping over
the combinations (it's k N^3, where N is the size of the largest
sample and k the number of samples). The number of samples that'll
need to be incorporated is large -- which gets to the other
bottleneck: the machine can only read and process one at a time, and
(even worse) only one step at a time, because it has no parallelism.


So methods are not an issue. The primitive state of current hardware
technology is.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.