Re: How to extract grammar from a program?

Chris F Clark <>
10 Feb 1999 18:13:45 -0500

          From comp.compilers

Related articles
How to extract grammar from a program? (Rahul Jain) (1999-02-05)
Re: How to extract grammar from a program? (Chris F Clark) (1999-02-10)
Re: How to extract grammar from a program? (Derek Ross) (1999-02-12)
Re: How to extract grammar from a program? (1999-02-15)
Re: How to extract grammar from a program? (David Bruce) (1999-02-15)
Re: How to extract grammar from a program? (1999-02-15)
Re: How to extract grammar from a program? (1999-02-15)
Re: How to extract grammar from a program? (1999-02-15)
[1 later articles]
| List of all articles for this month |

From: Chris F Clark <>
Newsgroups: comp.compilers
Date: 10 Feb 1999 18:13:45 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 99-02-025
Keywords: parse

Rahul Jain <> writes:
> I'm working on a project for which I need some information about some
> reverse engineering method that would help me extract the grammar from a
> set of programs (written in any language). A sufficient grammar will be
> the one which is able to parse all the programs.
> Now, the question is - Does there exist some formal theory for getting
> the grammar from a program. Any heuristic approaches would also solve the
> purpose.

There are formal theories that address this. However, their results
are far from encouraging. The essential problem is that given a finite
set of programs, there is a trivial regular expression which recognizes
exactly those set of programs and no others. Namely,
regexp: program-1 | program-2 | ... | program-n;

That doesn't tell you very much. Trying to extract where the set
describes a larger language (i.e. where to put in * (Kleene Closure)
operators for repetitions) is very difficult. If you really want the
theory, I think the topic you are looking for is called "machine
learning" and you'll find it covered in some of the journals with a
more mathematical bent (say SIAM Journal). Many of the results are
negative, showing what you can't learn, but there are positive results
too, showing that under certain conditions one can find general

At a pragmatic level, if you just have a set of programs in some
programming language, say "PLQ", that you don't know the grammar for.
It is probably possible to construct a compiler for it by assuming
that it is syntactically similar to all other programming languages
with only a few wrinkles and thus taking some other language
translator and tweaking it until the set of programs in question goes
through. The "big" differences in many programming language syntaxes
tend to be things like whether if and loop statements take statements
or blocks (and whether there is a closing "fi" at the end) and whether
the type comes before the variable (int *c) or after (var c: ^int).
The rest comes down to fiddling with the expression operators and
their precedences, getting the comment syntax right, and having the
right set of reserved words. Not that those nits can't be both
annoying and significant.

However, I would be hard pressed to write an *automatic* recognizer for
grammars unless the languages belonged to a small set, say the
C-family or Pascal derivatives or FORTRAN dialects; as the differences
between languages in the C-family to FORTRAN or COBOL or ML or Scheme
are sufficient to inviolate all of my statements above. Even within a
language family it is hard enough, which is why a lot of reverse
engineering tools use "fuzzy parsing" or similar techniques which
essentially means they accept a superset of the dialects they wish to
parse and ignore the differences between the languages.

Note that I have not addressed semantics. There the differences in
languages is more pronounced. If you really want to compile and
execute those programs, you will still have a big task even if you
parse them accurately, because the big difference in most programming
languages is not in their syntax but in their semantics and run-time
libraries. A great example of this is the Pascal-to-C translators.
Most of the translation is nearly trivial, except for things like
sets, strings, variant records, pointers, and I/O. There the
translation gets very tortured because the semantics of the two
languages don't match and it is impossible for the translator to tell
whether the program is depending upon the subtlety or whether a more
coarse but natural translation would suffice.

Hope this helps,

Chris Clark Internet :
Compiler Resources, Inc. CompuServe : 74252,1375
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
Web Site in Progress: Web Site :

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.