Re: Language inference tool ?

glen herrmannsfeldt <gah@ugcs.caltech.edu>
Thu, 2 Dec 2010 21:59:34 +0000 (UTC)

From comp.compilers

Related articles
Language inference tool ? pocky6@gmail.com (Vivien Parlat) (2010-11-29)
Re: Language inference tool ? gneuner2@comcast.net (George Neuner) (2010-12-01)
Re: Language inference tool ? torbenm@diku.dk (2010-12-01)
*Re: Language inference tool ? gah@ugcs.caltech.edu (glen herrmannsfeldt)* (2010-12-02)**
Re: Language inference tool ? valeriocavadini@hotmail.com (Valerio Cavadini) (2010-12-04)
Re: Language inference tool ? gneuner2@comcast.net (George Neuner) (2010-12-05)

| List of all articles for this month |

From:	glen herrmannsfeldt <gah@ugcs.caltech.edu>
Newsgroups:	comp.compilers
Date:	Thu, 2 Dec 2010 21:59:34 +0000 (UTC)
Organization:	A noiseless patient Spider
References:	10-11-040 10-12-003
Keywords:	parse, tools
Posted-Date:	03 Dec 2010 08:00:58 EST

Torben Fgidius Mogensen <torbenm@diku.dk> wrote:
> Vivien Parlat <pocky6@gmail.com> writes:

>> My job requires me to work with a no-more-supported and unknown
>> language. I'm considering the idea of building tools in order to
>> facilitate the work with it, but defining its grammar manually could
>> be very tedious.

(snip)

> Grammar inference will not give you anything useful.

That was my first thought, but then I wasn't so sure.

For one, the usual programming languages use only a small subset of
the possible grammars. With the appropriate limitations, it might be
possible to get a good start on one.

> If you only show positive examples, the simplest grammar is the one
> that accepts any text string, and the most precise is the one that
> accepts the examples and nothing else.

OK, consider trying to find the C grammar from a reasonably sized
sample of C code. Most likely, it wouldn't be able to separate
library calls from statements, but maybe that isn't so bad.
Otherwise, it is statistical. If a certain word appears many times at
the beginning of a statement (line?) then it is likely a statement
keyword. In between the two limits, there are some that are likely to
accept other valid examples, and less likely to accept ones that
aren't valid.

> Anything in between these two extremes is extremely difficult to
> define, and I know of no tool that will give a useful result,
> especially if the grammar is inherently large.

Large is a problem, but that is exactly the case where an automated
system is most useful. How about COBOL for an example? How hard
would it be to extract the COBOL grammer from sample code? Is the
extraction program allowed to know that blanks are significant, and
that keywords are reserved? The more hints, the better the result is
likely to be.

> Also, even if you manage to infer a grammar, this will only give you
> a recognizer for valid programs but not any useful structure, as the
> inferred grammar is likely to be highly ambiguous. Making it
> unambiguous requires expert knowledge on par with what is required to
> write a grammar from scratch.

Well, consider it in the way that OCR was not so long ago. (and
likely still is). It isn't perfect, but sometimes editing the output
of OCR is easier than typing it all by hand. The statistics (types of
mistakes) are different, though.

> So while grammar inference may be of theoretical interest, it is not
> useful for writing grammars for programming languages.

How about just a verifier, for a hand written grammar? One could
start writing, then have a program show some statements that it
doesn't accept, to be added by hand. Iterate until good enough.

But another question is, what does one want to do with the result?

If one wants a syntax verifier, then the problem is somewhat
easier than needed for a compiler. For example, a compiler is
usually expected to generate a parse tree with the appropriate
operator precedence already included. A verifier just needs
to verify that the input can be parsed, but doesn't need to
know about precedence. It does seem unlikely that an automated
system would extract precedence from sample input.

Many years ago I used the BNF description of Fortran in the IBM
manual for a Fortran syntax verifier. I don't remember it being
in the compiler manuals (Program Logic Manuals), but it was in
the Syntax Verifier manual. But syntax verifiers aren't used
much anymore.

> It will probably be more useful to try to uncover documentation for the
> language. If you don't know its name or anything, you could try to post
> an example of code on comp.languages.misc and see if anyone recognizes
> it.

-- glen

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: Language inference tool ?

glen herrmannsfeldt <gah@ugcs.caltech.edu>Thu, 2 Dec 2010 21:59:34 +0000 (UTC)

glen herrmannsfeldt <gah@ugcs.caltech.edu>
Thu, 2 Dec 2010 21:59:34 +0000 (UTC)