Re: Language inference tool ?


Related articles
Language inference tool ? pocky6@gmail.com (Vivien Parlat) (2010-11-29)
Re: Language inference tool ? gneuner2@comcast.net (George Neuner) (2010-12-01)
Re: Language inference tool ? torbenm@diku.dk (2010-12-01)
Re: Language inference tool ? gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-12-02)
Re: Language inference tool ? valeriocavadini@hotmail.com (Valerio Cavadini) (2010-12-04)
Re: Language inference tool ? gneuner2@comcast.net (George Neuner) (2010-12-05)

From: George Neuner <gneuner2@comcast.net>
Newsgroups: comp.compilers
Date: Sun, 05 Dec 2010 05:11:11 -0500
Organization: A noiseless patient Spider
References: 10-11-040 10-12-003 10-12-005
Keywords: tools, parse
Posted-Date: 05 Dec 2010 12:21:04 EST

On Thu, 2 Dec 2010 21:59:34 +0000 (UTC), glen herrmannsfeldt
<gah@ugcs.caltech.edu> wrote:


>Torben Ægidius Mogensen <torbenm@diku.dk> wrote:
>> Vivien Parlat <pocky6@gmail.com> writes:
>
>>> My job requires me to work with a no-more-supported and unknown
>>> language. I'm considering the idea of building tools in order to
>>> facilitate the work with it, but defining its grammar manually could
>>> be very tedious.
>
>(snip)
>
>> Grammar inference will not give you anything useful.
>
>That was my first thought, but then I wasn't so sure.
>
>For one, the usual programming languages use only a small subset of
>the possible grammars. With the appropriate limitations, it might be
>possible to get a good start on one.
>
>> If you only show positive examples, the simplest grammar is the one
>> that accepts any text string, and the most precise is the one that
>> accepts the examples and nothing else.
>
>OK, consider trying to find the C grammar from a reasonably sized
>sample of C code. Most likely, it wouldn't be able to separate
>library calls from statements, but maybe that isn't so bad.
>Otherwise, it is statistical. If a certain word appears many times at
>the beginning of a statement (line?) then it is likely a statement
>keyword. In between the two limits, there are some that are likely to
>accept other valid examples, and less likely to accept ones that
>aren't valid.


I think a statistical model is not enough to do much that is useful beyond
identifying keywords and maybe some latent structure. The NLP
programs that I mentioned in my previous post work with a "parts of
speech" dictionary that gives them usage information. Combined with a
statistical model of ordering and grouping, that allows them to infer
grammar rules from raw text. Such programs generally assume that
words not in the dictionary are most likely to be noun forms.
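
To see how little the purely statistical part buys you, here is a toy
sketch (Python, purely illustrative; it assumes one statement per
line, which is itself a hint):

    import collections, re

    def likely_keywords(source, threshold=0.05):
        # Tokens that open a large fraction of lines are candidate
        # statement keywords; everything else stays unclassified.
        starts = collections.Counter()
        lines = 0
        for line in source.splitlines():
            tokens = re.findall(r"[A-Za-z_]\w*", line)
            if tokens:
                starts[tokens[0]] += 1
                lines += 1
        return {t for t, n in starts.items() if n / lines >= threshold}

    # e.g. likely_keywords(corpus_text) might return {'IF', 'WHILE'}

Note it tells you nothing about what may legally *follow* a keyword,
which is where the parts-of-speech information earns its keep.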


Working statistically, the inferencer might be able to determine that,
for example


IF ( a ) b (ELSE c)?        (where a, b, c are pattern variables)


is a valid language construct. But lacking knowledge of data types or
that IF is an interrogative, the inferencer has no way to determine
that 'a' should be an expression that reduces to a boolean.
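
That pattern is roughly what you would get by anti-unifying samples
of the same shape. A toy sketch of the generalization step (Python;
real alignment over nested, variable-length constructs is far harder
than this token-for-token version):

    def generalize(s1, s2):
        # Replace positions where two same-shape samples differ with
        # fresh pattern variables; identical tokens stay literal.
        variables = iter("abcdefgh")
        out = []
        for t1, t2 in zip(s1.split(), s2.split()):
            out.append(t1 if t1 == t2 else next(variables))
        return " ".join(out)

    generalize("IF ( x > 0 ) y = 1", "IF ( n < 9 ) m = 2")
    # -> 'IF ( a b c ) d = e'

Nothing in the result marks 'a b c' as an expression, much less a
boolean one - which is the point.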


Similarly, ambiguous constructs like the dangling ELSE will present
problems for an unaugmented statistical model.
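
For anyone who hasn't run into it, using the same pattern variables:

    IF ( a ) IF ( b ) s1 ELSE s2

can group either as

    IF ( a ) { IF ( b ) s1 ELSE s2 }      ELSE binds to the inner IF
    IF ( a ) { IF ( b ) s1 } ELSE s2      ELSE binds to the outer IF

and a corpus of accepted programs alone cannot tell the inferencer
which grouping the language intends.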




>> Anything in between these two extremes is extremely difficult to
>> define, and I know of no tool that will give a useful result,
>> especially if the grammar is inherently large.
>
>Large is a problem, but that is exactly the case where an automated
>system is most useful. How about COBOL for an example? How hard
>would it be to extract the COBOL grammar from sample code? Is the
>extraction program allowed to know that blanks are significant, and
>that keywords are reserved? The more hints, the better the result is
>likely to be.


Unless the hints include parts of speech, I think you can hint all
you want, but it won't be enough even for simple languages. Consider
Lisp, a language with very few keywords, a very regular syntax, and
little syntactic distinction between code and data. Without the
knowledge that code requires the first word after the left
parenthesis either to be a keyword or to name a function, all an
inferencer could reasonably determine is that the language consists
of lists of words surrounded by parentheses.
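
Concretely, in standard Lisp the quote mark is the only surface
difference between these two forms:

    (if (null x) 0 1)      ; code: IF must head the list
    '(if (null x) 0 1)     ; data: the identical tokens, quoted

An inferencer working from raw text sees the same parenthesized list
of four items in both; the special role of the first word is
invisible.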




>> Also, even if you manage to infer a grammar, this will only give you
>> a recognizer for valid programs but not any useful structure, as the
>> inferred grammar is likely to be highly ambiguous. Making it
>> unambiguous requires expert knowledge on par with what is required to
>> write a grammar from scratch.
>
>Well, consider it the way OCR was regarded not so long ago (and
>likely still is): it isn't perfect, but sometimes editing the output
>of OCR is easier than typing it all by hand. The statistics (types of
>mistakes) are different, though.


Yes. But recognition is different from inferring structure - and
again, good OCR programs use a dictionary to guess at what character
is reasonable when the actual input can't be recognized. Without
using a dictionary, the percentage of mistakes is much greater.
[I used to work in machine vision and I've written OCR/OCV code.]
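
The flavor of that dictionary step, as a toy sketch (Python;
difflib's generic string similarity stands in here for whatever
confidence scores the vision front end produces):

    import difflib

    def correct(word, dictionary):
        # Snap an uncertain recognition result to the closest
        # dictionary entry, if any entry is close enough.
        hit = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.6)
        return hit[0] if hit else word

    correct("recieve", ["receive", "retrieve", "relieve"])  # -> 'receive'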




>> So while grammar inference may be of theoretical interest, it is not
>> useful for writing grammars for programming languages.
>
>How about just a verifier for a hand-written grammar? One could
>start writing, then have a program show some statements that it
>doesn't accept, to be added by hand. Iterate until good enough.


An interesting idea, but since the test input would already have been
sorted into accept/reject groups, the human should already have a
sense of what input the current grammar might handle incorrectly.
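
Mechanically the loop glen describes is trivial - which is partly my
point, since the expensive step is the human one. A sketch (Python;
'parse' stands for whatever recognizer the draft grammar compiles to):

    def triage(parse, samples):
        # Split the corpus by what the current draft grammar accepts.
        accepted, rejected = [], []
        for s in samples:
            (accepted if parse(s) else rejected).append(s)
        return accepted, rejected

    # Loop: inspect 'rejected', extend the grammar by hand, rebuild
    # 'parse', repeat until 'rejected' is empty or you trust the grammar.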


In the case where someone has been working with the language and
already knows some or all of the structure, I think it is quicker to
have that person just sit down and write a grammar for it.


It's unclear whether the OP falls into this category or has simply
been handed a project in an unfamiliar area. But my guess is that he
can learn the language from examples faster than he can get an
inference tool to do anything useful.




>But another question is, what does one want to do with the result?
>
>If one wants a syntax verifier, then the problem is somewhat
>easier than needed for a compiler. For example, a compiler is
>usually expected to generate a parse tree with the appropriate
>operator precedence already included. A verifier just needs
>to verify that the input can be parsed, but doesn't need to
>know about precedence. It does seem unlikely that an automated
>system would extract precedence from sample input.
>
>Many years ago I used the BNF description of Fortran in the IBM
>manual for a Fortran syntax verifier. I don't remember it being
>in the compiler manuals (Program Logic Manuals), but it was in
>the Syntax Verifier manual. But syntax verifiers aren't used
>much anymore.


I don't really see the distinction ... the parser, sans actions, acts
as a syntax verifier. In the days of batch computing, when small
memories required compilers to use multiple passes and operator
intervention was needed to change tapes or card decks, there was
significant utility in breaking out a syntax checker as a separate
program - no point wasting machine or operator time running the
compiler until a program was, at least, syntactically correct.
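
To be concrete about "sans actions", here is a toy rule both ways
(Python; the grammar rule is made up):

    def verify_sum(toks):
        # Recognizer for the toy rule   sum -> NUM '+' NUM
        return (len(toks) == 3 and toks[0].isdigit()
                and toks[1] == '+' and toks[2].isdigit())

    def parse_sum(toks):
        # The same rule plus its action: builds the tree node too.
        return ('+', int(toks[0]), int(toks[2])) if verify_sum(toks) else None

    parse_sum(['2', '+', '3'])   # -> ('+', 2, 3)

The verifier is literally the parser with the tuple construction
deleted - and note that neither routine needs to know anything about
precedence until you start building trees.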


But today I don't see a separate checker as being very useful. YMMV.




>> It will probably be more useful to try to uncover documentation for the
>> language. If you don't know its name or anything, you could try to post
>> an example of code on comp.languages.misc and see if anyone recognizes
>> it.


George


