17 Oct 2005 00:30:12 -0400

Related articles |
---|

Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-14) |

Re: terminological problem (EBNF & regular expressions) Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2005-10-15) |

Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-17) |

Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-19) |

Re: terminological problem (EBNF & regular expressions) Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2005-10-19) |

Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-20) |

Re: terminological problem (EBNF & regular expressions) Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2005-10-23) |

Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-26) |

Re: terminological problem (EBNF & regular expressions) RLake@oxfam.org.uk (2005-10-26) |

From: | "Paul Mann" <paul@parsetec.com> |

Newsgroups: | comp.compilers |

Date: | 17 Oct 2005 00:30:12 -0400 |

Organization: | Compilers Central |

Keywords: | lex, syntax |

*> First I try to explain my problem again: I think it is a widespread*

*> opinion, that regular expressions are written with a notation using*

*> '?', '*' and '+', as this notation is used in many text processing*

*> software.*

And the other notation is widely used also (e.g. Microsoft guide to

SQL). It just depends what you are reading. Actually I got the

original idea from

"A Human Engineered Variant of BNF", Sigplan Notices 1980, by Henry

F. Ledgard.

And the other grammar notation I used is similar to DeRemer and

Pennello's TWS and Bill Barrett's Qparser. It was the result of an

"evolution" from the original

<nonterminal> ::= <tail_symbol1> <tail_symbol2>

kind of notation from the Algol 60 days.

*> I am looking for a word for this kind of regular expressions.*

EBNF !

*> I want to contrast them against regular expressions with a notation*

*> with [...], {...} and from expressions using recusive rules. I want*

*> to emphasize, that one has not to learn a new grammar, to use the*

*> TextTransformer*

I don't know why you have a dislike of recursive rules, and especially

left-recursive rules. I find left recursion to be much more natural

than the right recursion required by LL systems. In fact, LL grammars

are not grammars strictly speaking -- they are top-down program

specifications. But this is a topic for another article.

Well, your notation is a new grammar for me. And my notation is just

BNF (you don't need to use EBNF if you aren't used to it).

*> By regular expressions this is not possible. Non*

*> nesting block comments can be defined with the TextTransformer as:*

*>*

*> <comment1b> -> /\*[^*]*\*+([^/*][^*]*\*+)*/*

I would not call this readable. It looks like greek to me.

What if an end-of-file is encountered before the */ ? Then

your lexer will just keep on reading garbage and may never

find a */.

*> But in other respects regular expressions are more concise than a*

*> lexer grammar. Often, predefined character classes or their negations*

*> are sufficient for the definition of tokens.*

*> The translation of your examples is:*

*>*

*> <identifier> -> [A-Za-z_]\w**

*> <integer> -> \d+*

*> <spaces> -> \s+ // to translate your example exactly: [\t\n ]+*

*> <comment2> -> //[^\n]**

I think you left out digits in the <identifier> definition.

Yes, this is more concise, but it's new to me and therefore I would

have to learn it. BTW, how do you specify an end-of-file (26)?

How about a null (0)? Why not \l for lowercase letters and \L for

uppercase letters? What is \w ?? How does one write a '+' that does

not mean "one or more"? I think it's confusing.

I wish the world could agree on a standard notation for defining

languages, but it seems everyone who writes a parser generator wants

to use his own notation, different from everyone else.

There seems to be no basis for determining which notation is better.

"More textbooks use it" is hardly a scientific or human-factors

criteria. A committee of PhD's, who have never written a parser

generator is not the answer either.

The ISO standard for BNF is horrible in my opinion, so I'm glad

that it has not become popular.

And the worst thing of all is those systems that have you put a

comma after every symbol in the grammar so that you can use

spaces between words ...

for stmt -> 'for', '(', initial exp, ';', limit exp, ';', iteration, ')'

YUCK!!

In summary, I'm not trying to criticize your system. I'm just trying

to give you my point of view in a blunt way. I appreciate your

interest in this topic.

BTW, show me a more concise way to specify "zero or more of X or Y or

Z separated by commas"

LRgen allows one to do it like this: [X|Y|Z]/','...

Also read the paper on TBNF grammar notation at:

http://parsetec.com/papers.html

This appears to be the most concise notation for specifying the

complete translation process from source code to intermediate code --

all done in a grammar, without including any code in the grammar.

Paul Mann

http://parsetec.com

[The reason you might want to avoid recursive rules is so your

expression can be encoded as a state machine rather than a state

machine plus a stack. -John]

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.