Re: terminological problem (EBNF & regular expressions)

"Paul Mann" <paul@parsetec.com>
17 Oct 2005 00:30:12 -0400

          From comp.compilers

Related articles
Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-14)
Re: terminological problem (EBNF & regular expressions) Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2005-10-15)
Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-17)
Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-19)
Re: terminological problem (EBNF & regular expressions) Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2005-10-19)
Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-20)
Re: terminological problem (EBNF & regular expressions) Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2005-10-23)
Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-26)
Re: terminological problem (EBNF & regular expressions) RLake@oxfam.org.uk (2005-10-26)
| List of all articles for this month |
From: "Paul Mann" <paul@parsetec.com>
Newsgroups: comp.compilers
Date: 17 Oct 2005 00:30:12 -0400
Organization: Compilers Central
Keywords: lex, syntax
Posted-Date: 17 Oct 2005 00:30:12 EDT

> First I try to explain my problem again: I think it is a widespread
> opinion, that regular expressions are written with a notation using
> '?', '*' and '+', as this notation is used in many text processing
> software.


And the other notation is widely used also (e.g. Microsoft guide to
SQL). It just depends what you are reading. Actually I got the
original idea from


"A Human Engineered Variant of BNF", Sigplan Notices 1980, by Henry
F. Ledgard.


And the other grammar notation I used is similar to DeRemer and
Pennello's TWS and Bill Barrett's Qparser. It was the result of an
"evolution" from the original


<nonterminal> ::= <tail_symbol1> <tail_symbol2>


kind of notation from the Algol 60 days.


> I am looking for a word for this kind of regular expressions.


EBNF !


> I want to contrast them against regular expressions with a notation
> with [...], {...} and from expressions using recusive rules. I want
> to emphasize, that one has not to learn a new grammar, to use the
> TextTransformer


I don't know why you have a dislike of recursive rules, and especially
left-recursive rules. I find left recursion to be much more natural
than the right recursion required by LL systems. In fact, LL grammars
are not grammars strictly speaking -- they are top-down program
specifications. But this is a topic for another article.


Well, your notation is a new grammar for me. And my notation is just
BNF (you don't need to use EBNF if you aren't used to it).


> By regular expressions this is not possible. Non
> nesting block comments can be defined with the TextTransformer as:
>
> <comment1b> -> /\*[^*]*\*+([^/*][^*]*\*+)*/


I would not call this readable. It looks like greek to me.
What if an end-of-file is encountered before the */ ? Then
your lexer will just keep on reading garbage and may never
find a */.


> But in other respects regular expressions are more concise than a
> lexer grammar. Often, predefined character classes or their negations
> are sufficient for the definition of tokens.
> The translation of your examples is:
>
> <identifier> -> [A-Za-z_]\w*
> <integer> -> \d+
> <spaces> -> \s+ // to translate your example exactly: [\t\n ]+
> <comment2> -> //[^\n]*


I think you left out digits in the <identifier> definition.


Yes, this is more concise, but it's new to me and therefore I would
have to learn it. BTW, how do you specify an end-of-file (26)?
How about a null (0)? Why not \l for lowercase letters and \L for
uppercase letters? What is \w ?? How does one write a '+' that does
not mean "one or more"? I think it's confusing.


I wish the world could agree on a standard notation for defining
languages, but it seems everyone who writes a parser generator wants
to use his own notation, different from everyone else.


There seems to be no basis for determining which notation is better.
"More textbooks use it" is hardly a scientific or human-factors
criteria. A committee of PhD's, who have never written a parser
generator is not the answer either.


The ISO standard for BNF is horrible in my opinion, so I'm glad
that it has not become popular.


And the worst thing of all is those systems that have you put a
comma after every symbol in the grammar so that you can use
spaces between words ...


for stmt -> 'for', '(', initial exp, ';', limit exp, ';', iteration, ')'


YUCK!!


In summary, I'm not trying to criticize your system. I'm just trying
to give you my point of view in a blunt way. I appreciate your
interest in this topic.


BTW, show me a more concise way to specify "zero or more of X or Y or
Z separated by commas"


LRgen allows one to do it like this: [X|Y|Z]/','...


Also read the paper on TBNF grammar notation at:
http://parsetec.com/papers.html


This appears to be the most concise notation for specifying the
complete translation process from source code to intermediate code --
all done in a grammar, without including any code in the grammar.


Paul Mann
http://parsetec.com
[The reason you might want to avoid recursive rules is so your
expression can be encoded as a state machine rather than a state
machine plus a stack. -John]



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.