Re: terminological problem (EBNF & regular expressions)

Detlef Meyer-Eltz <Meyer-Eltz@t-online.de>
19 Oct 2005 02:41:18 -0400

          From comp.compilers

Related articles
Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-14)
Re: terminological problem (EBNF & regular expressions) Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2005-10-15)
Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-17)
Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-19)
Re: terminological problem (EBNF & regular expressions) Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2005-10-19)
Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-20)
Re: terminological problem (EBNF & regular expressions) Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2005-10-23)
Re: terminological problem (EBNF & regular expressions) paul@parsetec.com (Paul Mann) (2005-10-26)
Re: terminological problem (EBNF & regular expressions) RLake@oxfam.org.uk (2005-10-26)
| List of all articles for this month |
From: Detlef Meyer-Eltz <Meyer-Eltz@t-online.de>
Newsgroups: comp.compilers
Date: 19 Oct 2005 02:41:18 -0400
Organization: Compilers Central
References: 05-10-104
Keywords: syntax
Posted-Date: 19 Oct 2005 02:41:18 EDT

> I don't know why you have a dislike of recursive rules, and
> especially left-recursive rules.


> [The reason you might want to avoid recursive rules is so your
> expression can be encoded as a state machine rather than a state
> machine plus a stack. -John]


Dear John, thank you for your assistance and your argument certainly
isn't bad. But the matter is much more trivial: I don't generally
dislike recursive rules. I will soon the release version 1.0.0 of my
TextTransformer and I want to tell potential customers, what they can
expect from the program.


I got two direct answers on my last mail here and they brought me to
the idea, to create the term: Kleene notation. Does anybody think,
this term might be wrong or misleading? Do you mean, that one can
imagine what is meant?


When I began with the development of the TextTransformer some years
ago I had to decide how to construct the scanner and - leaned on
Coco/R - I indeed began with a state machine for tokens defined in
Wirth's EBNF notation. I threw it away and replaced it by Dr. John
Maddocks boost library for regular expressions:


http://www.boost.org/libs/regex/doc/index.html


There were several reasons to do so:


1. I admire his work very much
2. this library is tested by a great community
3. sub-expressions can be accessed
4. I could use the regex_search method for my SKIP-feature, which
      allows parsing of texts with incomplete grammars.
4. regular expression can be used elsewhere in the program, so a
      minimum additional code is needed to add a parser, generated by the
      TextTransformer
5. the library has good chances to become a standard, see


http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1429.htm


6. the syntax of these expressions is widespread.


> And the other notation is widely used also (e.g. Microsoft guide to
> SQL). It just depends what you are reading.


I cite from the article just mentioned:


"Regular expressions are used by many familiar Unix utilities,
including egrep, grep, lex, and sed[1], and are supported directly in
many programming languages including awk[1], perl[2], C#[3], and
ECMAScript[4] (also known as JavaScript), and indeed are often used
ubiquitously wherever text processing occurs."


The TextTransformer is not only intended for programmers and people
experienced in compiler construction. The TextTransformer can be used
as a stand-alone application by end users too.


7. the laborious task of defining character classes often is dropped


> Why not \l for lowercase letters and \L for
> uppercase letters? What is \w ?? How does one write a '+' that
> does
> not mean "one or more"? I think it's confusing.


indeed there are


\l Any lower case character a-z
\L Any non lower case character a-z
\u Any upper case character A-Z
\U Any non upper case character A-Z
\w (word) all alphanumeric characters plus the underscore


Meta characters like '+' must be preceded by a backslash, to get their
literal meaning




>> <comment1b> -> /\*[^*]*\*+([^/*][^*]*\*+)*/


> I would not call this readable. It looks like greek to me.


Here I have to admit, that you are right. The expression is the result
of a speed optimization.
Such pre-defined expressions can be selected in the TextTransformer
IDE by mouse click in a popup menu.




> What if an end-of-file is encountered before the */ ? Then
> your lexer will just keep on reading garbage and may never
> find a */.


The comment expression would be a part of an expression, describing,
which parts of text the parser has to ignore. E.g.:


(\s \// spaces
|//[^\n]*\n \// line comment
|/\*[^*]*\*+([^/*][^*]*\*+)*/ \// block comment
)+


If an end-of-file is encountered before the "*/", the expression would
not be recognized and so it would not be ignored. The parser would try
to recognize "/*...". Depending on the grammar, "/*..." would be
parsed as anything other than a comment, or, for a correct c++
grammar, the result would be a parse error.




> BTW, how do you specify an end-of-file (26)?
> How about a null (0)?


(What means BTW? I am from Germany. )


Control characters, and other characters too, can be encoded in
hexadecimal form: \x..
Normally it's not necessary to specify end-of-file (0), as at the
end-of-file in most cases also is the end of the grammar. But there is
a special token EOF provided in the TextTransformer, for cases, where
you want to finish parsing correctly, even if you are the middle of a
rule. E.g. values, returned by productions, are parsed in the
TextTransformer interpreter grammar by means of EOF:


{{int i = }} production


There is also a common expression for the end-of-buffer "\z", which in
the TextTransformer indeed is the end of file.






> BTW, show me a more concise way to specify "zero or more of X or Y
> or
> Z separated by commas"
> LRgen allows one to do it like this: [X|Y|Z]/','...


Similar concise: ((X|Y|Z),)*




> The ISO standard for BNF is horrible in my opinion, so I'm glad
> that it has not become popular.


So I can finish this mail conciliatorily: I agree!






Regards


Detlef Meyer-Eltz


--
mailto:Meyer-Eltz@t-online.de


http://www.texttransformer.de
http://www.texttransformer.com
http://www.texttransformer.org
http://www.text-konverter.homepage.t-online.de


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.