Re: additional regular expression operators

Ralph Boland <rpboland@gmail.com>
Tue, 31 Mar 2009 18:07:38 -0700 (PDT)

          From comp.compilers

Related articles
additional regular expression operators rpboland@gmail.com (Ralph Boland) (2009-03-29)
Re: additional regular expression operators m.helvensteijn@gmail.com (2009-03-30)
Re: additional regular expression operators zaimoni@zaimoni.com (2009-03-30)
Re: additional regular expression operators haberg_20080406@math.su.se (Hans Aberg) (2009-03-30)
Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-03-31)
Re: additional regular expression operators rpboland@gmail.com (Ralph Boland) (2009-03-31)
Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-04-14)
Re: additional regular expression operators zayenz@gmail.com (MZL) (2009-04-15)
Re: additional regular expression operators anton@mips.complang.tuwien.ac.at (2009-04-16)
Re: additional regular expression operators gah@ugcs.caltech.edu (glen herrmannsfeldt) (2009-04-16)
Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-04-17)
Re: additional regular expression operators mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2009-04-17)
| List of all articles for this month |

From: Ralph Boland <rpboland@gmail.com>
Newsgroups: comp.compilers
Date: Tue, 31 Mar 2009 18:07:38 -0700 (PDT)
Organization: Compilers Central
References: 09-03-111 09-03-123
Keywords: lex
Posted-Date: 04 Apr 2009 09:44:40 EDT

On Mar 31, 5:30 am, torb...@pc-003.diku.dk (Torben Fgidius Mogensen)
wrote:
> Ralph Boland <rpbol...@gmail.com> writes:
> > I plan to provide support for at least three additional
> > unary operators to the standard three (?,*, and +)
> > (plus new binary operators but that's a topic for another day).
> > For a regular expression R they are:
>
> > R! :does not accept the empty string but otherwise
> > accepts every expression that R does. Very useful.
>
> > R~ :accepts exactly the set of strings that R does not accept.
> > note that R = R~~.
>
> > R% : equivalent to the string (R~)R
>
> I can see the usefulness of the first two, but not really the third.
> Do you have an example?


It's useful in pattern matching where we want to find the first
occurrence of pattern R. Admittedly this only gives us the
point where R ends and we still need to find where R starts.


It is also useful in scanners where we have comments.
If a comment starts with <: and ends with :> the
definition of a comment would be:
              <:(:>)%.


The rarity of applications of the % operator could admittedly
be used to argue that the % operator is not worth implementing.
After all, we could have instead written: <:(:>)~:>.
>
> In general, I find R\S (strings in R but not in S) more useful than
> the above, and you can easily define the above using this: R! = R\e,
> where e is the empty string and R~ = (.*)\R, where . represents any
> character.


I intend to implement set difference but as it is not a unary operator
it was not the subject of my posting. I don't have an explicit symbol
for the empty string and am not planning to add it. If find the '!'
operator very useful and would still want it even if I could write R\e
as you propose. To me, allowing an e symbol is only useful if you
plan to get rid of both '?' and '!' operators, which is not my
preference. While it is true that R~ = (.*)\R I prefer to support
both operators.


> > I have put these operators after the expression but in fact I prefer
> > to have them in front
> > because:
> > 1) I prefer to have unary operators in front of their expressions
> > as in "-5" rather than "5-".
>
> That is a matter of taste, but you would IMO need more compelling
> reasons to change what has been standard for half a century.


Yes, I know. I prefer unary operators to be prefix but momentum
is very much against me. I always expected in the end to conform
to the norm and so far this forum has provided me with no reason
to do otherwise.


>
> > 2) Parsing is easier and faster if the unary operator precedes
> > the expression.
>
> Not really. If that was the case, early mathematical calculators
> would have used prefix sine, log, etc., but even calculators with
> infix arithmetic and precedence levels (such as Texas SR51) used
> postfix unary operators.


An interesting point but contrary to my experience. If I write a
grammar the prefix unary operators are easy to handle with both LL(1)
based parsing and LR(1) based parsing. With postfix operators I have
problems with needing to recognize an expression too soon. Examples
of problems with LL(1) parsing are easy to construct. I haven't
written an LR(1) grammar in a while so I can't remember the exact
problems I had but I do remember having them and thinking: "This would
be so much simpler if the unary operators were prefix". I will be
developing a system where the user can define his own operators. My
utilities will then need to generate the grammars based upon the
defined operators (and the standard ones). It will matter that the
grammars be as simple and regular as possible.


> > Suggestions as to further operators to support welcome. Suggestions
> > as to what are the best symbols to use for the new operators welcome.
>
> As I said, R\S is very useful. I have also often found use for "a
> sequence of Rs separated by <symbol>", such as a list of names
> separated by commas. Something like R_s, where s is the symbol could
> be a possible notation, e.g. ([A-Za-z]*)_',' for the above example.
> R_s expands to (Rs)*R, but since R occurs twice in this, you can
> potentially reduce the size of the regexp a lot.


This is the list operator which for me will be a binary operator with
regular expressions on each side. Admittedly one of the operands is
usually a symbol but I see no reason for this restriction. This is a
very useful operator that should be included in any regular expression
based utility. There are also efficiency reasons for providing the
list operator; just try applying the list operator multiple times as
in (R $ S $ T $ U) where $ is the list operator and then expanding the
operator out where R # S = (RS)*S. Then look as what the
corresponding minimum finite state machine looks like.


> Also, you sometimes want a sequence of things in arbitrary order. So
> something like {R,S,T}% as an abbreviation of RST|RTS|SRT|STR|TRS|TSR
> would be useful. But it can lead to extremely large expanded
> expressions or automatons.


I agree that this is useful but I prefer the notation: R @ S @ T.
Note that: R @ S @ T != (R @ S) @ T != R @ (S @ T).


> Note that once you have complement or difference, the construction of
> automatons get more complicated, as you can't construct an NFA
> compositionally and then convert this to a DFA, you have to construct
> a DFA directly. There are standard methods for this based on
> derivatives of regular expressions, though, so it is not a major
> issue. It just means that you can't easily modify an existing tool.


I am writing my tools from scratch. I am aware that complement
and the other set based operators are a major task to implement.
But complement and set difference are too important to leave out.
I have never found a useful example of set intersection though.
I have found so far only one paper on implementing these operators
and it is complex. I will take a look at derivative based algorithms.


Thanks for your comments.


Ralph Boland


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.