Re: requiring balanced parens in a regexp?

Chris F Clark <cfc@shell01.TheWorld.com>
10 Nov 2006 18:38:56 -0500

From comp.compilers

Related articles
requiring balanced parens in a regexp? petermichaux@gmail.com (Peter Michaux) (2006-11-10)
Re: requiring balanced parens in a regexp? petermichaux@gmail.com (Peter Michaux) (2006-11-10)
Re: requiring balanced parens in a regexp? mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2006-11-10)
Re: requiring balanced parens in a regexp? martin@gkc.org.uk (Martin Ward) (2006-11-10)
Re: requiring balanced parens in a regexp? haberg@math.su.se (2006-11-10)
Re: requiring balanced parens in a regexp? alexc@TheWorld.com (Alex Colvin) (2006-11-10)
*Re: requiring balanced parens in a regexp? cfc@shell01.TheWorld.com (Chris F Clark)* (2006-11-10)**

| List of all articles for this month |

From:	Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups:	comp.compilers
Date:	10 Nov 2006 18:38:56 -0500
Organization:	The World Public Access UNIX, Brookline, MA
References:	06-11-039 06-11-040
Keywords:	lex
Posted-Date:	10 Nov 2006 18:38:56 EST

"Peter Michaux" <petermichaux@gmail.com> writes:
>
> Hi John,
>
> Peter Michaux wrote:
>> > Can JavaScript regular expressions ensure that all
>> > parentheses to the right of "test" are closed before proclaiming a
>> > match? If so how? If not must I walk through the string counting how
>> > nested each character is?
>>
>> [Matching parentheses is the classic problem that pure regexp's cannot
>> solve, because they have no way to count the number of unmatched
>> parens. There's a variety of ways to fake it, e.g., take your string,
>> replace each ([^)]+) by a space until there aren't any left, then look
>> for your match. Or use various extended regexps that are more like
>> parsers. -John]
>
> Thanks for the info. My problem is even bigger as there might be mixed
> quotes with various balance in relation to the parens.
>
> I think that I will need to write a real tokenizer and parser to get my
> job done properly. I haven't done that before and it is something I
> have always wanted to do but it seems like a *big* task.

It's not really as big a task as it seems. You do need a tool that
handles languages slightly more complex than simple regular
expressions (i.e. one that handles recursion). However, an LL or LR
tool will do. You just need an LL or LR lexer generator--there are
such beasts.

In particular, we have roughly the same problem for defining "embedded
code" tokens in Yacc++ (and our lexer generator (which is ELR
i.e. regular expressions + LR) handles it for us). In Yacc++ we
define "embedded code" tokens to be matched braces with text inside,
except that in the braces we don't count braces inside quoted strings
or comments (C or C++ style). I have enclosed below the four rules
that do it.

The main subtle thing about the rules, besides that they are recursive
is the @ operator which is a Yacc++ feature that works somewhat
between the . operator of LEX and the [^xxx] not in charset feature of
LEX. The easiest way to understand @ is it says match anything that
isn't otherwise matched.

Therefore, you can read the embedded code rule as saying, the embedded
code is a pair of braces surrounding matching braces, comments,
strings, and anything else (the @). There is a little more dealing
with newlines and single / characters that aren't the start of
comments, but the above is essentially it.

The matching braces is the same thing, so within braces, one can have
additional levels of matching braces (and that rule is recursive).
There is a slight limitation in Yacc++, that rules that define actual
tokens can not themselves be recursive (but can have sub-rules which
are recursive), but other tools may not have that limitation. We have
it only to make some details of our implementation easier. If Yacc++,
didn't have that restriction, the problem could be expressed in 3
rules, by merging the matching braces and embedded code rules.

The comment rule defines C and C++ comments.

The string rule defines strings with either single or double quotes
(and handles \s in the strings.

Note, if we wanted to balance parens and braces, we would need an
extra rule for matching parens. And, yet another, if we wanted to
match brackets also. However, we are only talking about a handfull of
rules at most.

And, the handfull of rules is the point, if you can use recursion with
regular expressions, then the problem isn't that hard. You mostly
just need to write down what you want the rules to be. There won't be
that many. You just need to name each balanced part as it's own rule,
and give the rule a unique name and make it recursive if you allow
nesting of the balanced delimiters.

If you have recursion without regular expressions (plain yacc and/or
Bison), it takes a little more work. If you have regular expressions
without recursion (as in LEX, perl, and most regex packages), you
can't do it directly.

EMBEDDED_CODE :

        "{" { wr_act_bgn(); }
                  (matching_braces | @ | embedded_comment | embedded_string | "/" @
                    | "/\n" // newline
                        { ++yy_buf_lineno(); }
                    | "\n" // newline
                        { ++yy_buf_lineno(); }
                  )*
        "}"
                                ;
        {
                /* remove trailing brace from spelling */
                yy_lex_rslt().as_sym_ptr = (yy_sym_dflt_ptr)yy_lookup(yy_lex_token(), yy_lex_len()-1,
                        EMBEDDED_CODE_);
                        grm_mark_uid(yy_lex_rslt().as_sym_ptr);

                wr_act_end(yy_lex_rslt().as_sym_ptr);
        }

matching_braces :

        "{"
                (matching_braces | @ | embedded_comment | embedded_string | "/" @
                  | "/\n" // newline
                        { ++yy_buf_lineno(); }
                  | "\n" // newline
                        { ++yy_buf_lineno(); }
                )*
        "}"
                                  ;

embedded_comment :

          "/*" (( "\n"
                { ++yy_buf_lineno(); }
          | @ )* "*")+ "/"

    | "//" @ * "\n"
                { ++yy_buf_lineno(); }
                                  ;

embedded_string : '"'
                                            ( "\\\""
                                            | "\\\\"
                                            | "\\" @
                                            | "\\\n" { ++yy_buf_lineno(); }
                                            | "\n" { ++yy_buf_lineno(); }
                                            | @ )*
                                    '"'
                                | "'"
                                            ( "\\'"
                                            | "\\\\"
                                            | "\\" @
                                            | "\\\n" { ++yy_buf_lineno(); }
                                            | "\n" { ++yy_buf_lineno(); }
                                            | @ )*
                                    "'"
                                ;

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: requiring balanced parens in a regexp?

Chris F Clark <cfc@shell01.TheWorld.com>10 Nov 2006 18:38:56 -0500

Chris F Clark <cfc@shell01.TheWorld.com>
10 Nov 2006 18:38:56 -0500