Re: perl regular expression grammar

abigail@foad.org (Abigail)
18 Jul 2001 20:02:27 -0400

          From comp.compilers


From: abigail@foad.org (Abigail)
Newsgroups: comp.lang.perl.misc,comp.compilers
Date: 18 Jul 2001 20:02:27 -0400
Organization: Abigail's Kinderboerderijen
References: 01-07-080
Keywords: syntax
Posted-Date: 18 Jul 2001 20:02:27 EDT
X-Date: MMDCCCLXXVIII September MCMXCIII

Alan Oursland (alan@oursland.net) wrote on MMDCCCLXXVIII September MCMXCIII:
## I've been looking for a complete perl 5 regular expression grammar
## and, having been unsuccessful in my search, have attempted to write
## one myself. I was wondering if anyone could help me find any errors in
## it (excluding grammar syntax errors). I've left out embedded modifiers
## from the grammar -- I'm not sure how they fit into the grammar. I've
## also skimmed over the non-meta character production. One area I am
## confused is the "\c[" control character (described at
## http://www.perldoc.com/perl5.6/pod/perlre.html). How does this work?
##
## Alan Oursland
##
## Here is the grammar:
## <re> ::= <union>
## <union> ::= <concat>"|"<union> | <concat>
## <concat> ::= <quant><concat> | <quant>
## <quant> ::= <group>"*" | <group>"+" | <group>"?" | <group>"{"<bound>"}" | <group>
## <group> ::= "("<re>")" | <term>
## <term> ::= "." | "$" | "^" | <char> | <set>
## <bound> ::= <num> | <num>"," | <num>","<num>
## <char> ::= <non-meta> | "\"<escaped>
## <non-meta> ::= any non-meta char
## <escaped> ::= <meta>|<control>|<special>|<assert>
## <meta> ::= "."|"^"|"$"|"?"|"*"|"+"|"|"|"["|"("|")"|"\"|"{"
## <control> ::= "t"|"n"|"r"|"f"|"a"|"e"|"l"|"u"|"L"|"U"|"E"|"Q"
## <special> ::= <backoctal>|<hexchar>|<controlchar>|<class>
## <assert> ::= "b"|"B"|"A"|"z"|"Z"|"G"
## <backoctal> ::= <digit> | <digit><digit> | "0"<oct><oct> | "+" | "&" | "`" | "'"
## <hexchar> ::= "x"<hex><hex> | "x{"<hex><hex><hex><hex>"}"
## <controlchar> ::= "c["
## <namedchar> ::= "N{"<name>"}"
## <class> ::= "w"|"W"|"s"|"S"|"d"|"D"|"X"|"C" |"p"<name>|"P"<name>|"[:"<posixclass>":]"|"[:^"<posixclass>":]"
## <posixclass> ::= "alpha"|"alnum"|"ascii"|"cntrl"|"digit"|"graph"|"lower"|"print"|"punct"|"space"|"upper"|"word"|"xdigit"
## <name> ::= <unicodeclass>
## <unicodeclass> ::= "IsAlpha"|"IsAlnum"|"IsASCII"|"IsCntrl"|"IsDigit"|"IsGraph"|"IsLower"|"IsPrint"|"IsPunct"|"IsSpace"|"IsUpper"|"IsWord"|"IsXDigit"
## <set> ::= "[" <set-items> "]" | "[^" <set-items> "]"
## <set-items> ::= <set-item> | <set-item> <set-items>
## <set-item> ::= <range> | <char>
## <range> ::= <char> "-" <char>
## <num> ::= <digit><num> | <digit>
## <oct> ::= "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"
## <digit> ::= "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
## <hex> ::= "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"|"a"|"b"|"c"|"d"|"e"|"f"|"A"|"B"|"C"|"D"|"E"|"F"
## <mod> ::= "\i"|"\m"|"\s"|"\x"






Some regexes that I cannot parse using the above grammar:


        /3*?/
        /g{,1}/
        /\cM/
        /aa/
        /[*]/
        /(?!!)/


Some regexes can be parsed ambiguously with the above grammar:


        /\+/


(why are "+", "&", "`" and "'" mentioned in <backoctal>?)
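To make the ambiguity concrete, here is a minimal sketch in Python (not Perl) that looks up which alternatives of <escaped> can produce the character after the backslash. The terminal sets are copied from the quoted grammar; the names ESCAPED_ALTERNATIVES and derivations are made up for this illustration, and only the alternatives relevant to /\+/ are modelled.

```python
# Terminal sets copied from the quoted grammar; only the parts that matter
# for "\+" are modelled. All names here are hypothetical.
ESCAPED_ALTERNATIVES = {
    # <meta> ::= "."|"^"|"$"|"?"|"*"|"+"|"|"|"["|"("|")"|"\"|"{"
    "<meta>": set(".^$?*+|[()\\{"),
    # The single-character alternatives at the end of <backoctal>,
    # reached via <escaped> -> <special> -> <backoctal>.
    "<special> -> <backoctal>": set("+&`'"),
}


def derivations(ch):
    """Return the alternatives of <escaped> that can produce ch."""
    return sorted(
        name for name, chars in ESCAPED_ALTERNATIVES.items() if ch in chars
    )


if __name__ == "__main__":
    for ch in "+*&":
        print(repr(ch), derivations(ch))
```

A character with more than one entry, such as "+", has more than one parse after the backslash; dropping "+" from <backoctal> would leave a single derivation.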




Here are some modifications of the grammar that fix some of the
issues:




Fixes /3*?/:
        <quant> ::= <group><quantifier><greedy> | <group>
        <quantifier> ::= "*" | "?" | "+" | "{" <bound> "}"
        <greedy> ::= "?" | ""
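The modified <quant> production can be checked with a small recursive-descent recognizer. This is a rough sketch in Python, not anything Perl itself does: <term> is simplified to ".", "$", "^" or a single non-meta character (no escapes, no sets), and the names Parser and accepts are invented for the example.

```python
META = set(".^$?*+|[()\\{")   # <meta> from the quoted grammar
DIGITS = set("0123456789")


class Parser:
    """Recognizer for <union>/<concat>/<quant>/<group> with a simplified <term>."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def eat(self, ch):
        if ch and self.peek() == ch:
            self.pos += 1
            return True
        return False

    def parse_union(self):
        # <union> ::= <concat> "|" <union> | <concat>
        if not self.parse_concat():
            return False
        while self.eat("|"):
            if not self.parse_concat():
                return False
        return True

    def parse_concat(self):
        # <concat> ::= <quant> <concat> | <quant>
        if not self.parse_quant():
            return False
        while self.peek() not in ("", "|", ")"):
            if not self.parse_quant():
                return False
        return True

    def parse_quant(self):
        # <quant> ::= <group> <quantifier> <greedy> | <group>
        if not self.parse_group():
            return False
        if self.parse_quantifier():
            self.eat("?")          # <greedy> ::= "?" | ""
        return True

    def parse_quantifier(self):
        # <quantifier> ::= "*" | "?" | "+" | "{" <bound> "}"
        if self.peek() in ("*", "?", "+"):
            self.pos += 1
            return True
        start = self.pos
        if self.eat("{") and self.parse_bound() and self.eat("}"):
            return True
        self.pos = start           # not a quantifier: undo partial match
        return False

    def parse_bound(self):
        # <bound> ::= <num> | <num> "," | <num> "," <num>
        if not self.parse_num():
            return False
        if self.eat(","):
            self.parse_num()       # the second <num> is optional
        return True

    def parse_num(self):
        start = self.pos
        while self.peek() in DIGITS:
            self.pos += 1
        return self.pos > start

    def parse_group(self):
        # <group> ::= "(" <re> ")" | <term>   (simplified <term>, see above)
        if self.eat("("):
            return self.parse_union() and self.eat(")")
        ch = self.peek()
        if ch in (".", "$", "^") or (ch != "" and ch not in META):
            self.pos += 1
            return True
        return False


def accepts(text):
    parser = Parser(text)
    return parser.parse_union() and parser.pos == len(text)


if __name__ == "__main__":
    for regex in ["3*?", "a{2,3}?", "(ab|c)+?d", "g{,1}", "*"]:
        print(regex, accepts(regex))
```

With this change "3*?" is accepted, while "g{,1}" is still rejected because <bound> requires a leading <num>.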


Fixes /aa/:
        <term> ::= "." | "$" | "^" | <chars> | <set>
        <chars> ::= <char> <chars> | <char>
        <char> ::= <non-meta> | "\"<escaped>


Fixes /\cM/:
        <controlchar> ::= "c" <any-char>
        <any-char> ::= Any possible character.


(but that doesn't fix /\c/ and would allow /\c\/ which doesn't parse in Perl).
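As for how "\c[" works: to the best of my knowledge, Perl takes the character after "\c", uppercases it, and flips bit 6 of its code (XOR with 64), so "\cM" is carriage return and "\c[" is escape. A minimal sketch of that mapping in Python, assuming ASCII input only; the function name control_char is made up for the illustration.

```python
def control_char(ch):
    """Map the X in \\cX to a control character (ASCII only, assumed rule).

    Assumption: the rule is chr(ord(uppercase(X)) XOR 64), as described
    in the text above. This is an illustration, not Perl's source.
    """
    if len(ch) != 1 or ord(ch) > 127:
        raise ValueError("expected a single ASCII character")
    return chr(ord(ch.upper()) ^ 64)


if __name__ == "__main__":
    for ch in "M[m?":
        print(ch, hex(ord(control_char(ch))))
```

This is why <controlchar> needs a following character rather than the literal "c[": the "[" in the documentation is just one example of X.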






Abigail
--
srand 123456;$-=rand$_--=>@[[$-,$_]=@[[$_,$-]for(reverse+1..(@[=split
//=>"IGrACVGQ\x02GJCWVhP\x02PL\x02jNMP"));print+(map{$_^q^"^}@[),"\n"
__END__
A bee crawling in // the branches of a hazel. A // pair of bears. Bankei.

