[ann] regexp-engine 0.11 (written entirely in Ruby)

Simon Strandgaard <neoneye@adslhome.dk>
6 Jun 2004 15:28:01 -0400

          From comp.compilers

From: Simon Strandgaard <neoneye@adslhome.dk>
Newsgroups: comp.compilers
Date: 6 Jun 2004 15:28:01 -0400
Organization: TDC Totalloesninger
Keywords: lex
Posted-Date: 06 Jun 2004 15:28:01 EDT

http://freshmeat.net/projects/rubyregexp/


For quite some time I have been working on my own regexp engine.
It's quite complete, and may have educational value to
others. Ruby code is relatively easy to read, and if you are
familiar with the Interpreter design pattern, then it's hopefully
understandable.
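
As a rough illustration of the Interpreter pattern applied to
regexps, here is a minimal sketch. It is not code from
regexp-engine: the class names (Literal, AnyChar, Sequence,
Alternation, Star) and the helper full_match? are made up for
this example. Each node of the pattern tree knows how to match
itself and then hands control to a continuation block that
matches the rest of the pattern, which gives backtracking for free.

```ruby
# Hypothetical node classes, invented for illustration only.
# Every node implements match(text, pos) and yields the position
# after its own match; the block returns true if the rest matched.

class Literal
  def initialize(char)
    @char = char
  end

  def match(text, pos)
    text[pos] == @char && yield(pos + 1)
  end
end

class AnyChar
  # Like '.', matches any character except newline.
  def match(text, pos)
    pos < text.length && text[pos] != "\n" && yield(pos + 1)
  end
end

class Sequence
  def initialize(*nodes)
    @nodes = nodes
  end

  # Match node number `index`, then continue with the next node.
  def match(text, pos, index = 0, &cont)
    return cont.call(pos) if index == @nodes.length
    @nodes[index].match(text, pos) { |p| match(text, p, index + 1, &cont) }
  end
end

class Alternation
  def initialize(*branches)
    @branches = branches
  end

  # Try the branches left to right; the first one that lets the
  # rest of the pattern succeed wins.
  def match(text, pos, &cont)
    @branches.any? { |branch| branch.match(text, pos, &cont) }
  end
end

class Star
  def initialize(node)
    @node = node
  end

  # Greedy '*': try one more repetition first, then fall back to
  # zero further repetitions. The p > pos check avoids looping
  # forever on empty matches.
  def match(text, pos, &cont)
    @node.match(text, pos) { |p| p > pos && match(text, p, &cont) } ||
      cont.call(pos)
  end
end

# True if the whole text is matched by the pattern tree.
def full_match?(node, text)
  node.match(text, 0) { |pos| pos == text.length } ? true : false
end

# Tree for the pattern a(b|c)*d
pattern = Sequence.new(
  Literal.new("a"),
  Star.new(Alternation.new(Literal.new("b"), Literal.new("c"))),
  Literal.new("d")
)

full_match?(pattern, "abcbd")   # => true
full_match?(pattern, "abxd")    # => false
```

A real engine needs far more (captures, lazy loops, lookaround),
but the shape stays the same: one small class per construct.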


It has been developed with the test-first paradigm, and has
more than 2000 tests, blackbox as well as whitebox. It can operate on
UTF-8 and ASCII encoded data. Perl5-style syntax is well supported.


In the near future I will extend the perl6 syntax support, and add
more encodings: UTF16BE, UTF16LE, Big5, SJIS, EUC.


If anyone here at comp.compilers has ideas on how to do
backreferences-inside-lookbehind, then please reply :-)




The current status of perl5 support is:


    a|b|c           alternation
    [...] [^...]    character class and inverse character class
    [[:alpha:]]     posix character class
    [[:^alpha:]]    inverse posix character class
    .               dot matches anything except newline, same as [^\n]
    \1 .. \9        backreference (see [3])
    * *?            loop 0 or more times greedy/lazy
    + +?            loop 1 or more times greedy/lazy
    {n,} {n,}?      loop n or more times greedy/lazy
    ? ??            loop 0..1 times greedy/lazy
    {n,m} {n,m}?    loop n..m times greedy/lazy
    {n} {n}?        loop n times greedy/lazy
    ( ... )         capturing group
    (?: ... )       non-capturing group
    (?> ... )       atomic grouping
    (?= ... )       positive-lookahead
    (?! ... )       negative-lookahead (see [2])
    (?<= ... )      positive-lookbehind (see [1])
    (?<! ... )      negative-lookbehind (see [1], [2])
    (?# ... )       posix-comment
    (?i) (?-i)      ignorecase on/off
    (?m) (?-m)      multiline on/off
    (?x) (?-x)      extended on/off
    ^ \A            begin of line, begin of string
    $ \z \Z         end of line, end of string (excl newline)
    \b \B           word boundary, nonword boundary
    \d \D           [[:digit:]] and the inverse [^[:digit:]]
    \s \S           [[:space:]] and the inverse [^[:space:]]
    \w \W           [[:word:]] and the inverse [^[:word:]]
    \x20            hex (see [4])
    \040            octal (see [3], [4])
    \x{deadbeef}    widechar codepoint specified as hex
    \n              newline
    \a              bell
    \               escape next char
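
For readers less familiar with some of the constructs listed
above, the following sketch shows what a few of them mean. It
uses Ruby's built-in Regexp class, not regexp-engine, on the
assumption that both agree on these particular constructs. The
method name builtin_regexp_demo is made up for this example.

```ruby
# Uses Ruby's built-in Regexp, NOT regexp-engine. Shown only to
# illustrate the meaning of a few constructs from the list.
def builtin_regexp_demo
  text = "<b>bold</b>"
  {
    # Greedy '+' grabs as much as it can.
    greedy: text[/<.+>/],
    # Lazy '+?' stops at the first '>' it can.
    lazy: text[/<.+?>/],
    # Atomic group: (?>a+) swallows every 'a' and never gives one
    # back, so the trailing 'a' has nothing left to match.
    atomic: "aaa".match?(/\A(?>a+)a/),
    # Without the atomic group, a+ backtracks by one 'a'.
    backtracking: "aaa".match?(/\Aa+a/),
    # Positive lookahead asserts " USD" follows without consuming it.
    lookahead: "cost: 42 USD"[/\d+(?= USD)/],
    # \1 must repeat whatever group 1 captured.
    backreference: "hello world"[/(.)\1/],
  }
end

builtin_regexp_demo.each do |name, result|
  puts "#{name}: #{result.inspect}"
end
# greedy: "<b>bold</b>"
# lazy: "<b>"
# atomic: false
# backtracking: true
# lookahead: "42"
# backreference: "ll"
```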


1. Variable-width lookbehind is fairly well supported by this engine.
      For instance, (?<=(a.*)g) is a valid expression.
      Beware that the leftmost-longest rule is inverted inside lookbehind,
      and that backreferences are not possible (yet).


2. Subcaptures inside negative-lookahead/behind are empty
      at the moment.


3. If one tries to backreference a nonexistent capture, then it
      will be interpreted as an octal escape.


4. When the encoding is ASCII, you can specify hex/octal values in
      the range 0-255. However, when the encoding is UTF-8, only the
      range 0-127 is valid; the range 128-255 is undefined in that case.
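
Notes 3 and 4 describe two decisions a parser has to make when
it meets a backslash followed by digits. A minimal sketch of that
logic, under stated assumptions: the helper name
resolve_numeric_escape, its arguments and its return values are
invented here and are not taken from regexp-engine. Raising an
error for an out-of-range value is also an assumption; note 4
only says the range is undefined.

```ruby
# Hypothetical helper, not part of regexp-engine.
#   digits:        the digits after the backslash, e.g. "1" or "040"
#   capture_count: number of capturing groups in the pattern
#   encoding:      :ascii or :utf8
def resolve_numeric_escape(digits, capture_count, encoding)
  # Note 3: \1 .. \9 is a backreference only if that capture exists.
  if digits =~ /\A[1-9]\z/ && digits.to_i <= capture_count
    return [:backreference, digits.to_i]
  end

  # Otherwise the digits are read as an octal value.
  unless digits =~ /\A[0-7]+\z/
    raise ArgumentError, "not an octal number: #{digits}"
  end
  value = digits.to_i(8)

  # Note 4: 0-255 is valid for ASCII, only 0-127 for UTF-8.
  limit = encoding == :utf8 ? 127 : 255
  if value > limit
    raise ArgumentError, "value #{value} is out of range for #{encoding}"
  end

  [:char, value]
end

resolve_numeric_escape("1", 2, :ascii)    # => [:backreference, 1]
resolve_numeric_escape("1", 0, :ascii)    # => [:char, 1]
resolve_numeric_escape("040", 0, :utf8)   # => [:char, 32]

# One reason 128-255 is awkward in UTF-8: such a byte is not a
# complete character on its own.
[0xE9].pack("C").force_encoding("UTF-8").valid_encoding?   # => false
```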




--
Simon Strandgaard

