Re: re2c-1.0 released!

Ben Hanson <jamin.hanson@googlemail.com>
Mon, 4 Sep 2017 05:19:49 -0700 (PDT)

          From comp.compilers

Related articles
[5 earlier articles]
Re: re2c-1.0 released! skvadrik@gmail.com (Ulya Trofimovich) (2017-09-03)
Re: re2c-1.0 released! jamin.hanson@googlemail.com (Ben Hanson) (2017-09-03)
Re: re2c-1.0 released! jamin.hanson@googlemail.com (Ben Hanson) (2017-09-03)
Re: re2c-1.0 released! 398-816-0742@kylheku.com (Kaz Kylheku) (2017-09-03)
Re: re2c-1.0 released! skvadrik@gmail.com (Ulya Trofimovich) (2017-09-03)
Re: re2c-1.0 released! skvadrik@gmail.com (Ulya Trofimovich) (2017-09-03)
Re: re2c-1.0 released! jamin.hanson@googlemail.com (Ben Hanson) (2017-09-04)
Re: re2c-1.0 released! skvadrik@gmail.com (Ulya Trofimovich) (2017-09-08)
| List of all articles for this month |

From: Ben Hanson <jamin.hanson@googlemail.com>
Newsgroups: comp.compilers
Date: Mon, 4 Sep 2017 05:19:49 -0700 (PDT)
Organization: Compilers Central
References: 17-08-007 17-09-001 17-09-003 17-09-010
Keywords: lex, design

On Sunday, 3 September 2017 18:44:57 UTC+1, Ulya Trofimovich wrote:
> Using TDFA in a runtime library has certain downsides:
>
> 1. It implies the runtime overhead on (lazy) determinization: it's only
> useful if the regexp is compiled once and used many times. Libraries
> such as RE2 (by Russ Cox) or TRE (by Ville Laurikari) mostly use NFA for
> submatching.
>
> 2. Also, due to ahead-of-time compilation lexer generators are able to
> perform extensive optimization of the generated code, including
> minimization of the number of submatch variables and operations on them.
> A runtime library cannot afford that.
>
> However, if TDFA algorithm is to be used in a library, then my paper
> contains two improvements worth consideration:
>
> 1. TDFA(1) with one-symbol lookahead instead of default TDFA(0)
>
> 2. Formal algorithm for POSIX disambiguation semantics.




Ah, overheads. I'm guessing you're being a bit gloomy there, as "DFA
is too slow!" has been the battle cry for NFA advocates for decades
now. I didn't believe it in the '90s and I don't believe it now.


However, you raise an interesting point. There are plenty of regexes I
use at work that are fixed, so they could indeed just use generated
code to save the construction cost at runtime. The only reason I
haven't done this up until now is that it makes maintenance easier to
just have a string you can tweak directly in the source. The holy
grail (I believe) is sufficiently powerful meta-programming to do the
whole thing at compile time, although you could of course do it today
with custom build steps to generate code via re2c or whatever other
tool (that kind of thing tends to scare developers where I work).


When I want lookup speed I use my own (runtime) lexer generator with a
single rule of interest, and then a single skip rule for anything
else. This works very well and I'm sure re2c could easily be used the
same way.


I'm currently playing with searching using LALR(1) grammars and have
written a search tool gram_grep which I will be extending soon. I
would like to put together an LR(*) algorithm to make defining rules
easier for people, but I don't know when I'll have the time to really
get into that.


I believe there are many more possibilities for text processing than
the traditional tools everyone normally turns to and I applaud your
efforts in bringing TDFA to re2c!


Regards,


Ben



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.