Re: re2c-1.0 released!

Kaz Kylheku <398-816-0742@kylheku.com>
Sat, 2 Sep 2017 11:18:56 -0400 (EDT)

          From comp.compilers

Reroute: compilers@archive.iecc.com
From: Kaz Kylheku <398-816-0742@kylheku.com>
Newsgroups: comp.compilers
Date: Sat, 2 Sep 2017 11:18:56 -0400 (EDT)
Organization: Aioe.org NNTP Server
References: 17-08-007
Injection-Info: miucha.iecc.com; posting-host="www.ucenet.org:2001:470:1f07:1126:0:7563:656e:6574"; logging-data="31377"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, tools
Posted-Date: 02 Sep 2017 11:18:56 EDT

On 2017-08-27, Ulya Trofimovich <skvadrik@gmail.com> wrote:
> libraries and command-line tools like grep and sed. It may seem that
> only a lack of effort prevents developers of lexer generators like Flex
> from implementing [submatch extraction] (as well as fixing the
> ever-broken trailing contexts [3]).


> [3] https://ftp.gnu.org/old-gnu/Manuals/flex-2.5.4/html_node/flex_23.html


Firstly, your [3] references the documentation of Flex 2.5.4, a version
dating back to, I think, late 1996, making it 21 years old! The
trailing context feature works well enough; it has quirks and
restrictions in its use (which Flex does a decent job of diagnosing, in
my experience).


With that out of the way: the likely reason Lex doesn't feature submatch
extraction (like with \1, \2 and so on) is that this isn't normally
required for lexical analysis. Submatch extraction is useful for
hacky text processing with regexes in the absence of a parser.


This feature will be of the greatest use to someone using a lexical analyzer
generator *by itself* to optimize a text filtering task (e.g. taking some
"sed logic" and trying to make it faster by transformation to C), rather
than to someone scanning a programming or data definition language.


If Lex is being used with a parser (like by the denizens of
comp.compilers), there is already an excellent form of backreferencing
available: namely the references to the grammar nodes from the grammar
rule action, such as the $1, $2, $3 ... variables in Yacc, or
the equivalent under some other parser generator.


For instance, someone processing, say, a 2017-08-28 style date using
regexes in a POSIX text processing tool might have something like
([0-9]+)-([0-9]+)-([0-9]+), and get the pieces as \1 \2 \3. This
approach would hardly be needed by someone treating the situation
more formally with some grammar phrase structure like INTEGER '-'
INTEGER '-' INTEGER, where the integers can be pulled out as $1, $3 and
$5.
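
To make the comparison concrete, here is a rough Python sketch of both
styles. It is only an illustration: Python's re module stands in for the
POSIX tool, and the helper names (pieces_by_backreference, tokenize,
pieces_by_grammar) are made up for this example, with a hand-written check
standing in for a Yacc rule.

```python
import re

# Regex style: three capturing groups; sed would call them \1 \2 \3.
DATE_RE = re.compile(r"([0-9]+)-([0-9]+)-([0-9]+)")

def pieces_by_backreference(text):
    m = DATE_RE.fullmatch(text)
    return (m.group(1), m.group(2), m.group(3)) if m else None

# Grammar style: the lexer only knows INTEGER and '-' tokens.
TOKEN_RE = re.compile(r"[0-9]+|-")

def tokenize(text):
    tokens = TOKEN_RE.findall(text)
    # Reject input containing anything other than digits and '-'.
    return tokens if "".join(tokens) == text else None

def pieces_by_grammar(text):
    rhs = tokenize(text)
    if rhs is None:
        return None
    kinds = ["-" if t == "-" else "INTEGER" for t in rhs]
    # Stand-in for the rule: date : INTEGER '-' INTEGER '-' INTEGER
    if kinds != ["INTEGER", "-", "INTEGER", "-", "INTEGER"]:
        return None
    # Yacc numbers right-hand-side symbols from 1, so $1, $3, $5
    # correspond to rhs[0], rhs[2], rhs[4].
    return (rhs[0], rhs[2], rhs[4])
```

Both functions return the same three strings for 2017-08-28; the second
one never needs a capture register because the positions in the rule
already name the pieces.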


However, it's easy to see that a date in that form can be
handled as a single token, and then backreferencing can provide
convenient access to the pieces, which helps code that produces a
three-element semantic value for the token, supplying integer values
to the parser. So there is some nonzero utility to a language
writer; the need is just not pressing. In the few places where this sort of
thing could be of benefit, you can code up a small, local workaround.
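
One possible shape for such a workaround, sketched in Python rather than
in a Lex action: match the whole date as one lexeme with no capture
groups, then split the lexeme locally to build the three-element semantic
value. The function name date_token and the ("DATE", value, end) tuple
layout are assumptions made for this sketch.

```python
import re

# The token pattern itself has no capture groups at all.
DATE_LEXEME = re.compile(r"[0-9]+-[0-9]+-[0-9]+")

def date_token(text, pos=0):
    """Return ("DATE", (year, month, day), end) for a date at pos, else None."""
    m = DATE_LEXEME.match(text, pos)
    if m is None:
        return None
    # Local workaround: re-split the matched lexeme instead of
    # relying on submatch extraction in the lexer.
    year, month, day = (int(part) for part in m.group(0).split("-"))
    return ("DATE", (year, month, day), m.end())
```

The extra split costs a second pass over a ten-character lexeme, which is
the kind of small price that makes submatch support a convenience here
rather than a necessity.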


It's even easier to see that it's a significant burden to deal with
formal syntax above the lexer for someone who just wants a fast text
processing hack that does everything it needs inside the lexer;
that's the use case that sees the greatest benefit.


Even purer opinion follows:


It was a big, big mistake in Unix regexes to endow the grouping
parentheses with semantics by making them delimit numbered capture
registers.


This is such a problem that derivatives of POSIX regex syntax, such as
PCRE, found it necessary to add back a "non-capturing" form of parenthesis
(expressed using annoying verbiage) just for doing pure grouping.
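
The effect is easy to observe in any engine with Perl-style syntax. The
sketch below uses Python's re module as a stand-in (an assumption of this
example; it is neither POSIX regex nor PCRE itself), where the
non-capturing form is spelled (?:...).

```python
import re

# Parentheses meant only for grouping still claim capture register 1 ...
capturing = re.compile(r"(ab)+([0-9]+)")
# ... unless written in the non-capturing form (?:...).
grouping_only = re.compile(r"(?:ab)+([0-9]+)")

m1 = capturing.fullmatch("abab42")
m2 = grouping_only.fullmatch("abab42")

# In the first pattern the digits land in register 2, because the
# grouping parentheses took register 1. In the second they are in 1.
digits_from_capturing = m1.group(2)
digits_from_grouping_only = m2.group(1)
```

Adding or removing a purely structural pair of parentheses in the first
pattern renumbers every register after it, which is the burden the
non-capturing form exists to avoid.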


If you add capturing, at least use a special syntax for it, leaving
the ordinary parentheses alone.


For instance:


    (?1R)


could mean "match regex R, and capture into register 1", and


    (R)


just means "parse R as an indivisible unit with regard to all
surroundings" without burdening it with any additional semantics.
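
As a toy illustration of how this proposal relates to existing syntax,
the Python sketch below rewrites a pattern in the proposed notation into
Python's re notation: (?1R) becomes a named group r1, and a plain (R)
becomes the non-capturing (?:R). Everything here is hypothetical: the
function name translate, the r1-style group names, and the
simplifications (parentheses inside bracket expressions are not handled,
and a register number is taken to be every digit right after "(?").

```python
import re

def translate(pattern):
    """Rewrite the proposed (?1R) / (R) notation into Python re syntax."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escaped character through untouched.
            out.append(pattern[i:i + 2])
            i += 2
        elif ch == "(":
            m = re.match(r"\(\?([0-9]+)", pattern[i:])
            if m:
                # Explicit capture into a numbered register.
                out.append("(?P<r%s>" % m.group(1))
                i += m.end()
            else:
                # Ordinary parenthesis: grouping only, no capture.
                out.append("(?:")
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

translated = translate(r"(ab)+(?1[0-9]+)")
match = re.fullmatch(translated, "abab42")
```

Under these assumptions the grouping around ab costs no register, and the
digits are reachable as match.group("r1").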

