Re: Interesting paper on regex NFA matching

Kaz Kylheku <>
Mon, 29 Jan 2024 19:28:02 -0000

          From comp.compilers

Related articles
Interesting paper on regex NFA matching (John R Levine) (2024-01-25)
Re: Interesting paper on regex NFA matching (2024-01-26)
Re: Interesting paper on regex NFA matching (Kaz Kylheku) (2024-01-26)
Re: Interesting paper on regex NFA matching (Christopher F Clark) (2024-01-27)
Re: Interesting paper on regex NFA matching (Kaz Kylheku) (2024-01-27)
Re: Interesting paper on regex NFA matching (Christopher F Clark) (2024-01-29)
Re: Interesting paper on regex NFA matching (Kaz Kylheku) (2024-01-29)
Re: Interesting paper on regex NFA matching (Christopher F Clark) (2024-02-01)
Re: Interesting paper on regex NFA matching (Ev Drikos) (2024-02-04)
| List of all articles for this month |

From: Kaz Kylheku <>
Newsgroups: comp.compilers
Date: Mon, 29 Jan 2024 19:28:02 -0000
Organization: Compilers Central
References: 24-01-008
Injection-Info:; posting-host=""; logging-data="21096"; mail-complaints-to=""
Keywords: lex, performance
Posted-Date: 29 Jan 2024 15:57:01 EST

On 2024-01-29, Christopher F Clark <> wrote:
> Although, our esteemed moderator already corrected this error:
>> [Patterns with alternatives and trailing contexts do have to back up
>> and the flex manual warns it's quite expensive. See the
>> PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]
> I still want to write an extended reply to this, where
> Kaz Kylheku wrote:
>> I think that case can be handled by technology like Lex trailing
>> contexts, which doesn't require backtracking.
> > {digits}/. {
>> // ... set up semantic value ..
>> return INTEGER;
>> }
>> {digits}.{digits} {
>> // ... set up semantic value ..
>> return FLOAT;
>> }
>> Another possibility is to recongize {digits}. as one lexeme and call
>> unput('.'). That's a minor form of backtracking, that doesn't require us
>> to do anything funny to the regex implementation.
> You may not recognize it as backtracking, but it is reading two characters
> past the end of the integer token to disambiguate it.

But some backtracking is required for recognizing the longest matching
prefixes of the input, even if we us nothing but basic regex features!

If we consider the trivial pattern (a*b)+:

Given the input "aaabaaabaaac" the prefix which constitutes a token
is the "aaabaaab".

In order to extract that token, the machine will have to consume
symbols all the way through the "aaac" which is excluded from the
lexeme. Only at the "c" character does it hit the situation that no
transition is available, and so the matching terminates. Then it has to
return to the most recent position in the input where the automaton had
been in an acceptance state.

It is not the same as the backtracking of a graph/tree search algorithm,
which returns to prior points to try alternatives, due to doing a
depth-first search.

At the point where we back up in the input, we are not trying
different branches of a depth-first-search. We have completed
a depth-first-search!

One input retrace at the conclusion of a succesful search is not
qualitatively the same situation as multiple, cascaded retraces
integrated into the search algorithm.

> Thus, what is implemented in LEX is essentially "longest, lexically first"
> matching. Where the lexer saves the last state where a token was accepted
> and continues attempting to match, and on failure backs up to the last
> recognized match (and then applies the rule which is lexically first that
> caused that match).

Yes; that is required in any regex matching function that searches
for a prefix, or infix.

The fundamental automaton's job is to recognize strings: does a given
input string belong to the language (set of strings) described by that

In lexical analysis, we solve the problem of, what is the longest prefix
of the input which belongs to that set of strings, the obvious way to do
which is to match as far as we can and then return to the most recent
good position.

> The only difference between this and what PEG (and PCRE) mechanisms
> do is that they implement "lexical first, longest" match.

I think when you say "lexical first", you might be referring to the
rules themselves: rules that are lexically earlier in the grammar
"win" over later rules.

(In Lex, that still matters. If two rules match tokens of exactly
the same length, the lexically earlier rule gets the token.)

This "lexical first, longest" is why the order of R1|R2|..|RN matters
in some so-called regex implementations.

Unfortunately, as far as regexes go, that is broken.

It may still be a language describing a regular set, or at least
in some cases, but the regex language for doing that requires the
disjunction operator to be commutative.

This is just historically so, but with justification.

The people who invented regular expressions decades ago, as far back as
the 1960's and earlier, and wrote papers about them decided on a
commutative disjunction, for good reasons.

A commutative disjunction is necessary in order to have a direct
relationship between sets and regular expressions.

If R1 describes a set of strings S1, and R2 describes a set of strings
S2, then R1|R2 describes S1 ∪ S2.

Because ∪ commutes, we want | to commute, since it means the same thing.

If | doesn't commute, then it doesn't correspond to the ∪ operator. The
relationship to sets has been severed or muddled.

TXR Programming Language:
Cygnal: Cygwin Native Application Library:

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.