Re: Interesting paper on regex NFA matching

Christopher F Clark <>
Mon, 29 Jan 2024 10:39:58 +0200

          From comp.compilers

Related articles
Interesting paper on regex NFA matching (John R Levine) (2024-01-25)
Re: Interesting paper on regex NFA matching (2024-01-26)
Re: Interesting paper on regex NFA matching (Kaz Kylheku) (2024-01-26)
Re: Interesting paper on regex NFA matching (Christopher F Clark) (2024-01-27)
Re: Interesting paper on regex NFA matching (Kaz Kylheku) (2024-01-27)
Re: Interesting paper on regex NFA matching (Christopher F Clark) (2024-01-29)
Re: Interesting paper on regex NFA matching (Kaz Kylheku) (2024-01-29)
Re: Interesting paper on regex NFA matching (Christopher F Clark) (2024-02-01)
Re: Interesting paper on regex NFA matching (Ev Drikos) (2024-02-04)
| List of all articles for this month |

From: Christopher F Clark <>
Newsgroups: comp.compilers
Date: Mon, 29 Jan 2024 10:39:58 +0200
Organization: Compilers Central
Injection-Info:; posting-host=""; logging-data="76233"; mail-complaints-to=""
Keywords: lex, NFA
Posted-Date: 29 Jan 2024 12:11:14 EST

Although, our esteemed moderator already corrected this error:
> [Patterns with alternatives and trailing contexts do have to back up
> and the flex manual warns it's quite expensive. See the

I still want to write an extended reply to this, where
Kaz Kylheku wrote:

> I think that case can be handled by technology like Lex trailing
> contexts, which doesn't require backtracking.
  > {digits}/. {
> // ... set up semantic value ..
> return INTEGER;
> }
> {digits}.{digits} {
> // ... set up semantic value ..
> return FLOAT;
> }
> Another possibility is to recongize {digits}. as one lexeme and call
> unput('.'). That's a minor form of backtracking, that doesn't require us
> to do anything funny to the regex implementation.

You may not recognize it as backtracking, but it is reading two characters
past the end of the integer token to disambiguate it. And while one character
of lookahead is generally "free" two characters requires more extensive
buffering and mechanisms to reset the state.

Thus, what LEX and flex are implementing is not strictly a pure
regular expression matcher, but one with a slight extension to what
can be recognized. While it is an extremely useful and appropriate
convention for the application area, it isn't strictly within the
bounds. The engine which runs such machines needs to have the hooks to
implement that extension.

One can prove this by using the hierarchy of language families where
regular expressions are a subset of LL(1) grammars which are a subset
of LR(1) grammars. If you try to write an LL(1) grammar for that, you
will find yourself unable to construct a proper FIRST set which disambiguates
the two rules Similarly, if one is attempting to make an LR grammar for it
uaing an algorithm that builds first an LR(0) machine, you will see that it
has a shift/reduce conflict on the "." character and one which cannot be
disambiguated by one character of lookahead.

Thus, what is implemented in LEX is essentially "longest, lexically first"
matching. Where the lexer saves the last state where a token was accepted
and continues attempting to match, and on failure backs up to the last
recognized match (and then applies the rule which is lexically first that
caused that match).

The only difference between this and what PEG (and PCRE) mechanisms
do is that they implement "lexical first, longest" match. That reorders the
set of matches that are considered. As the paper you dismissed notes,
that can be compared to depth-first, versus breadth-first search of the
tree of potential matches. Both strategies have their advantages and
areas where they are the proper choice and places where they are the
wrong choice.

If one finds oneself writing "trailing context" or "unput" in one's lexical
specification, one is using lookahead. And sometimes in lexers one
uses lookahead without even noticing it, as LEX doesn't emit warnings
when it does so. (We had to disable such warnings in our implementation
to get it to approximate LEX behavior.)


One final note is that language classes are determined by writing a
grammar which accepts (not lexes or parses) the given language.
That is one can write a grammar that doesn't disambiguate the cases
(i.e. combines them into one case), but then one hasn't correctly
tokenized it. I don't know of an automata theory (or compiler) book
that mentions this distinction, much less highlights its implications,
so I will forgive not knowing it unless one makes an argument that
depends upon it, which you did.

Chris Clark email:
Compiler Resources, Inc. Web Site:
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.