Flex is the most powerful lexical analysis language in the world. True or False?


From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Date: Thu, 5 May 2022 15:20:07 +0300
Organization: Compilers Central
References: 22-05-003
Keywords: flex, history
Posted-Date: 05 May 2022 15:12:28 EDT

First of all, I think Flex is a relatively fine lexer generator with
some good properties and reasons why it is popular, but claiming it is
"the most powerful lexical analysis language in the world" is
overstating it by far.


It is not even as powerful as some other lexical analysis languages,
and even exploiting its power often requires things that cannot be
expressed in Flex alone, nor can they be done in ways that are simple
and easy to reason about. So, that part of the statement is patently
false. I will answer the question, but go beyond simple yes and no
responses to show the actual issues and not some over-simplification
of the problem.


> 1. A lexical analysis language that exclusively provides regular expressions
> for scanning input can only process regular languages.


True, if you mean automata-theoretic regular expressions, i.e. not
including PCRE extensions like back references, captures, and
zero-width assertions. However, if you include such extensions (or
other ones I will mention below), you can escape those limits and get
to larger language classes, including ones that are Turing Complete.
Note, though, that Turing Machine power can be abused and result in a
lexer where you cannot figure out what the tokens are without solving
the Halting Problem. So, what one really wants is a nicely constrained
but larger lexing language that is still easy to reason about. There
are tools that approximate that. None are perfect yet (if perfection
is even achievable and not merely a limit to be approached but never
reached).
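
To make the point concrete, here is a tiny illustration using Python's
PCRE-flavoured re module rather than Flex (the pattern and test
strings are just made-up examples): a single back reference recognizes
the "copy" language { w w : w nonempty }, which is not even
context-free, let alone regular.

    import re

    # The back reference \1 forces the second half of the input to
    # repeat whatever the first group captured, so this pattern
    # recognizes the copy language { w w }, which no finite automaton
    # (and no context-free grammar) accepts.
    copy = re.compile(r"(.+)\1")

    print(bool(copy.fullmatch("abcabc")))   # True  -- "abc" then "abc"
    print(bool(copy.fullmatch("abcabd")))   # False -- the halves differ
    print(bool(copy.fullmatch("aa")))       # True  -- "a" then "a"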


> 2. Flex provides, in addition to regular expressions, states and a pushdown
> stack. This greatly expands the set of languages that can be processed.
> 3. Because Flex provides states and a pushdown stack, Flex lexers can process
> context-free languages.


True and true. States by themselves don't add anything: the
intersection or union of two regular languages is still regular.
However, a pushdown automaton can allow you to process a context-free
language. Now, not having used that feature myself, I don't know
whether one has to "program" the automaton by hand or whether one can
express it "naturally" as context-free rules. I suspect the latter,
but don't know for sure.
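
Either way, the essential trick is easy to see in a hand-rolled
sketch. The following is Python, not Flex syntax, and the names are
purely illustrative, but the explicit stack of lexer states is exactly
the pushdown ingredient that lets one recognize nested comments, a
context-free but non-regular construct:

    def strip_nested_comments(text):
        """Remove nested /* ... */ comments from `text`.

        The `states` list plays the role of a start-condition stack:
        push a COMMENT state on every "/*", pop one on every "*/".
        That stack is precisely what plain regular expressions lack.
        """
        states = ["INITIAL"]               # explicit stack of lexer states
        out = []
        i = 0
        while i < len(text):
            two = text[i:i + 2]
            if two == "/*":
                states.append("COMMENT")   # push: one level deeper
                i += 2
            elif two == "*/" and states[-1] == "COMMENT":
                states.pop()               # pop: back out one level
                i += 2
            elif states[-1] == "COMMENT":
                i += 1                     # discard comment text
            else:
                out.append(text[i])        # ordinary, non-comment input
                i += 1
        if states[-1] == "COMMENT":
            raise SyntaxError("unterminated comment")
        return "".join(out)

    print(strip_nested_comments("a /* x /* y */ z */ b"))   # prints: a  b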


In comparison, in 1986, when we built the lexer for our version of
Yacc++, we integrated regular expressions and LR rules, so that one
can simply write yacc-like rules with recursion to express
[deterministic, i.e. ELR] grammars (with regular expressions on the
right-hand side). To my mind, the notation being unified and used in
both the lexer and the parser is easier to comprehend and, within
reason, much easier to reason about. It makes things like nested
comments into an "obvious" recursive lexical rule.


We also have lexer classes, which allow one to do lexer states (for
context sensitivity) with inheritance between them, and you can change
the state (which class is being lexed) via rules in the lexer or
parser, doing the things that Flex does.


Terence Parr liked our idea and eventually incorporated it into ANTLR
(originally doing an LL version of it); the current version, ANTLR4,
is more like a PEG (parsing expression grammar) with extensions to
handle simple (aka direct) left-recursion and precedence.


Moreover, he has some simple lexer extensions that look quite
Flex-like (rules for changing "state" in a controlled fashion,
including pushing/popping from a stack). Being more constrained, they
are actually easier to reason about: you aren't trying to analyze
arbitrary C (or Java) code.


> 4. No other lexical analysis language provides states and a pushdown stack.
> 5. Flex is the most powerful lexical analysis language in the world.


False and false. See the preceding paragraphs. Not only do other
tools provide those features, they do so in a manner that is easier to
reason about.


Moreover, syntactic (and semantic) predicates (an innovation Terence
introduced and we borrowed later) allow one to express languages that
are larger than the context-free family. In fact, there are proofs
that one can use them to express any Turing Complete language. They
are typically implemented using backtracking.
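
A tiny sketch of the idea (again Python, with a made-up predicate, not
any particular tool's notation): the classic non-context-free language
a^n b^n c^n falls out of a coarse regular-expression match plus a
semantic predicate, i.e. an arbitrary boolean check on what was
matched.

    import re

    # A "semantic predicate" is just an extra boolean test attached to
    # a rule.  Here the rule proper is the regular expression a+b+c+
    # and the predicate demands that the three runs have equal length,
    # which lifts the recognized language out of the context-free
    # family.
    shape = re.compile(r"(a+)(b+)(c+)")

    def accepts(s):
        m = shape.fullmatch(s)
        return bool(m) and len(m.group(1)) == len(m.group(2)) == len(m.group(3))

    print(accepts("aaabbbccc"))   # True
    print(accepts("aabbbcc"))     # False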


However, Bryan Ford took the concept of predicates and developed
packrat parsing as a way of implementing PEGs efficiently, effectively
in linear time. And the notation is only slightly different from
context-free grammars, the main difference being that the "or"
(alternative) is treated as ordered: if the first alternative matches,
apply it and don't even try the following ones. As I noted above, this
nicely solves certain precedence problems. (And, as I said, ANTLR4
incorporates this idea to positive effect.)
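
For a feel of how little machinery that takes, here is a toy packrat
recognizer in Python for a made-up arithmetic PEG; the grammar and
function names are invented for the example. Memoizing each rule on
its input position is what keeps the backtracking effectively linear,
and the ordered choice in `atom` is the only place two alternatives
compete.

    # Toy packrat-style recognizer for the PEG
    #     sum  <- prod ('+' prod)*
    #     prod <- atom ('*' atom)*
    #     atom <- num / '(' sum ')'     ('/' is ORDERED choice)
    #     num  <- [0-9]+
    # Each rule returns the position just past a successful match, or None.

    def make_parser(text):
        memo = {}

        def rule(fn):                   # memoize on (rule name, position)
            def wrapped(pos):
                key = (fn.__name__, pos)
                if key not in memo:
                    memo[key] = fn(pos)
                return memo[key]
            return wrapped

        @rule
        def num(pos):
            end = pos
            while end < len(text) and text[end].isdigit():
                end += 1
            return end if end > pos else None

        @rule
        def atom(pos):
            # Ordered choice: try num first; only if it fails, '(' sum ')'.
            p = num(pos)
            if p is not None:
                return p
            if pos < len(text) and text[pos] == "(":
                p = sum_(pos + 1)
                if p is not None and p < len(text) and text[p] == ")":
                    return p + 1
            return None

        @rule
        def prod(pos):
            p = atom(pos)
            while p is not None and p < len(text) and text[p] == "*":
                nxt = atom(p + 1)
                if nxt is None:
                    break
                p = nxt
            return p

        @rule
        def sum_(pos):
            p = prod(pos)
            while p is not None and p < len(text) and text[p] == "+":
                nxt = prod(p + 1)
                if nxt is None:
                    break
                p = nxt
            return p

        return sum_

    for s in ["1+2*3", "(1+2)*3", "1+*3"]:
        accepted = make_parser(s)(0) == len(s)
        print(s, "->", "accepted" if accepted else "rejected")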


This work has further been extended by Alexander Okhotin into a
variety of "Conjunctive" and "Boolean" grammars. These again have
Turing Machine power, but are usually simpler to understand and reason
about than most Turing Machines.


In a different direction, researchers like Quinn-Tyler Jackson have
developed Adaptive Grammars (see the relevant Wikipedia article).
This is a different solution to the problem of making a constrained
version of Turing Machine power. Notably, they are well suited to
handling "type checking" with the equivalent of the C typedef hack,
but without it being a hack: it is a formalism that can give clear
semantics. Note that this is traditionally very hard to do in
context-free grammars--and as far as I can tell, most people writing
lexers and parsers flounder trying to incorporate even simple type
checking into their grammar, because a context-free grammar is the
wrong tool for the job.
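
To see why it is hard, consider the problem the C typedef hack
addresses: whether "T * x;" declares a pointer or multiplies two
values depends on whether T was earlier declared as a type. Here is a
crude sketch of the usual feedback loop (illustrative Python, not
Jackson's formalism):

    # Whether an identifier should be lexed as TYPE_NAME or IDENT
    # depends on declarations seen earlier -- state that a context-free
    # grammar cannot carry.  The classic workaround feeds a symbol
    # table from the parser back into the lexer.
    typedef_names = {"int", "char"}        # seeded with built-in types

    def classify(word):
        """Lexer-side decision that depends on parser-side feedback."""
        return ("TYPE_NAME" if word in typedef_names else "IDENT", word)

    def declare_typedef(new_name):
        """Parser action for a "typedef <type> <new_name>;" declaration."""
        typedef_names.add(new_name)

    print(classify("word_t"))    # ('IDENT', 'word_t')     before the typedef
    declare_typedef("word_t")
    print(classify("word_t"))    # ('TYPE_NAME', 'word_t') after the feedback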


There are other directions that have been pursued also: scannerless
parser generators, which merge lexing and parsing into one phase; GLR
parsers (and presumably lexers), which deal with ambiguity by building
parse forests rather than just trees/DAGs; and parsing combinators,
which can be used for lexing as well. There are probably even concepts
that I have never read about, or have read about and forgotten.
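
As one small taste, combinators used for lexing might look roughly
like the sketch below; the combinator names are invented for the
example and do not come from any particular library.

    # Each "lexer" is a function from (text, pos) to (value, new_pos)
    # or None.  Combinators build bigger lexers from smaller ones, so
    # the lexical description is ordinary code rather than a separate
    # generator input.

    def char_class(pred):
        def lex(text, pos):
            ok = pos < len(text) and pred(text[pos])
            return (text[pos], pos + 1) if ok else None
        return lex

    def many1(p):                       # one or more repetitions of p
        def lex(text, pos):
            out, r = [], p(text, pos)
            while r is not None:
                out.append(r[0]); pos = r[1]
                r = p(text, pos)
            return ("".join(out), pos) if out else None
        return lex

    def alt(*ps):                       # ordered choice over token kinds
        def lex(text, pos):
            for p in ps:
                r = p(text, pos)
                if r is not None:
                    return r
            return None
        return lex

    def tag(kind, p):                   # attach a token kind to a match
        def lex(text, pos):
            r = p(text, pos)
            return ((kind, r[0]), r[1]) if r is not None else None
        return lex

    number = tag("NUM",   many1(char_class(str.isdigit)))
    name   = tag("IDENT", many1(char_class(str.isalpha)))
    space  = tag("WS",    many1(char_class(str.isspace)))
    token  = alt(number, name, space)

    def tokenize(text):
        pos, toks = 0, []
        while pos < len(text):
            r = token(text, pos)
            if r is None:
                raise SyntaxError("bad input at position %d" % pos)
            tok, pos = r
            if tok[0] != "WS":          # drop whitespace tokens
                toks.append(tok)
        return toks

    print(tokenize("width 42 height 7"))
    # [('IDENT', 'width'), ('NUM', '42'), ('IDENT', 'height'), ('NUM', '7')]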


Many of these advances have been incorporated into lexer generators.
They are [almost] all more powerful lexical analysis languages than
what Flex provides. Most of them are also clearer and easier to reason
about.


So, pardon my knee-jerk reaction to your question. It is simply that
most people aren't aware of all the different technologies that are
available. Flex is fine, but it is certainly not a be-all and end-all
solution. It wouldn't even be the solution I would reach for first.


Kind regards,
Chris


--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

