From: Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups: comp.compilers
Date: Fri, 17 Sep 2021 05:51:32 -0000 (UTC)
Organization: A noiseless patient Spider
References: 21-09-008
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="96248"; mail-complaints-to="abuse@iecc.com"
Keywords: parse, history, comment
Posted-Date: 17 Sep 2021 11:37:26 EDT
On 2021-09-14, Roger L Costello <costello@mitre.org> wrote:
> Lately I have been learning to use Flex & Bison. As I understand it, Flex &
> Bison (and other parser generators) are built on solid theory. As a result,
> those parser generators were created quickly and cheaply using structured,
> well-known techniques. Contrast with parsers developed prior to the theory:
> The earliest parser back in the 1950s used utterly ad hoc techniques to
> analyze the syntax of the source code of programs they were parsing. During
> the 1960s, the field got a lot of academic attention and by the early 1970s,
> parsing was no longer an arcane art. In the 1970s Aho, Ullman, Knuth, and many
> others put parsing techniques solidly on their theoretical feet.
>
> One of the first techniques they (Aho, Ullman, Knuth, and others) espoused was
> to separate lexing from parsing.
This is for one particular pragmatic reason. The LALR(1) parsing
technique uses one token of lookahead to make key parsing decisions.
If you combine lexing with parsing, then some of your original lookahead
symbols may turn into sequences of symbols that share the same initial
element.
E.g. suppose you have a "++" token and a "+=" token, either of which can
appear as the lookahead symbol that determines which way to reduce.
If we don't separate lexing from parsing, so that we have a
character-level grammar, then we no longer have "++" and "+="
tokens; each of these is a sequence of two grammar symbols.
And so in that same reduce situation, we no longer have two different
lookahead tokens for "++" and "+=". The lookahead symbol is just "+".
Fixing this requires techniques like multiple symbols of lookahead:
LR(k). (Not that that is new or anything.)
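To make that concrete, here is a hypothetical Yacc/Bison-style sketch
(the rule and token names are made up for illustration). With a separate
lexer, "++" and "+=" arrive as single tokens; in a character-level
grammar, both alternatives begin with the same '+' character, so one
symbol of lookahead no longer distinguishes them:

    /* Token-level grammar: one lookahead token decides. */
    postfix : expr PLUSPLUS           /* "++" arrives as one token */
            ;
    assign  : expr PLUSEQ expr        /* "+=" arrives as one token */
            ;

    /* Character-level grammar: both alternatives start with '+',
       so a single character of lookahead can no longer decide. */
    postfix : expr '+' '+'
            ;
    assign  : expr '+' '=' expr
            ;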
There are also certain efficiencies to be had from keeping lexing
separate. A dedicated lexical analyzer can recognize tokens "blindingly"
fast, using buffering techniques integrated with the recognizer.
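For instance, a minimal, hypothetical Flex fragment (the token names are
assumed to be declared by the parser) recognizes multi-character
operators in one step and discards whitespace before the parser ever
sees it:

    %%
    "++"       { return PLUSPLUS; }
    "+="       { return PLUSEQ;   }
    [0-9]+     { return NUMBER;   }
    [ \t\n]+   { /* skip whitespace between tokens */ }
    .          { return yytext[0]; }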
> Lexing built upon regular expressions, which
> built upon Finite Automata (FA) theory and Nondeterministic Finite Automata
> (NFA) theory. FA and NFA were brilliantly melded together with the famous
> Kleene Theorem. Parsing built on top of a rich theory of grammars -- Context
> Free Grammars, Context Sensitive Grammars, etc. -- that Chomsky formulated.
> Neat!
However, LR parsing is built on regular expressions again, with
some sophistication thrown in to handle nesting.
You know from ad hoc programming with regexes that you can handle some
forms of nesting if you invoke a regex more than once on a string.
For instance, balanced parentheses can be recognized by a regex-driven
algorithm: use a regex to find every instance of an opening parenthesis,
followed by non-parenthesis characters [^()], followed by a closing
parenthesis; reduce by removing each such match from the input; and
iterate until nothing more matches.
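Here is a rough sketch of that idea in C, using POSIX regexes
(illustrative only; an LR parser is not actually implemented this way).
It repeatedly deletes innermost "( non-parens )" groups, then checks
that no parentheses remain:

    #include <regex.h>
    #include <stdio.h>
    #include <string.h>

    /* Recognize balanced parentheses by iterated regex reduction. */
    static int balanced(char *s)
    {
        regex_t re;
        regmatch_t m;
        int ok;

        /* ERE for an innermost group: '(' then non-parens then ')' */
        regcomp(&re, "\\([^()]*\\)", REG_EXTENDED);

        while (regexec(&re, s, 1, &m, 0) == 0) {
            /* "reduce": splice the matched group out of the input */
            memmove(s + m.rm_so, s + m.rm_eo, strlen(s + m.rm_eo) + 1);
        }

        ok = (strpbrk(s, "()") == NULL);
        regfree(&re);
        return ok;
    }

    int main(void)
    {
        char good[] = "(a(b)(c))d";
        char bad[]  = "(()";
        printf("%d %d\n", balanced(good), balanced(bad)); /* prints 1 0 */
        return 0;
    }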
That sort of gives us an intuition behind LR parsing. At the core of an
LR parser is a state machine, and if we ignore its push-down aspects
(the way its actions maintain a stack), it's just a finite state machine
that can be described by a regex. The terminating states of the regex
basically rewrite something in the stack of symbols, or push something
onto it, and then choose some state to jump to in order to iterate.
> That said, Flex & Bison is old. Has lexing/parsing theory advanced since the
> 1970’s? If yes, are there parser generators available today which are based on
> those advances in lexing/parsing theory? Or does Flex & Bison still represent
> the state-of-the-art in terms of the underlying theory it uses?
Firstly, Flex and Bison have enough power in them to handle
state-of-the-art *languages*.
Parsing theory has long outstripped the practical realm.
Loosely speaking, if you design a computer language whose syntax is so
complicated, with such difficult ambiguities, that it benefits from
being parsed by some technique only available from a paper dated 2020,
it's probably an awful language from a syntactic point of view, one
which ill serves its users in that regard.
(Look at C++, by the way; it's famous for having a large, awful grammar.
Yet the GNU C++ implementation is a hand-maintained recursive-descent
parser that's about a megabyte and a half of C++ code, all in one file.
So for all its complexity, C++ can be handled using approaches that take
advantage of approximately zero advances in parsing since 1965.)
Now, Bison is actively developed, and there are avenues of improvement
in dimensions other than parsing theory. For one thing, Bison does have
some theoretical muscle in it, in the form of support for GLR parsing.
There are other pragmatic improvements in it, like support for push
parsers and re-entrant parsers.
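For example (just the declarations; assume a grammar file that is
otherwise complete), these two Bison directives request a reentrant
("pure") push parser:

    %define api.pure full
    %define api.push-pull push

The generated parser is then driven by handing it one token at a time:
the caller creates a parser state with yypstate_new(), feeds tokens
through yypush_parse(), and destroys the state with yypstate_delete(),
instead of having yyparse() call the lexer itself.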
It has support for multiple programming languages. Its C++ support goes
as far as supporting AST nodes that are objects which can have
destructors.
Fairly recently, support was added to Bison for multiple start symbols
in the same grammar, so you can parse just a subset of a language.
Another thing I believe I saw on the mailing list recently is that
contributions were merged into Bison which allow it to generate
counterexamples for ambiguities, so that it's easier to debug bad
grammars. Previous implementations of Yacc-like parsers just give you
diagnostics like "state 123: reduce-reduce conflict", whose root cause
can be hard to unravel.
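For instance, with recent Bison releases you can ask for counterexamples
when a conflict is reported (here "grammar.y" is a placeholder file
name):

    $ bison -Wcounterexamples grammar.y

The diagnostic then includes example inputs that exhibit each conflict,
rather than just the bare conflict message.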
> [Having been there in the 1970s I can report that the linguists and the computer
> scientists were dimly aware of each other but didn't work together and used
> different terms, e.g., linguists say phrase structure grammars where we would
> say CFG.
> Flex is a rewrite of lex which was a Bell Labs summer project to make
> it easier to turn regular expressions into DFAs (not NFAs) by Eric
> Schmidt. The code was awful; flex was a cleaner rewrite that avoided the
> AT&T software license. Bison was originally a GPL rewrite of yacc but it
> has since added reentrant parsers and GLR parsing for ambiguous grammars.
> They knew about GLR in the 1970s, but Yacc had to run on a 16 bit PDP-11
> and the data structures were too big. -John]
Bison was originally written by the same person, Robert Corbett, who
went on to write Berkeley Yacc.
https://en.wikipedia.org/wiki/Berkeley_Yacc
Berkeley Yacc is far less actively maintained than Bison and has fewer
features. It does support reentrant parsing, in a way that is compatible
with Bison as of around version 2.5.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[No coincidence that it was the same guy, since bison was a fork of
byacc.
The lookahead example of ++ vs += isn't very good since a LALR
parser only needs to look ahead when it reduces but in general you're
right, it'd be a problem. A much worse problem in combined parsers is
getting rid of noise like comments and white space. In a separate lexer
you can easily discard them between tokens, in a combined lex/parse you
have to include them everywhere they can occur. -John]