Re: Regular expressions in lexing and parsing


Related articles
Regular expressions in lexing and parsing ed_davis2@yahoo.com.dmarc.email (Ed Davis) (2019-05-17)
Regular expressions in lexing and parsing jamin.hanson@googlemail.com (Ben Hanson) (2019-05-18)
Re: Regular expressions in lexing and parsing DrDiettrich1@netscape.net (Hans-Peter Diettrich) (2019-05-21)
Re: Regular expressions in lexing and parsing drikosev@gmail.com (Ev. Drikos) (2019-05-23)
Re: Regular expressions in lexing and parsing christopher.f.clark@compiler-resources.com (Christopher F Clark) (2019-06-17)
Re: Regular expressions in lexing and parsing quinn.jackson@ieee.org (Quinn Jackson) (2019-06-18)
Re: Regular expressions in lexing and parsing quinn.jackson@ieee.org (Quinn Jackson) (2019-06-18)
Re: Regular expressions in lexing and parsing 847-115-0292@kylheku.com (Kaz Kylheku) (2019-06-18)
Re: Regular expressions in lexing and parsing christopher.f.clark@compiler-resources.com (Christopher F Clark) (2019-06-18)

From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Date: Mon, 17 Jun 2019 08:37:44 -0400
Organization: Compilers Central
Keywords: lex, parse, design
Posted-Date: 18 Jun 2019 13:09:03 EDT

Hans-Peter Diettrich <DrDiettrich1@netscape.net> wrote:


> I think that regular expressions are primarily usable in *formal* token
> grammars. In contrast to parsers I'm missing *complete* formal grammars
> for the tokens, including whitespace, comments, strings and other
> special language elements, and how these fit together.


I definitely would not discount the use of formal grammars, both for
lexing (i.e., formal regexes) and for parsing, nor the use of tools.
When we did Yacc++, we specifically made our lexer syntax look like
our parser syntax (using LR yacc-style productions extended with
regular expressions on the right-hand side) to make the notation
formal, regular, and simple. I would call that one of our successes.
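
As an illustration (my own sketch, not literal Yacc++ input; the
actual Yacc++ syntax differs in its details), the idea is that token
rules read just like parser productions, with regular-expression
operators allowed on the right-hand side:

    // Token rules written as productions, regex operators on the RHS:
    identifier : [A-Za-z_] [A-Za-z0-9_]* ;
    number     : [0-9]+ ('.' [0-9]+)? ;

    // Parser rules use the same production syntax, so the lexer and
    // parser halves of the specification read uniformly:
    expr : term (('+' | '-') term)* ;
    term : factor (('*' | '/') factor)* ;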


One of the advantages of having a consistent notation and using it in
a formal manner is that grammars can often be taken almost verbatim
from a standard, with the missing parts easily inferred. It's
amazing how much most lexers *should* look like each other. One of
our goals was to standardize some of the "tricks of the trade" (or
hacks, if you prefer) so that they became part of the standard
lexicon and could be reused as languages evolved.


Not everyone needs PL/I-style keywords (in PL/I, keywords are not
reserved, so IF and THEN remain legal identifiers), but they turn out
to be more useful than one would expect for a dead language, and
standardizing how that is done turns it into less of a hack and
allows better reasoning about it. The same argument can be made for
the different forms of comments, those that nest and those that
don't. Nesting comments aren't hard if your lexer can actually do LR
lexing rather than just regular expressions. However, the point is to
gather such things together and implement them in a standardized and
verifiable way, so that there is some level of formal correctness.
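
To make the nesting point concrete, here is a sketch (in the same
illustrative notation, not any particular tool's input) of nested
comments as an LR lexer rule; the recursive reference to comment in
its own body is precisely what a pure regular expression cannot
express:

    comment      : '/*' comment_body '*/' ;
    comment_body :                            // empty
                 | comment_body COMMENT_TEXT  // text containing no delimiter
                 | comment_body comment       // the recursive case: nesting
                 ;

An LR machine handles the recursion with its stack, whereas a finite
automaton has no way to count unmatched '/*' delimiters.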


And that would be my criticism of hand-written recursive-descent
parsers. You can do amazing things with them, but you can also
produce an unmaintainable mess where no one is quite sure what is
legal and what is not. Reading the source of the compiler is not the
solution to that problem.


Anyway, I just wanted to point out that I did exactly the above when
I sat down to write the front end for the Intel Vmod project: I took
the 1995 Verilog standard and copied its grammar almost verbatim as
the source of the lexer and parser. It took about two days, and when
I was done I had a roughly working parser; after a couple of weeks it
was essentially fully debugged. That's the advantage of using a
formal-style grammar and tools: you get something that actually works
for some relatively obscure edge cases quite quickly, and you can be
confident of it. And no, it isn't just me who can do that. One of our
first customers implemented SGML out of the box the same way, and I
cannot recall how many times I worked with customers implementing
SQL. There are some corner cases having to do with joins that are
hard to get right. The main point is that when you are done, you have
something reasonably readable that you can have confidence in.


So, if someone were to seriously argue against formal grammars (and
formal lexing specs too), I would challenge them to prove their
contentions.


Yes, I can see that calling libPCRE with strings that have lots of
backreferences, assertions, fiddling with greedy vs. non-greedy
operators, etc. is not particularly maintainable. However, a simple
lexer, with simple regular expressions following patterns that have
been shown to work in language after language, is worth striving for
(and in many cases not that difficult).
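
For instance, a minimal flex-style specification covering the usual
token classes looks something like this (the token names are my own;
a real lexer would add keywords and the full operator set):

    %%
    [ \t\r\n]+              ;                 /* skip whitespace */
    "//"[^\n]*              ;                 /* skip line comments */
    [A-Za-z_][A-Za-z0-9_]*  return IDENT;     /* identifiers; keywords via table lookup */
    [0-9]+                  return NUMBER;    /* integer literals */
    \"([^"\\\n]|\\.)*\"     return STRING;    /* strings with escapes */
    .                       return yytext[0]; /* single-character operators */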


I will let someone else have the soapbox now....


--
******************************************************************************
Chris Clark                 email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.    Web Site: http://world.std.com/~compres
23 Bailey Rd                voice: (508) 435-5016
Berlin, MA 01503 USA        twitter: @intel_chris

