Regular expressions in lexing and parsing




From: Ed Davis <ed_davis2@yahoo.com.dmarc.email>
Newsgroups: comp.compilers
Date: Fri, 17 May 2019 13:18:34 +0000 (UTC)
Organization: Compilers Central
Keywords: lex, parse, comment
Posted-Date: 17 May 2019 10:57:27 EDT

This is a blog post from Rob Pike, of Plan 9 and Go fame; I found it pretty
interesting.


Regular expressions in lexing and parsing
August 22, 2011


Comments extracted from a code review. I've been asked to disseminate them
more widely.


I should say something about regular expressions in lexing and parsing.
Regular expressions are hard to write, hard to write well, and can be
expensive relative to other technologies. (Even when they are implemented
correctly in N*M time, they have significant overheads, especially if they
must capture the output.) Lexers, on the other hand, are fairly easy to write
correctly (if not as compactly), and very easy to test. Consider finding
alphanumeric identifiers. It's not too hard to write the regexp (something
like "[a-ZA-Z_][a-ZA-Z_0-9]*"), but really not much harder to write as a
simple loop. The performance of the loop, though, will be much higher and will
involve much less code under the covers. A regular expression library is a big
thing. Using one to parse identifiers is like using a Mack truck to go to the
store for milk. And when we want to adjust our lexer to admit other character
types, such as Unicode identifiers, and handle normalization, and so on, the
hand-written loop can cope easily but the regexp approach will break down.
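
For concreteness, a minimal sketch of such a hand-written loop in Go might
look like this (the names are illustrative, and it is ASCII-only, matching
the regexp above):

    package main

    import "fmt"

    // scanIdent returns the identifier starting at src[pos] (empty if there
    // is none) together with the index just past it. The first character may
    // not be a digit, mirroring [a-zA-Z_][a-zA-Z_0-9]*.
    func scanIdent(src string, pos int) (string, int) {
        start := pos
        for pos < len(src) {
            c := src[pos]
            if c == '_' ||
                (c >= 'a' && c <= 'z') ||
                (c >= 'A' && c <= 'Z') ||
                (pos > start && c >= '0' && c <= '9') {
                pos++
                continue
            }
            break
        }
        return src[start:pos], pos
    }

    func main() {
        id, next := scanIdent("count1 = 0", 0)
        fmt.Println(id, next) // count1 6
    }

Extending the loop to Unicode identifiers is a matter of iterating over runes
and swapping the ASCII tests for unicode.IsLetter and unicode.IsDigit; there
is no pattern syntax to outgrow.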


A similar argument applies to parsing. Using regular expressions to explore
the parse state to find the way forward is expensive, overkill, and
error-prone. Standard lexing and parsing techniques are so easy to write, so
general, and so adaptable that there's no reason to use regular expressions.
They also result in much faster, safer, and more compact implementations.
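
As one concrete reading of "standard lexing and parsing techniques", a toy
recursive-descent parser for sums of single digits might look like the
following sketch (the grammar and names are purely illustrative):

    package main

    import "fmt"

    // Grammar: expr := term { '+' term }   term := digit
    // The parser evaluates as it goes; error handling is omitted for brevity.
    type parser struct {
        in  string
        pos int
    }

    func (p *parser) expr() int {
        v := p.term()
        for p.pos < len(p.in) && p.in[p.pos] == '+' {
            p.pos++ // consume '+'
            v += p.term()
        }
        return v
    }

    func (p *parser) term() int {
        c := p.in[p.pos] // assumes a well-formed digit here
        p.pos++
        return int(c - '0')
    }

    func main() {
        p := &parser{in: "1+2+3"}
        fmt.Println(p.expr()) // 6
    }

Each grammar rule becomes a plain function, so extending the language or
improving error messages is ordinary programming rather than pattern surgery.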


Another way to look at it is that lexers and parsers are matching
statically-defined patterns, but regular expressions' strength is that they
provide a way to express patterns dynamically. They're great in text editors
and search tools, but when you know at compile time what all the things are
you're looking for, regular expressions bring far more generality and
flexibility than you need.


Finally, on the point about writing well. Regular expressions are, in my
experience, widely misunderstood and abused. When I do code reviews involving
regular expressions, I fix up a far higher fraction of the regular expressions
in the code than I do regular statements. This is a sign of misuse: most
programmers (no finger pointing here, just observing a generality) simply
don't know what they are or how to use them correctly. Encouraging regular
expressions as a panacea for all text processing problems is not only lazy and
poor engineering, it also reinforces their use by people who shouldn't be
using them at all.


So don't write lexers and parsers with regular expressions as the starting
point. Your code will be faster, cleaner, and much easier to understand and to
maintain.
--


[Originally here: https://commandcenter.blogspot.com/2011/08/regular-expressions-in-lexing-and.html


It took me a minute to figure out what he was saying, since the patterns in a lexer are regular
expressions. I believe the point is not to use general-purpose regex libraries but rather to
use something like flex or re2c, which take a set of expressions and actions and
precompile them. -John]

