From: | spinoza1111@yahoo.com (Edward G. Nilges) |
Newsgroups: | comp.compilers |
Date: | 14 Oct 2001 22:27:09 -0400 |
Organization: | http://groups.google.com/ |
References: | 01-10-029 |
Keywords: | lex |
Posted-Date: | 14 Oct 2001 22:27:09 EDT |
"Jon Forrest" <forrest@ce.berkeley.edu> wrote:
> I've been looking at some pretty hairy regular expressions (in Perl)
> recently that are virtually unreadable. The thought crossed my mind
> that maybe there are some alternatives to classic regular expression
> syntax out there. Note that I'm only talking about the syntax - it's
> hard to beat the semantics of regular expressions.
>
> Does anybody know of anything?
Thanks for your request, Jon. I have been concerned for some time,
and the following is my contribution to the discussion.
Regular Expressions Considered Harmful
10-14-2001 Edward G. Nilges
Regular expressions ARE overly obfuscated and difficult to construct,
debug and verify. They often force the re coder to enter the same
regular subexpression several times and to verify that it is the same
in different locations. Frequently, regular expressions can look
valid and appear to do a parsing job while in actuality they fail in
subtle ways. The characters used are based on unix standards
developed some time ago and as a result regular expressions appear
even to the initiate as Runes carved by trolls from the Dark Ages.
Fortunately the theory behind regular expressions provides an escape
(sic) from these Runes. The original developers of regular
expressions, among them Brian Kernighan and Dennis Ritchie, appear to
be aware of work conducted at Princeton University and Bell Labs
around 1971 by, among others Jeffery Ullman and Jeff Hopcroft. In
Hopcroft and Ullman's 1973 book FORMAL LANGUAGES AND THEIR RELATION TO
AUTOMATA they were among the first to reveal the discovery that
regular expressions corresponded to a particular type of language,
"Chomsky Type 0" which they named in honor of MIT's Noam Chomsky who
is both a pioneer in linguistics and a political gadfly.
More important they show in the book an ascending hierarchy of
languages that contain each other, isomorphic to an ascending category
of formal mathematical machines that also nest. All "regular"
languages are specifiable by a regular expression, and can be parsed
by a finite automaton. One level up, there is a wider class of
languages (such as languages that use nested, balancing parentheses)
that are neither specifiable by an re nor computable by finite
automata. However, languages in this class are specifiable by a
Backus-Naur form formal grammar and computable by a coding of this
grammar…which can be generated "by hand" using (for example) a
straightforward recursive descent grammar, or automatically using
yacc.
The interesting thing is that ALL languages Chomsky type 0 can be
parsed using the BNF parser because its class contains all regular
expressions as a sidelight!
This has an immediate practical implication.
We can ELIMINATE regular expressions and replace them by Backus-Naur
grammars.
Backus-Naur grammars are more readable, by several orders of
magnitude, than regular expressions.
Consider the Runic regular expression for a telephone number. We code
for the case of a USA phone number with an optional three-digit area
code, a single space between the area code, a required three-digit
exchange, a required dash, and a required four-digit suffix.
^(\([0-9]{3}\)[ ]{1}){0,1}[0-9]{3}\-[0-9]{4}$ Yecchhhh
Note that this hard-to-read Rune casts in stone a form of
technological ethnocentrism, for it declares in a way that cannot
easily be changed that this is the ONLY phone number we will process.
Software internationalization becomes expensive to places like Taiwan
where population density expands the exchange to four digits, but
smaller size of the country reduces the area code to a two or one
digit city code.
I understand that if you know the notation, the change for my example
of Taiwan is easy, involving only the change of the max digit counts
{3} in two locations. But the change should be immediately obvious to
the person trained in RE technology and it is not. Furthermore,
regular expressions are now in common use in MIS venues where the
programmers did not major in computer science, and many trained VB.Net
and Visual Basic programmers may resent their mates who create
"unmaintainable" regular expressions, where "unmaintainable" has its
vernacular meaning, "I no understand."
Consider the BNF. Here, we have two types of "terminal" symbols:
upper case symbols considered as understandable from the BNF alone,
and obvious regular expressions, for we can assume that a regular
expression tool is available to the BNF parser generator.
phoneNumber := STARTOFINPUT phoneNumberBody ENDOFINPUT
phoneNumberBody := localPhoneNumber
phoneNumberBody := areaCode SPACE localPhoneNumber
areaCode := [0-9]{3}
localPhoneNumber := prefix DASH suffix
prefix := [0-9]{3}
suffix := [0-9]{4}
At a stroke this addresses all of the problems of regular expressions.
(1) BNF is more powerful. All regular expressions can be, per
Hopcroft and Ullman's 1973 results, coded as BNF.
(2) BNF is more readable. Each subsequence can be given a fully
meaningful name, and the multiple-line structure of BNF means that it
can be decorated with explanatory comments, both between lines and at
the end of lines.
(3) BNF can use regular expressions to unpack terminals. Regular
expressions cannot use BNF.
(4) Regular expression character conventions clash with XML, but
except insofar as BNF uses regular expressions, BNF is completely XML
friendly. This means that BNF can be readily combined with XML to
create powerful XML tags that, in the spirit of XML, self-document
languages contained in the rest of the tag.
Now, one practical reason for continuing to use the Runes, regular
expressions, is the existence of software to process regular
expressions. Regular expression processing is available for free in
Linux and commonly in unix. On Microsoft platforms, regular
expressions have long been available in C++ and are now available in
Visual Basic and VB.Net.
Software support for BNF that is commonly available is restricted to
yacc which is for the most part tied to C and insufficiently
incremental in that most yacc inplementations generate the complete
parser for the complete language. Therefore I am developing a
package, implemented in VB.Net for the Common Language Runtime, to
generate recursive descent parsers from BNF.
However, it should be noted that a skilled programmer DOES NOT NEED A
TOOL to generate a rugged parser from the BNF alone. Simply start at
the start symbol of the BNF (here, phoneNumber.) Write a Boolean
function in your favorite language to return True if the grammar class
of the symbol is found, False otherwise. This function needs to be
passed, by reference, an index to the input: by value, the input
string, and, by value, the end of the input string currently parsed.
Note that the last parameter is needed whenever we recursively parse a
parenthesized expression, due to a limitation in recursive descent
that forces us, in certain contexts, to find a balancing right
parenthesis with a separate scan.)
The toplevel code for the example looks APPROXIMATELY like this in
VB.Net (I understand that VB. is considered to be for Neanderthal Men
and that most compiler work uses C and C++. Having used the latter
languages extensively I prefer VB.Net because in my platform of
current choice, Microsoft's Common Language Runtime, C++ preserves too
many legacy hooks and snags to COM, and the CLR factors out ANY
performance differences. You can now write production compilers in
Visual Basic.)
Private Function parsePhoneNumber(ByVal strInstring As String, _
ByRef intIndex As Integer, _
ByVal intEndIndex As Integer) _
As Boolean
areaCode(strInstring, intIndex, intEndIndex)
Return(localPhoneNumber(strInstring, intIndex, intEndIndex)
End Function
Note that each symbol in the grammar is implemented by a function such
as localPhoneNumber and that these functions advance the index.
Sometimes their higher-level caller calls them merely to advance the
index if the category exists at all, as in the call of areacode.
Other times, as in the call of localPhoneNumber, the grammar category
is required so we use an actual function call.
The original motivation for this work was the need for intelligent
comparision of files with syntax as in the case where you desire to
compare source code. A variety of rather half-assed solutions to this
problem (such as WinDiff) exist which replace thought by pretty
graphics that in the real world generate dog-work: for the pretty
graphics are for the most part non-reusable bit maps that are based on
the silly belief that source code comes in lines and not in syntax
categories: consider the difference between "lines 100-1432 are just
different, and good luck to you" and "in the alpha procedure the
symbol beta needs to be replaced by theta."
One reason why "regular expressions" are considered an advanced
feature to be used with fear and awe by harried coders, who lack the
cultured leisure needed to understand what they are doing, is a form
of Platonic idealism in which the programming manager believes that if
a package such as WinDiff presents a beautiful picture, the programmer
need only labor in this world of pure forms happily, ever after.
The reality is that software correctness is not Platonic but is
constituted by labor. Thus the difference between two files is not a
pure form at all, but the minimal amount of labor needed to transform
file a to file b expressed in the terms of the syntax of the language
of the files.
The fact that I express this as both minimal and as labor may
scandalize the managerial mind for it sounds almost Marxist and like a
highclass form of goofing off. Note that real time saving isn't
constituted in getting a bitmap but in mathematically minimizing
unnecessary steps. Note that my focus happens to be on minimizing the
labor of the person who must intelligently account for file
differences.
Regular expressions have their place, indeed I use them in BNF to
define terminals. Unfortunately, like many "reusable packages" which
have, in MIS, a systematic tendency to turn upon their programmer
users in a form of alienation, the spirit of fear and awe in which
this Runic language is approached turns good programmers into trolls
from the dark ages in what I consider is a dialectical reversal of
enlightenment.
[Yes, we all know that regular expressions are a subset of BNF, and
it's certainly possible to write regexps using BNF. But the important
difference is that you can translate a regexp into a simple state
machine, while BNF requires, depending on how full a version of BNF
you want to handle, at least a state machine with a state stack and
possibly a full Earley or Tomasulo parser which is large and slow.
Lexers and parsers are both useful, but they're not the same
thing. -John]
Return to the
comp.compilers page.
Search the
comp.compilers archives again.