Re: LL(1) grammar for regex

Chris F Clark <cfc@shell01.TheWorld.com>
27 Mar 2006 01:26:17 -0500

          From comp.compilers

Related articles
LL(1) grammar for regex not@www-no.ucsd.edu (James Brown \[MVP\]) (2006-03-22)
Re: LL(1) grammar for regex schmitz@i3s.unice.fr (Sylvain Schmitz) (2006-03-27)
Re: LL(1) grammar for regex cfc@shell01.TheWorld.com (Chris F Clark) (2006-03-27)
| List of all articles for this month |

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 27 Mar 2006 01:26:17 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 06-03-074
Keywords: parse
Posted-Date: 27 Mar 2006 01:26:17 EST

You are right. Your rule is wrong, well it's not LL(1).


However, so was your previous rule the one you presuambly took from
the web site, but you got away with it because your hand-translation
didn't match your rule. Your rules are not left-factored, meaning
each distinct alternative needs to start with a unique terminal, for
your grammar to be LL(1). Now, you got away with it, because your
implementation kind of fudged it and didn't decide which alternative
to apply at the beginning of parsing of the rule.


An LL(1) correct (i.e. left-factored) version of your rep rule would be:


rep ::= atom morerep


morerep ::= '*'
                    | /* empty */


As a theoretical aside, it is worth noting that your rules are fine
as-is for LR parsing or for LL(infinite) parsing, because neither of
them requires left factoring. However, for strict LL(k), you need to
factor so that the first k tokens are unique to each alternative. If
I calculate it right, your version with the . for concat is LL(2). I
think the version without the . is also, but I don't compute LL
parsers well in my head--I would use a tool to do so. Your hand
translation is close to the LL(infinite) translation of your rules,
but without the correct theoretical underpinnings, so that you have
introduced bugs, which is why your loop consumes all the rest of the
input.


My suggestion would be to use one of the *many* fine LL parser
generator tools that generate recursive descent parsers. You'll get a
correct parser (and a correct set of rules). There is no need to be
mucking around writing incorrect rules and then translating them
incorrectly and thinking you have something that works. Not when
there are simple to use tools that will help you get it right. You
are simply setting yourself up to make something that works for all
the cases you have tested it on, but not necessarily working on any
cases you didn't think of.


I apologize if this judgment seems unduly harsh. I would be less
harsh, if you were doing simple exercises to learn LL parsing. Then,
doing a few translations by hand makes sense at it will help inform
your intuition, so that when you later use a tool to do real work, you
can understand why the tool is telling you that you have made a
mistake. Moreover, making mistakes while doing practice exrecises
doesn't expose the world to bugs and can actually be a learning
experience.


However, for doing any real work, even just for oneself, one should
use a tool. In the long run, it is not only more correct, but it
allows one to work much faster.


To the point, I just knocked off a quick parser for some test scripts,
maybe 30 rules total, both lexer and parser. So, it was a small
grammar. By using a tool and cribbing parts from previous grammars, I
had the parser written in a couple of hours. If I wrote the same
thing by hand, I would have expected to take a couple of days and some
of the things that I was able to do because I had "canned" solutions
for, I would probably never have attempted in a hand version. Thus, I
have a better script reader because I used a tool and I have well over
a day to concentrate on other aspects of my job.


Further down the spectrum, if I had wanted an extensive scripting
language, something more like 100-200 rules, then even using a tool
would probably be the wrong approach. A better approach for an
extensive scripting language, is to use one of the canned
interpreters, such as lua and tcl, that already exist. It's the same
trade-off again. At some level of complexity, it is much faster to
use an existing script interpreter than it is to write ones own, even
with tool support.


Given that, why are you writing, yet another regex package? Isn't
there a canned one that does what you want? There have to be almost
as many regex packages as there are LL parser generators. I can only
hope you are doing so as part of a learning experience.


At this point I normally say, (and hope it doesn't sound sarcastic here)


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.