Re: regular expression question

"Skandinavisches Seminar" <skandgoe@gwdg.de>
10 Jun 2005 22:17:02 -0400

          From comp.compilers

Related articles
[5 earlier articles]
Re: regular expression question nicola.musatti@gmail.com (Nicola Musatti) (2005-06-09)
Re: regular expression question cfc@shell01.TheWorld.com (Chris F Clark) (2005-06-09)
Re: regular expression question snicol@apk.net (Scott Nicol) (2005-06-10)
Re: regular expression question snicol@apk.net (Scott Nicol) (2005-06-10)
Re: regular expression question d148f3wg02@sneakemail.com (Karsten Nyblad) (2005-06-10)
Re: regular expression question torbenm@diku.dk (2005-06-10)
Re: regular expression question skandgoe@gwdg.de (Skandinavisches Seminar) (2005-06-10)
Re: regular expression question mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2005-06-12)
| List of all articles for this month |

From: "Skandinavisches Seminar" <skandgoe@gwdg.de>
Newsgroups: comp.compilers
Date: 10 Jun 2005 22:17:02 -0400
Organization: GWDG, Goettingen
References: 05-06-04505-06-050 05-06-053
Keywords: lex
Posted-Date: 10 Jun 2005 22:17:02 EDT

"Gijs" <gvheurn@gmail.com> schrieb im Newsbeitrag
> So I was hoping to easily invert a RE. But now I think I'll get a
> rather complex RE, which I think the database field is too small
> for. ;-) [...] if anyone has some other ideas, I'll really
> appreciate this!


I don't quite understand what the use is of matching everything except
for a certain string. As far as "other ideas" are concerned, my
suggestion would be to consider whether using regular expressions is
really the best way of getting what you want. Regular expressions are
great, but I think it's sometimes better to collect "symbolic tokens"
character-by-character instead. This is how Donald Knuth implemented
scanning in TeX and Metafont, and I've "borrowed" the technique for
GNU 3DLDF. It's simple, works quite nicely, is easy to implement, and
there's no backtracking (unless one considers a single character of
lookahead backtracking).


The basic idea is that each valid character (i.e., valid for a given
application) is either associated with terminal symbol, i.e., its
meaning is "hardwired", or it belongs to a "category".


For example, in GNU 3DLDF, letters and '_' belong to category 0,
'<', '=', '>', ':', and '|' belong to category 1, etc. A "symbolic token"
can only consist of characters belonging to a single category. So as
soon as a character from another category is found, the scanner stops
collecting and proceeds to the next step. This can differ depending on the
category of the symbolic token. It might return the category code
immediately to the parsing function or it might look up the name formed by
the
characters in a table. If the character is '"', it might collecting
additional characters for a string until a terminating '"' is found. There
are other possibilities as well.


If you find this idea interesting, my code is available at:
http://savannah.gnu.org/cgi-bin/viewcvs/3dldf/3dldf/Group/CWEB/scan.web


Laurence Finston
http://www.gnu.org/software/3dldf/LDF.html
lfinsto1@gwdg.de


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.