Related articles |
---|
ANTLR Grammar Question akhailtash@gmail.com (Amal) (2007-05-14) |
Re: ANTLR Grammar Question cfc@shell01.TheWorld.com (Chris F Clark) (2007-05-16) |
Re: ANTLR Grammar Question idbaxter@semdesigns.com (Ira Baxter) (2007-05-18) |
From: | "Ira Baxter" <idbaxter@semdesigns.com> |
Newsgroups: | comp.compilers |
Date: | 18 May 2007 00:29:00 -0400 |
Organization: | NewsGuy - Unlimited Usenet $19.95 |
References: | 07-05-052 07-05-063 |
Keywords: | PCCTS, lex |
Posted-Date: | 18 May 2007 00:28:59 EDT |
"Chris F Clark" <cfc@shell01.TheWorld.com> wrote in message
> Amal <akhailtash@gmail.com> writes:
>
> This is a very typical thorny problem with regular expressions. It
> comes up all the time when you want to swallow all the text (sigma*) up to
> some multiple character end-marker. Another example of this is C
> comments which can contain any character sequence except "*/" (a two
> character end-marker). [snip]
>
> In theory, one should just write:
> TK_TEXT: .* - "$end" ; // . is the typical notation for sigma
>
> That works (see below) in theory because regular expressions are
> closed under complement and intersection, thus you can take the set
> difference. Unfortunately, very few (perhaps no) lexer/parser
> generators implement a difference operator (directly).
>
> [The underappreciated re2c lexer generator has a difference operator. -John]
DMS doesn't have a regular expression difference operator, but does
have complement and intersection, and these can be used to compute the
difference easily. This kind of comment is common enough that we use an
idiom involving these two operators.
DMS's way of describing the classic C comment is:
#macro arbitrarystring "~[]*"
#macro start " \/ \* "
#macro end " \*\/ "
#token comment " <start>
( <arbitrarystring>
&& ~~ <arbitrarystring> <end>
<arbitrarystring> )
<end>"
I've used macros to make it more readable.
The interesting operators here are "~~", which computes
the complement of the following regexp, and "&&",
which computes the intersection of two regexps.
The basic idea is that the comment body is any arbitrary
string of characters that doesn't contain
the comment end marker. The idea of "contain"
is modelled by the phrase following the "~~".
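To make the complement idea concrete, here is a minimal Python sketch
(not DMS code, and not a claim about how DMS implements "~~"). It
builds a small DFA for "any string containing the end marker" and
complements it by flipping the accepting states. The names
make_contains_dfa, lacks_marker and is_c_comment are hypothetical,
chosen only for this illustration.

```python
def make_contains_dfa(marker):
    """DFA for 'sigma* marker sigma*'.

    State k means the longest suffix of the input read so far that is
    also a prefix of marker has length k.
    """
    n = len(marker)

    def step(state, ch):
        if state == n:              # marker already seen: absorbing state
            return n
        seen = marker[:state] + ch
        # Longest prefix of marker that is a suffix of what we have seen.
        for k in range(min(n, len(seen)), -1, -1):
            if seen.endswith(marker[:k]):
                return k

    return step, {n}                # transition function, accepting states


def lacks_marker(text, marker):
    """Complement of the 'contains marker' language: flip acceptance."""
    step, accepting = make_contains_dfa(marker)
    state = 0
    for ch in text:
        state = step(state, ch)
    return state not in accepting


def is_c_comment(text, start="/*", end="*/"):
    """<start> ( sigma* && ~~ sigma* <end> sigma* ) <end>"""
    if len(text) < len(start) + len(end):
        return False
    if not (text.startswith(start) and text.endswith(end)):
        return False
    body = text[len(start):len(text) - len(end)]
    # Intersecting with sigma* changes nothing, so only the complement matters.
    return lacks_marker(body, end)
```

For example, is_c_comment("/* a * b / c */") holds, while
is_c_comment("/* a */ b */") does not, because the body contains the
end marker.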
Amal's example would be:
#macro newline " \d\n? | \n "
#macro arbitrarystring "~[]*"
#macro end "\$end"
#token comment " \$comment <newline>
( ~[]*
&& ~~ ~[]* <newline><end> ~[]* )
<newline><end>"
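For readers without DMS, roughly the same token can be sketched with
Python's re module. Python has no complement or intersection operators,
so this swaps in a negative lookahead to express "the body does not
contain <newline><end>". The reading of the newline macro as CR, CRLF
or LF is an assumption, and match_comment is a hypothetical helper name.

```python
import re

NEWLINE = r"(?:\r\n?|\n)"   # assumed reading of the <newline> macro
END = r"\$end"

COMMENT = re.compile(
    r"\$comment" + NEWLINE
    # Body: any characters, as long as <newline><end> does not start here.
    + r"((?:(?!" + NEWLINE + END + r")[\s\S])*)"
    + NEWLINE + END
)


def match_comment(text, pos=0):
    """Return (body, end_position) for a $comment ... $end block, or None."""
    m = COMMENT.match(text, pos)
    return None if m is None else (m.group(1), m.end())
```

Note that a "$end" in the middle of a line does not terminate the
block; only one that directly follows a newline does.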
One of the nastier entities to lex is one of PHP's string literal types,
in which the string quotes are user supplied:
<TNEMMOC
...arbitrary text...
>TNEMMOC
Regular expressions just don't help here.
This simply requires an ugly lookahead hack that
recognizes the prefix quote by use of a regular expression,
and then invokes a procedure that repeatedly
collects one more character,
and matches the end of the collected set
against the captured string quote. Ick.
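A rough Python sketch of that hack, assuming PHP's heredoc form in
which the opener is "<<<" followed by a user-chosen tag and the string
ends at a line beginning with the same tag. The function name
scan_heredoc is hypothetical, and the sketch skips details a real lexer
needs, such as checking what follows the closing tag.

```python
import re

# Prefix quote: "<<<" plus an identifier, which must be followed by a newline.
OPENER = re.compile(r"<<<([A-Za-z_][A-Za-z0-9_]*)(?=\n)")


def scan_heredoc(text, pos=0):
    """Return (tag, body, end_position), or None if no complete literal."""
    m = OPENER.match(text, pos)
    if m is None:
        return None
    tag = m.group(1)
    terminator = "\n" + tag         # closing tag must start a line
    collected = ""
    i = m.end()
    while i < len(text):
        collected += text[i]        # collect one more character
        i += 1
        # Match the end of the collected text against the captured tag.
        if collected.endswith(terminator):
            body = collected[1:len(collected) - len(terminator)]
            return tag, body, i
    return None                     # ran off the end: unterminated literal
```

The regular expression only finds the opening tag; everything after
that is ordinary procedural code, which is the point of the complaint.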
-- IDB