Re: Pattern Matching with Syntax Analyzer

"Ira Baxter" <>
13 Dec 2004 02:05:03 -0500

          From comp.compilers

Related articles
Pattern Matching with Syntax Analyzer (2004-12-05)
Re: Pattern Matching with Syntax Analyzer (Ira Baxter) (2004-12-13)
| List of all articles for this month |

From: "Ira Baxter" <>
Newsgroups: comp.compilers
Date: 13 Dec 2004 02:05:03 -0500
References: 04-12-026
Keywords: parse
Posted-Date: 13 Dec 2004 02:05:03 EST

"Reza Ferry" <> wrote
> Right now I'm trying to find patterns in a html page. For
> example I am looking for patterns in the form of:
> [snip] I am using a syntax analyzer (javacup) to do this.
> I have a Document which basically can consist of several paragraph
> start tags (Div, p, span), end tags, text, a tags, and separators
> (respectively b, e, t, a, s)
> a document is basically a combination of those tags (I don't care
> about the order in the document)
> D -> DC | empty
> C -> b|e|s|t|a
> That rule will enable me to accept any simplified html document
> However because I'm trying to match a particular pattern I must also
> detect the following rules
> H1 -> s b^m t* e^n s
> H2 -> s b^m t* s e
> H3 -> b s t* e^n s
> H4 -> b s t* s e
> b^m means a sequence of m number of 'b'

Do you insist on a specific numbers n and m?
Rather it appears that you are interested in n and m=1 and >1.
Would the rules

H1 -> s b+ t* e e+ s
H2 -> s b+ t* s e
H3 -> b s t* e e+ s
H4 -> b s t* s e

do the job? If so, then a "deterministic" parsing technology
(accepts first match, such as I beleive that JCup is,
better facts welcome) will likely do the trick.

If these rules overlap (can produce ambiguous parses), then you need a
parser that can provide *all* the possible parses ("matches"). GLR is
a good technology for that.

Our DMS Software Reengineering toolkit contains a GLR parser, and
could easily do this example.
Ira D. Baxter, Ph.D., CTO 512-250-1018
Semantic Designs, Inc.

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.