|Pattern Matching with Syntax Analyzer email@example.com (2004-12-05)|
|Re: Pattern Matching with Syntax Analyzer firstname.lastname@example.org (Ira Baxter) (2004-12-13)|
|From:||"Ira Baxter" <email@example.com>|
|Date:||13 Dec 2004 02:05:03 -0500|
|Posted-Date:||13 Dec 2004 02:05:03 EST|
"Reza Ferry" <firstname.lastname@example.org> wrote
> Right now I'm trying to find patterns in a html page. For
> example I am looking for patterns in the form of:
> [snip] I am using a syntax analyzer (javacup) to do this.
> I have a Document which basically can consist of several paragraph
> start tags (Div, p, span), end tags, text, a tags, and separators
> (respectively b, e, t, a, s)
> a document is basically a combination of those tags (I don't care
> about the order in the document)
> D -> DC | empty
> C -> b|e|s|t|a
> That rule will enable me to accept any simplified html document
> However because I'm trying to match a particular pattern I must also
> detect the following rules
> H1 -> s b^m t* e^n s
> H2 -> s b^m t* s e
> H3 -> b s t* e^n s
> H4 -> b s t* s e
> b^m means a sequence of m number of 'b'
Do you insist on specific values of n and m?
It appears instead that you only care whether n and m are 1 or >1.
Would the rules
H1 -> s b+ t* e e+ s
H2 -> s b+ t* s e
H3 -> b s t* e e+ s
H4 -> b s t* s e
do the job? If so, then a "deterministic" parsing technology
(one that accepts the first match, as I believe JCup does;
better facts welcome) will likely do the trick.
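
As a quick way to experiment with the rewritten rules before committing to a grammar, they can be tried as ordinary regular expressions over a string of token letters, since they use only sequencing, + and *. This is a rough sketch, not a JavaCUP grammar; the names RULES and first_matches are made up for illustration, and it assumes the page has already been tokenized into the letters b, e, t, a, s from the original post.

```python
import re

# Token letters follow the original post:
# b = start tag, e = end tag, t = text, a = anchor tag, s = separator.
# RULES and first_matches are hypothetical names used only in this sketch.
RULES = {
    "H1": re.compile(r"sb+t*ee+s"),
    "H2": re.compile(r"sb+t*se"),
    "H3": re.compile(r"bst*ee+s"),
    "H4": re.compile(r"bst*se"),
}

def first_matches(tokens):
    """Return (rule, start, end) for the leftmost match of each rule, if any."""
    found = []
    for name, pattern in RULES.items():
        m = pattern.search(tokens)
        if m:
            found.append((name, m.start(), m.end()))
    return found

if __name__ == "__main__":
    print(first_matches("asbbtteesa"))
    print(first_matches("bstse"))
```

Each rule reports only its leftmost match here, which is roughly the "accept the first match" behavior discussed above.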
If these rules overlap (can produce ambiguous parses), then you need a
parser that can provide *all* the possible parses ("matches"). GLR is
a good technology for that.
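
To see what overlapping matches look like, here is a small brute-force sketch that reports every span of the token string matched by any rule. It is not GLR and not how DMS works; it simply tests every substring against each rule, which is only practical for short inputs. The names RULES and all_matches are hypothetical, and the token letters are the ones from the original post.

```python
import re

# Hypothetical names for illustration; token letters as in the original post.
RULES = {
    "H1": re.compile(r"sb+t*ee+s"),
    "H2": re.compile(r"sb+t*se"),
    "H3": re.compile(r"bst*ee+s"),
    "H4": re.compile(r"bst*se"),
}

def all_matches(tokens):
    """Return every (rule, start, end) where tokens[start:end] matches the rule."""
    results = []
    n = len(tokens)
    for start in range(n):
        for end in range(start + 1, n + 1):
            for name, pattern in RULES.items():
                # fullmatch with pos/endpos checks exactly the span [start, end)
                if pattern.fullmatch(tokens, start, end):
                    results.append((name, start, end))
    return results

if __name__ == "__main__":
    # "sbbse" (H2) and "bsees" (H3) share the tokens at positions 2..4.
    print(all_matches("sbbsees"))
```

On the input "sbbsees", H2 matches positions 0 to 5 and H3 matches positions 2 to 7, so a scanner that consumes the first match and moves on would never report the H3 match.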
Our DMS Software Reengineering toolkit contains a GLR parser, and
could easily do this example.
Ira D. Baxter, Ph.D., CTO 512-250-1018
Semantic Designs, Inc. www.semdesigns.com