|parsing html? email@example.com (Ian) (2001-12-22)|
|Re: parsing html? firstname.lastname@example.org (Brock) (2001-12-24)|
|Re: parsing html? email@example.com (2001-12-27)|
|Re: parsing html? firstname.lastname@example.org (Robert Sherry) (2001-12-27)|
|Re: parsing html? email@example.com (Ian) (2001-12-29)|
|Re: parsing html? firstname.lastname@example.org (2002-01-24)|
|Date:||24 Dec 2001 00:08:08 -0500|
|Posted-Date:||24 Dec 2001 00:08:08 EST|
[There is an official grammar for HTML, but it bears remarkably little
relationship to the actual sloppy error-filled HTML that most web
browsers manage to interpret. -John]
I recently decided to parse some HTML in a small project (see
http://deathonastick.org/projects/ocaml/mhtml/) and have a question.
Instead of parsing full HTML, I just wanted to parse balanced tags,
with explicit exceptions (tags whose end tags, if present, would be
ignored). After playing with the grammar for a while, for some reason I
decided to just parse out a flat stream of tags and text in a yacc-like
way and then use a separate function to break the stream up into trees.
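
To make the two-step approach concrete, here is a minimal sketch of the second step: a function that turns a flat token stream into trees. The types, the names (token, node, void_tags, siblings, tree_of_tokens) and the list of exception tags are all assumptions made up for illustration; the real definitions in parser.mly and mhtml.ml may look quite different.

```ocaml
(* Hypothetical token and tree types, invented for this sketch. *)
type token =
  | Open of string   (* start tag *)
  | Close of string  (* end tag *)
  | Text of string   (* character data between tags *)

type node =
  | Element of string * node list
  | Data of string

(* Assumed list of exception tags: they never take an end tag. *)
let void_tags = [ "br"; "hr"; "img"; "input"; "meta"; "link" ]

let is_void name = List.mem (String.lowercase_ascii name) void_tags

(* Collect sibling nodes until the end tag of [parent] or the end of
   input. Returns the nodes in document order and the unread tokens.
   Tag names are matched case-sensitively. *)
let rec siblings parent acc tokens =
  match tokens with
  | [] ->
      (match parent with
       | None -> (List.rev acc, [])
       | Some name -> failwith ("missing end tag for " ^ name))
  | Text s :: rest -> siblings parent (Data s :: acc) rest
  | Open name :: rest when is_void name ->
      (* Exception tag: a leaf with no children. *)
      siblings parent (Element (name, []) :: acc) rest
  | Open name :: rest ->
      let children, rest' = siblings (Some name) [] rest in
      siblings parent (Element (name, children) :: acc) rest'
  | Close name :: rest when is_void name ->
      (* End tags of exception tags are ignored. *)
      siblings parent acc rest
  | Close name :: rest ->
      (match parent with
       | Some p when p = name -> (List.rev acc, rest)
       | _ -> failwith ("unbalanced end tag " ^ name))

let tree_of_tokens tokens = fst (siblings None [] tokens)
```

The recursion mirrors a balanced-tag rule of the form node ::= TEXT | VOID | OPEN nodes CLOSE, which is roughly what a single yacc step would have to express. The check that the end tag name matches the start tag name is a semantic action either way, since a context-free grammar with generic OPEN and CLOSE tokens cannot enforce it.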
The point is that I don't like it this way and think it should all be
done in the yacc step. If any of you get a chance, could you look over
my grammar (in parser.mly) and possibly the functions (in mhtml.ml)
and give me some ideas about where I went wrong (or why the way I did
it is good)? Or perhaps I should extract the core grammar and post
that; maybe I will do that in a few days.
Anyway, the balanced-tag grammar would work great for the
above-mentioned HTML parser (with awareness of comments, normal text,
and one or two other things).