Re: parsing html?

Brock <rbw3@cet.nau.edu>
24 Dec 2001 00:08:08 -0500

From comp.compilers

Related articles
parsing html? iwaters@hg26.btclick.com (Ian) (2001-12-22)
Re: parsing html? rbw3@cet.nau.edu (Brock) (2001-12-24)
Re: parsing html? ralph@inputplus.demon.co.uk (2001-12-27)
Re: parsing html? rsherry8@home.com (Robert Sherry) (2001-12-27)
Re: parsing html? iwaters@hg26.btclick.com (Ian) (2001-12-29)
Re: parsing html? somik@yahoo.com (2002-01-24)


From:	Brock <rbw3@cet.nau.edu>
Newsgroups:	comp.compilers
Date:	24 Dec 2001 00:08:08 -0500
Organization:	Compilers Central
References:	01-12-140
Keywords:	parse
Posted-Date:	24 Dec 2001 00:08:08 EST

|[There is an official grammar for HTML, but it bears remarkably little
|relationship to the actual sloppy error-filled HTML that most web
|browsers manage to interpret. -John]

I recently decided to parse some HTML in a small project (see
http://deathonastick.org/projects/ocaml/mhtml/) and have a question.

Instead of parsing full HTML, I just wanted to parse balanced tags,
with explicit exceptions (tags whose end tags, if present, would be
ignored). After playing with the grammar for a while, I decided, for
some reason, to just parse out a stream of tags and text in a
yacc-like way and then use a function to break the stream up into trees.
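
To make the second step concrete, here is a minimal OCaml sketch of
turning a flat stream of tags and text into trees. It is an
illustration only, not the code in mhtml.ml: the token and node types,
the list of exception tags, and the names nodes and tree_of_tokens are
all assumptions made up for this example.

```ocaml
(* Hypothetical token and tree types; these names are illustrative and
   are not taken from parser.mly or mhtml.ml. *)
type token =
  | Open of string    (* a start tag such as <p> *)
  | Close of string   (* an end tag such as </p> *)
  | Text of string    (* character data between tags *)

type node =
  | Element of string * node list
  | Data of string

(* Assumed set of exception tags that are never balanced. *)
let unbalanced = ["br"; "hr"; "img"]

let is_unbalanced name = List.mem name unbalanced

(* Collect sibling nodes until the end tag of [parent] is seen, or until
   the input ends when [parent] is None (top level). Returns the nodes
   in document order together with the tokens not yet consumed. *)
let rec nodes parent acc tokens =
  match tokens with
  | [] ->
    (match parent with
     | None -> (List.rev acc, [])
     | Some name -> failwith ("missing end tag for " ^ name))
  | Text s :: rest -> nodes parent (Data s :: acc) rest
  | Open name :: rest when is_unbalanced name ->
    (* Exception tag: becomes a leaf, no end tag expected. *)
    nodes parent (Element (name, []) :: acc) rest
  | Open name :: rest ->
    let children, rest' = nodes (Some name) [] rest in
    nodes parent (Element (name, children) :: acc) rest'
  | Close name :: rest when is_unbalanced name ->
    (* End tag of an exception tag: ignored if present. *)
    nodes parent acc rest
  | Close name :: rest ->
    if parent = Some name then (List.rev acc, rest)
    else failwith ("unexpected end tag " ^ name)

let tree_of_tokens tokens = fst (nodes None [] tokens)
```

In this sketch a mismatched or missing end tag raises Failure rather
than being repaired, which matches the strict balanced-tag idea but
not what browsers do with sloppy HTML.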

The point is that I don't like it this way and think it should all be
done in the yacc step. If any of you get a chance, could you look over
my grammar (contained in parser.mly) and possibly the functions (in
mhtml.ml) and give me some ideas of where I went wrong (or why the way
I did it is good)? Or perhaps I should extract the core grammar and
post that... maybe I will do that in a few days.

Anyway, the balanced-tag grammar would work great for the
above-mentioned HTML parser (with awareness of comments and normal
text and one or two other things).

--Brock
