Re: parsing html?

Brock <rbw3@cet.nau.edu>
24 Dec 2001 00:08:08 -0500

          From comp.compilers

From: Brock <rbw3@cet.nau.edu>
Newsgroups: comp.compilers
Date: 24 Dec 2001 00:08:08 -0500
Organization: Compilers Central
References: 01-12-140
Keywords: parse
Posted-Date: 24 Dec 2001 00:08:08 EST

|[There is an official grammar for HTML, but it bears remarkably little
|relationship to the actual sloppy error-filled HTML that most web
|browsers manage to interpret. -John]


I recently decided to parse some HTML in a small project (see
http://deathonastick.org/projects/ocaml/mhtml/) and have a question.


Instead of parsing full HTML, I just wanted to parse balanced tags,
with explicit exceptions (tags whose end tags, if present, would be
ignored). After playing with the grammar for a while, I decided, for
some reason, to just parse out a stream of tags and text in a
yacc-like way and then use a function to break the stream up into trees.
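
To make the two-step approach concrete, here is a minimal sketch of the
second step: a function that folds a flat stream of tags and text into
trees. It is not the code from mhtml.ml; the type names, the function
names, and the list of exception tags are all assumptions made up for
illustration, and tag names are assumed to be lowercased already by
whatever produces the stream.

```ocaml
(* Hypothetical token stream, as the yacc-like step might produce it. *)
type token =
  | Open of string    (* <name ...> *)
  | Close of string   (* </name>    *)
  | Text of string

(* Hypothetical tree type for the result. *)
type tree =
  | Element of string * tree list
  | Void of string    (* an exception tag: it has no children *)
  | Leaf of string    (* plain text *)

(* Assumed list of exception tags; the real project may use another set. *)
let void_tags = ["br"; "hr"; "img"; "input"; "meta"; "link"]

let is_void name = List.mem name void_tags

exception Unbalanced of string

(* [parse_nodes stop toks] collects sibling nodes until it meets the end
   tag named by [stop], or the end of input when [stop] is [None].
   It returns the nodes together with the tokens left over. *)
let rec parse_nodes stop toks =
  match toks with
  | [] ->
    (match stop with
     | None -> ([], [])
     | Some name -> raise (Unbalanced ("missing </" ^ name ^ ">")))
  | Text s :: rest ->
    let siblings, rest' = parse_nodes stop rest in
    (Leaf s :: siblings, rest')
  | Open name :: rest when is_void name ->
    let siblings, rest' = parse_nodes stop rest in
    (Void name :: siblings, rest')
  | Open name :: rest ->
    (* Read the children up to the matching end tag, then the siblings. *)
    let children, after_close = parse_nodes (Some name) rest in
    let siblings, rest' = parse_nodes stop after_close in
    (Element (name, children) :: siblings, rest')
  | Close name :: rest when is_void name ->
    (* End tags of exception tags are ignored if present. *)
    parse_nodes stop rest
  | Close name :: rest ->
    (match stop with
     | Some expected when expected = name -> ([], rest)
     | _ -> raise (Unbalanced ("unexpected </" ^ name ^ ">")))

let trees_of_tokens toks = fst (parse_nodes None toks)
```

Under these assumptions, the stream for `<p>hi<br></br><b>x</b></p>`
comes back as a single `Element "p"` holding the text, a `Void "br"`
and a nested `Element "b"`. The sketch is not tail-recursive and
raises on the first mismatch, so it says nothing about recovering
from sloppy real-world HTML.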


Point being, I don't like it this way and think it should all be in the
yacc step. If any of you get a chance, could you look over my
grammar (contained in parser.mly) and possibly the functions (in
mhtml.ml) and give me some ideas of where I went wrong (or why the way
I did it is good)? Or perhaps I should extract the core grammar and
post that... maybe I will do that in a few days.


Anyway, the balanced-tag grammar would work great for the
above-mentioned HTML parser (with awareness of comments and normal
text and one or two other things).


--Brock

