Related articles |
---|
HTML grammar jantezan@comteco.entelnet.bo (Israel Antezana Rojas) (1999-09-16) |
Re: HTML grammar aserafin@post.pl (Andrzej Serafin) (1999-09-20) |
Re: HTML grammar Ralf.Gerlich@t-online.de (Ralf Gerlich) (1999-09-20) |
From: | Ralf Gerlich <Ralf.Gerlich@t-online.de> |
Newsgroups: | comp.compilers |
Date: | 20 Sep 1999 12:01:35 -0400 |
Organization: | T-Online |
References: | 99-09-059 |
Keywords: | parse |
Hi!
> I am trying to build an HTML parser, please if somebody has already
> written an HTML grammar send it to me!
You can probably find a definition of the HTML "grammar" on the W3C
site (www.w3c.org).
In fact HTML has a rather sloppy grammar. Parsing should normally be
done in two levels:
1. First decide which parts of the input are text and which are tags. Parse
each tag by splitting its contents into the tag name and its arguments.
2. Then you need a mechanism to handle those seemingly "erroneous" constructs
(which the grammar in fact permits), such as omitted start or end tags.
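The first level can be sketched roughly as follows. This is only a minimal illustration in Python, not a complete tokenizer: the names `tokenize` and `parse_attributes` are made up for this example, and the regular expressions deliberately ignore comments, doctype declarations, script content, and a ">" inside an attribute value.

```python
import re

# Hypothetical, simplified patterns: one for a whole tag, one for a single
# attribute inside it. Real HTML needs more care than this.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")
ATTR_RE = re.compile(
    r"""([A-Za-z_:][-A-Za-z0-9_:.]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))?"""
)

def parse_attributes(raw):
    """Split the inside of a tag into a dict of attribute name -> value."""
    attrs = {}
    for name, value in ATTR_RE.findall(raw):
        if value[:1] in ('"', "'"):
            value = value[1:-1]          # strip the surrounding quotes
        attrs[name.lower()] = value      # valueless attributes map to ""
    return attrs

def tokenize(html):
    """Yield ('text', s), ('start', name, attrs) or ('end', name) tuples."""
    pos = 0
    for m in TAG_RE.finditer(html):
        if m.start() > pos:
            yield ("text", html[pos:m.start()])
        closing, name, rest = m.groups()
        if closing:
            yield ("end", name.lower())
        else:
            yield ("start", name.lower(), parse_attributes(rest))
        pos = m.end()
    if pos < len(html):
        yield ("text", html[pos:])
```

Calling `list(tokenize('<p>Hi <img src="a.png" alt=x>there</P>'))` gives a start token for `p`, the text `Hi `, a start token for `img` with its two attributes, the text `there`, and an end token for `p`. Nothing is checked for correctness at this level; that is left to the second level.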
For that you need a definition for each type of block that contains
this data:
1. the name of the starting command
2. whether it is a block at all (just think of IMG tags: they ain't got a "closer")
3. whether the starting tag, the ending tag, or both may be omitted
(For an example of such a definition you should perhaps have a look at
how SGML or XML work.)
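Such a definition table could look something like the sketch below. The class `ElementDef`, its field names, and the helper `may_omit_end_tag` are invented for this example, and the handful of entries are only illustrative (they follow the tag-omission flags I remember from the HTML 4 DTD, so check them against the DTD before relying on them).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ElementDef:
    name: str             # 1. the name of the starting command
    has_content: bool     # 2. is it a block? False for IMG, BR
    start_optional: bool  # 3a. may the starting tag be omitted?
    end_optional: bool    # 3b. may the ending tag be omitted?

# Illustrative entries only, not a complete or authoritative list.
ELEMENTS = {d.name: d for d in [
    ElementDef("html", True,  True,  True),
    ElementDef("body", True,  True,  True),
    ElementDef("ul",   True,  False, False),
    ElementDef("li",   True,  False, True),
    ElementDef("p",    True,  False, True),
    ElementDef("img",  False, False, False),  # no content, so no closer
    ElementDef("br",   False, False, False),
]}

def may_omit_end_tag(name):
    """True if the element has content and its ending tag may be left out."""
    d = ELEMENTS[name]
    return d.has_content and d.end_optional
```

With this table, `may_omit_end_tag("li")` is true and `may_omit_end_tag("ul")` is false, which is exactly the kind of question the second level has to ask for every tag it sees.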
From these definitions you can now generate a "parser" which
synchronizes itself by implicitly inserting missing start and end tags
where possible.
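To give an idea of what such self-synchronizing could look like, here is a rough Python sketch that only inserts missing end tags. Everything in it is an assumption made for illustration: the function name `normalize`, the two small sets standing in for a real definition table, the token format of `(kind, name)` tuples, and above all the rule for when an element gets closed implicitly, which is much cruder than what the HTML definition actually requires. Inserting missing start tags is left out entirely.

```python
# Stand-ins for a real definition table (assumed, illustrative values).
EMPTY = {"img", "br"}          # no content, never get an end tag
END_OPTIONAL = {"p", "li"}     # end tag may be omitted

def normalize(tokens):
    """Return the token list with implied end tags inserted."""
    out = []
    stack = []                 # names of the currently open elements
    for tok in tokens:
        kind = tok[0]
        if kind == "text":
            out.append(tok)
        elif kind == "start":
            name = tok[1]
            # Crude rule: a new <p> or <li> closes an open one of the
            # same name if that one is the innermost open element.
            if name in END_OPTIONAL and stack and stack[-1] == name:
                out.append(("end", stack.pop()))
            out.append(tok)
            if name not in EMPTY:
                stack.append(name)
        elif kind == "end":
            name = tok[1]
            if name not in stack:
                continue       # stray end tag: simply dropped here
            # Close everything still open inside the element being
            # closed, whether or not its end tag was optional.
            while stack[-1] != name:
                out.append(("end", stack.pop()))
            stack.pop()
            out.append(tok)
    # End of input: close whatever is still open.
    while stack:
        out.append(("end", stack.pop()))
    return out
```

For the input `<ul><li>a<li>b</ul>`, given as tokens, this yields the tokens of `<ul><li>a</li><li>b</li></ul>`, so a stricter parser behind it no longer has to deal with omitted end tags.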
A good example of this _may_ be SGMLtools (http://www.sgmltools.org/).
They have C code which _may_ help you (I haven't had a look at it yet,
but they are in fact doing a "pretty print" of the SGML code according
to a definition, adding missing start and end tags where possible, and thus
producing correct "code" to send to the real parser).
I hope this helps a bit. Sorry I didn't go into more depth, but I don't
have much time to answer, and this is only an idea of mine which has
not been tested or implemented in any way yet.
Ciao,
Ralf
--
Ralf Gerlich Ralf.Gerlich@t-online.de
Passionate programmer http://www.d-design.net/rgerlich/