RE: Wanted: XML/HTML Parser/Generator Library

Quinn Tyler Jackson <qjackson@shaw.ca>
17 May 2002 00:16:45 -0400

          From comp.compilers

Related articles
Wanted: XML/HTML Parser/Generator Library aarongray@beeb.net (Aaron Gray) (2002-05-13)
RE: Wanted: XML/HTML Parser/Generator Library qjackson@shaw.ca (Quinn Tyler Jackson) (2002-05-17)
Re: Wanted: XML/HTML Parser/Generator Library idbaxter@semdesigns.com (Ira Baxter) (2002-05-27)
| List of all articles for this month |

From: Quinn Tyler Jackson <qjackson@shaw.ca>
Newsgroups: comp.compilers
Date: 17 May 2002 00:16:45 -0400
Organization: Compilers Central
References: 02-05-069
Keywords: XML
Posted-Date: 17 May 2002 00:16:45 EDT

> I am also interested in generation of XML and of HTML, I know this is a
> trivial case, but a library that did both parsing and generation of either
> would be nice.
>
> Many thanks in advance,
>
> Aaron Gray
> [XML does seem to be this week's hot topic. Too bad it's such a mess to
> parse. -John]


For some time, I made the claim that XML was a perfect candidate for
adaptive grammar parsing because of the peculiarities of XML. One in
particular, is that XML is case sensitive, such that:


<tag> ... </taG> is not legal.


Also, HTML accepts things that HTML will not, such as tag attributes
that are not in quotation marks.


Traditionally, these issues were are dealt with by first parsing the
document using a forgiving parser, and then traversing the tree to
verify syntax.


In order to prove my claim that Meta-S could deal with this without ad
hockery (and because I needed an XML parser to load Meta-S grammars
saved in XML format ;-), I wrote an adaptive (HT|X)ML grammar that:


* Parses entirely without attached semantic checking code. (That is, in one
pass, without constructing a tree.
* Begins by parsing ??ML, and once it realizes what it is in (HTML or XML)
* Once a parse knows what kind of document it is in -- from that point forward,
the proper rules apply. (That is, once it realizes it is in XML, all tags must
be balanced on in <thisform/>, and all attributes must be in quotes).


It does not yet process <!THESE > but I will be adding that later, so that it
will validate tags, attributes, and entities and therefore do proper DTD
validation during the parse.


I ran some preliminary speed tests, and did a validating parse of an
XML version of the Old Testament (3 megs), counting tags and callbacks
in a few of the event callbacks, achieving these results on a 733 Mhz
Windows 2000 box.


----
Grammar accepted input.
Parse time: 53988 milliseconds
Parse speed: 64591 CPS
# Balanced tags: 25317
# Tag attributes: 1
# Character entities: 6
----


On the recommendation of Stroustrup, I created a pathological document
that was in this form:


<?xml version="1.0"?>
<tag>
      ... [three million dots]
</tag>


That is, the document was 3 megs, just like the old testament, but had
only two very-far-spaced-apart legal tags. I had these results:


-----
Grammar accepted input.
Parse time: 61 milliseconds
Parse speed: 50361279 CPS
# Balanced tags: 1
# Tag attributes: 1
# Character entities: 0
-----


I then changed the end tag to </taG>, and the document was rejected in
about the same amount of time milliseconds. (After 3.5 minutes of
waiting for IE 6.0 to load this document, I shut the browser down.)


I changed the "dots" document a bit, to make it a legal HTML document, as
follows:


<html>
<head>
</head>
<body>
<tag>
(the 3 million dots)
</tag>
</body>
</htmL> // I threw the L in just for fun


Results:


Parse time: 70 milliseconds
Parse speed: 43887057 CPS


I then created another pathological document in this form:


<?xml version="1.0"?>
<tag>
<tag0 foo="filler">
&quot;Some text&quot;
</tag0>
<tag1 foo="filler">
&quot;Some text&quot;
</tag1>
<tag2 foo="filler">
&quot;Some text&quot;
</tag2>
<tag3 foo="filler">
&quot;Some text&quot;
</tag3>
<tag4 foo="filler">
&quot;Some text&quot;
</tag4>


// up to


<tag19998 foo="filler">
&quot;Some text&quot;
</tag19998>
<tag19999 foo="filler">
&quot;Some text&quot;
</tag19999>
</tag>


With these results:


----
Grammar accepted input.
Parse time: 20820 milliseconds
Parse speed: 60414 CPS
# Balanced tags: 20001
# Tag attributes: 20001
# Character entities: 40000
----


So, it seems possible that my earlier claim that an efficient
validating XML parser without any hacks is a realistic use for
adaptive grammars. The XML/HTML parser used for the above tests will
be included in the next release of Meta-S, which is about to go into
beta.


--
Quinn Tyler Jackson
http://QuinnTylerJackson.n3.net/


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.