Re: XML Parsers (Push and Pull)

"Bill Rayer" <>
24 Jan 2002 14:59:03 -0500

          From comp.compilers

From: "Bill Rayer" <>
Newsgroups: comp.compilers
Date: 24 Jan 2002 14:59:03 -0500
Organization: Virgin Net Usenet Service
References: 02-01-085
Keywords: parse, design
Posted-Date: 24 Jan 2002 14:59:03 EST

Hi there

> There are two main ways to parse XML, push which is event driven, and
> pull which is in memory. All material and documentation that I've
> read states that these are the two major ways of parsing XML, never
> does it state that these are the only ways.
> I guess my question is what are some of the other ways of parsing XML
> if any? Are there other parser implementations, even if they haven't
> been developed yet, and are only a concept.

I've heard of the event-driven method, where the parser lives in a
separate library and you supply callback functions. The callbacks are
called when start tags, end tags etc. are found in the XML. I'm not
familiar with the "push" or "pull" terminology you mention. However, I
am writing an XML parser, so here are some comments you may find
useful.
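
For reference, the callback style mentioned above can be sketched with
Python's standard binding to expat. This is only an illustration, not
the parser discussed in this post; the helper name collect_events and
the sample document are made up for the example.

```python
import xml.parsers.expat

def collect_events(xml_text):
    """Parse xml_text and return the events reported through callbacks."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # The library calls these functions as it meets each construct.
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.Parse(xml_text, True)  # True marks this as the final chunk of input
    return events

events = collect_events('<a x="1"><b>hi</b></a>')
for event in events:
    print(event)
```

The application never sees a tree here; it only reacts to each event
as the library reports it.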

1. The scanner was difficult. The XML specification uses EBNF but it
does not distinguish tokens in a way that lets a scanner gather them.
Whitespace in XML has its own production (S [3]) and is scattered
through all the other productions. I found the only way to scan it was
to pretend whitespace is allowed between all tokens, and then check
that tokens are adjacent where the spec disallows whitespace. E.g. I
scanned a start tag (STag [40]) as "<" followed by an identifier, then
checked there is no space between them.

2. I used a recursive descent parser for the actual syntax. This was
relatively easy, after reworking the syntax to use tokens that can be
scanned properly. The syntax of XML is unlikely to change, so
embedding the syntax in the code is OK in my view.

3. Error handling was easy. The parser always knows what symbol it is
expecting, so if the current symbol is not in the allowed set it
stores the error details (description, line number, column position,
expected symbol etc.). Only one error needs to be stored because the
XML spec disallows error recovery. And when a recursive descent parser
detects an invalid symbol, it terminates quickly because the current
symbol doesn't match what it expects and the recursive routines all
return.

4. Performance was mediocre, about 800 lines/sec on a Pentium 90.
This is 10 to 20 times slower than the expat parser, which was written
in C by a code guru (James Clark). Mine isn't in C, nor am I a code
guru.

5. The documentation (the XML spec) is not very helpful. It explains
things in a complicated way, and doesn't have useful examples. The
XML syntax is not helpful either: I view it as an abuse of EBNF to use
it to specify languages on the basis of individual characters. As
another example, it has 3 different types of string (EntityValue [9],
AttValue [10], SystemLiteral [11]) which use different escape
notation. The only things in its favour are that it is quite detailed,
and that XML is not that complicated a format. (If it is a subset of
SGML, I would really *hate* to write an SGML parser!)
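
Points 1 to 3 can be sketched together in a few dozen lines of Python.
This is not the parser described in this post, only a rough
illustration under heavy assumptions: the grammar is cut down to
elements without attributes or character data, matching of start and
end tag names is not checked, errors record a character offset rather
than line and column, and every name (TOKEN, scan, Parser, ParseError,
parse) is invented for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional

NAME = r"[A-Za-z_:][\w:.-]*"
# Simplified token set: tag punctuation, names, or any other single character.
TOKEN = re.compile(r"\s*(</|/>|<|>|" + NAME + r"|\S)")

@dataclass
class ParseError:
    message: str
    offset: int
    expected: str

def scan(text):
    """Point 1: skip whitespace before every token, but keep each token's
    start and end offsets so adjacency can be checked afterwards."""
    tokens, pos = [], 0
    while True:
        m = TOKEN.match(text, pos)
        if m is None:
            break
        tokens.append((m.group(1), m.start(1), m.end(1)))
        pos = m.end()
    tokens.append(("<eof>", len(text), len(text)))
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = scan(text)
        self.pos = 0
        self.error: Optional[ParseError] = None  # point 3: only one error is kept

    def fail(self, expected):
        if self.error is None:
            value, start, _ = self.tokens[self.pos]
            self.error = ParseError(f"unexpected {value!r}", start, expected)
        return False

    def accept(self, value):
        if self.error is None and self.tokens[self.pos][0] == value:
            self.pos += 1
            return True
        return False

    def expect(self, value):
        return self.accept(value) or self.fail(value)

    def name(self):
        """Match a Name that directly follows the previous token ('<' or '</')."""
        value, start, _ = self.tokens[self.pos]
        if self.error is not None or not re.fullmatch(NAME, value):
            return self.fail("name")
        if self.tokens[self.pos - 1][2] != start:
            return self.fail("name with no whitespace before it")
        self.pos += 1
        return True

    def element(self):
        """Point 2: element ::= '<' Name ( '/>' | '>' element* '</' Name '>' )"""
        if not (self.expect("<") and self.name()):
            return False
        if self.accept("/>"):
            return True
        if not self.expect(">"):
            return False
        while self.tokens[self.pos][0] == "<":
            if not self.element():
                return False  # the recursive routines simply unwind
        return self.expect("</") and self.name() and self.expect(">")

def parse(text):
    """Return None on success, or the single stored ParseError."""
    p = Parser(text)
    if p.element():
        p.expect("<eof>")
    return p.error

print(parse("<a ><b/></a >"))  # whitespace before '>' is accepted
print(parse("< a/>"))          # whitespace after '<' is reported
```

Once an error is stored, every later accept or expect call fails, so
the nested element calls return one after another without any
recovery logic.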

Anyway I hope these notes help. I've spent several weeks writing this
parser and your posting was very timely.

Bill Rayer
lingolanguage <at> hotmail <dot> com
