Re: Sugestions wanted for XML parser

Joachim Durchholz <joachim_d@gmx.de>
13 May 2002 01:04:16 -0400

          From comp.compilers

Related articles
Sugestions wanted for XML parser kaizer@mail.pt (kz) (2002-05-12)
Re: Sugestions wanted for XML parser joachim_d@gmx.de (Joachim Durchholz) (2002-05-13)
Re: Sugestions wanted for XML parser jimbo@radiks.net (2002-05-13)
| List of all articles for this month |
From: Joachim Durchholz <joachim_d@gmx.de>
Newsgroups: comp.compilers
Date: 13 May 2002 01:04:16 -0400
Organization: Compilers Central
References: 02-05-065
Keywords: XML, parse
Posted-Date: 13 May 2002 01:04:16 EDT

kz wrote:
>
> This would transform something like:
>
> <clients>
> <client>
> <id>1</id>
> </client>
> <client>
> <id>2</id>
> </client>
> </clients>
>
> into an array that looks something like this:
>
> structure[tag_name] = clients
> structure[tag_id] = 1
> structure[tag_value] = (none)
> structure[tag_parent] = (none)
>
> structure[tag_name] = client
> structure[tag_id] = 2
> structure[tag_value] = (none)
> structure[tag_parent] = 1
>
  > [...]
>
> So, basicly... is this a good way to "go for it" ?


Points to note:


* The above is a scanner (aka lexer), not a parser. A parser generates a
tree structure. (Just to keep the terminology clean.)
* You're storing everything in an array. While there is nothing wrong
about that, you'll have to allocate the array before you know how large
your input is, i.e. the size of the input that you can process is
limited. This is acceptable if you know what input will be processed,
but inacceptable if you're handing out your lexer to the general public.


    (I'm using this technique for one of my toy projects. I'm designing a
language, and the first program that's written in a language is
traditionally a compiler for that language. I have a lexer written in C
that uses the array technique which will I'll be using to compile the
compiler until it can compile itself, then I'll throw the C code away.)
Unfortunately, all alternatives to an array will make your code much
bigger or more difficult to debug.


You can use a "streaming" technique: write the lexer so that it will
just return the next token each time it is called. This is in the
"difficult to debug" category, since you'll never have more than one
token at a time (with an array, you'll see more of the lexing history).
Another alternative would be switching to an OO or functional language
with automatic garbage collection and unobtrusive creation of dynamic
data. If you were interested in exploring these paradigms, this is an
excellent opportunity to do so (in particular, functional languages are
very strong in this area).


* You're doing the token recognization "by hand". It is very easy to
overlook bugs and conceptual errors that way. A better approach would be
a lexer generator (lex and its Open Source cousin flex come to mind,
alternatives exist but aren't so universally known). With that, you can
specify the tokens using regular expressions, and the generator will
produce C code that recognizes them. (It will even do the "streaming"
trick for you.)


HTH
Joachim


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.