From: David Z Maze <dmaze@mit.edu>
Newsgroups: comp.compilers
Date: Mon, 17 Nov 2008 10:26:23 -0500
Organization: Massachusetts Institute of Technology
References: 08-11-061
Keywords: parse, XML
Posted-Date: 17 Nov 2008 18:30:38 EST
"tuxisthebirdforme@gmail.com" <tuxisthebirdforme@gmail.com> writes:
> I know most people anymore use lex/yacc or some derivative of these
> tools to create scanner/parser modules for their compiler projects. I
> was wondering if anyone has developed a scanner or parser that they
> personally hand-wrote? If so, I would like to know what language you
> used and what type of grammar you parsed.
In my copious free time, I hand-wrote (sort of) a parser for XML 1.0
(non-validating, ignores character-set issues, rejects DTDs, does do
namespaces) a month or two ago. I wrote this in Haskell using the
Parsec support library, and generated a straightforward tree
representation of the XML. I say "sort of hand-wrote" in that Parsec
isn't really a parser generator in the same sense that yacc is; also, a
lot of its functionality could be better expressed in modern Haskell
extensions like arrows and the Control.Applicative module that post-date
Parsec.
At any rate, this is an LL(1) implementation, with appropriate context
checking for duplicate attributes and tag matching. Since Haskell
supports functions as first-class objects, I can turn a grammar fragment
like

  document ::= xml-declaration?
               (whitespace | processing-instruction | comment)*
               element
               (whitespace | processing-instruction | comment)*

into (syntax approximate, ignoring many issues)

  import Text.ParserCombinators.Parsec

  -- procInst rather than pi, which would clash with the Prelude's pi
  document :: Parser XMLDocument
  document = do optional xmldeclaration
                pre  <- many (whitespace <|> procInst <|> comment)
                elt  <- element
                post <- many (whitespace <|> procInst <|> comment)
                return $ XMLDocument (pre ++ [elt] ++ post)

  xmldeclaration :: Parser ()
  xmldeclaration = do string "<?xml"
                      -- stuff
                      string "?>"
                      return ()
  -- etc.

where all of the above is *code*, not a description that needs to be
preprocessed. The only tricky thing is refactoring the grammar into
LL(1) form. Parsec does not backtrack once an alternative has consumed
input, so otherwise it will pick up the '<' character for the obvious
construction of processing instructions (<?name ... ?>) and then
complain when it doesn't see the '?' for comments (<!--... -->) or
elements (<name>).
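
To make the refactoring concrete, here is a minimal sketch of the problem
and one way around it. It is not the parser described above: the Node type
and the names naive and factored are made up for illustration, and the
"element" here is just <letters>. It assumes the parsec package
(Text.Parsec).

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical result type, for illustration only.
data Node = PI String | Comment String | Element String
  deriving (Show, Eq)

-- Naive version: every alternative starts by consuming '<'.
-- Once `string "<?"` has consumed the '<' and then fails on the next
-- character, (<|>) does not try the remaining alternatives.
naive :: Parser Node
naive = procInst' <|> comment' <|> element'
  where
    procInst' = PI      <$> (string "<?"   *> manyTill anyChar (try (string "?>")))
    comment'  = Comment <$> (string "<!--" *> manyTill anyChar (try (string "-->")))
    element'  = Element <$> (char '<' *> many1 letter <* char '>')

-- Left-factored version: consume the shared '<' once, then let the
-- next character decide which alternative applies.
factored :: Parser Node
factored = char '<' *> (procInst <|> comment <|> element)
  where
    procInst = PI      <$> (char '?'     *> manyTill anyChar (try (string "?>")))
    comment  = Comment <$> (string "!--" *> manyTill anyChar (try (string "-->")))
    element  = Element <$> (many1 letter <* char '>')
```

With these definitions, parse naive "" "<!--x-->" fails after the '<',
while parse factored "" "<!--x-->" succeeds. Wrapping each alternative in
Parsec's try combinator is the other usual fix, at the cost of
backtracking.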
(Also there is some amount of wrapping your head around Haskell, of
course; a lot of deep magic is hidden in that "do".)
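
For the curious, part of that magic is just sequencing: a "do" block is
shorthand for chaining parsers with >>= and >>. The sketch below shows the
same parser written both ways. The attribute syntax and the names attrDo
and attrBind are made up for illustration and are not from the XML parser
above; it assumes the parsec package (Text.Parsec).

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A hypothetical name="value" parser written with do-notation.
attrDo :: Parser (String, String)
attrDo = do
  name  <- many1 letter
  _     <- char '='
  value <- between (char '"') (char '"') (many (noneOf "\""))
  return (name, value)

-- The same parser with the binds written out by hand.
-- Each step runs a parser and passes its result to the rest.
attrBind :: Parser (String, String)
attrBind =
  many1 letter >>= \name ->
  char '=' >>
  between (char '"') (char '"') (many (noneOf "\"")) >>= \value ->
  return (name, value)
```

Both versions accept and reject the same input. The rest of the magic
(threading the input and tracking errors) lives in how Parsec defines >>=
for its parser type.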
HTH,
--dzm