Techniques for "plain text" structure parsing? (Magnus Lie Hetland)
21 Nov 2003 00:57:04 -0500

          From comp.compilers

From: (Magnus Lie Hetland)
Newsgroups: comp.compilers
Date: 21 Nov 2003 00:57:04 -0500
Organization: Norwegian University of Science and Technology
Keywords: parse, question, comment
Posted-Date: 21 Nov 2003 00:57:04 EST

I'm working on a parser generator for plain text, that is, text
documents with little or no explicit markup. The goal is to recover as
much as possible of a document's logical structure so that markup can
be added. In other words, the purpose is to turn plain text documents
into XML, with the specific syntax/structure being configurable.

I've been contemplating many different approaches, among them using
several regular expressions for marking up different features of the
text, and using a full LL(1) parser with backtracking, with regexps at
the token level. But it seems that most mainstream parser techniques
aren't really geared toward this sort of thing (not surprisingly);
even such a commonplace thing as tokenizing isn't exactly
straightforward here (or, at least, so it seems to me).
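
To make the "regexps at the token level" idea concrete, here is a
minimal sketch in Python. It assumes that whole lines are the tokens,
and the token classes (BLANK, UNDERLINE, BULLET, TEXT) and their
patterns are hypothetical choices for illustration, not part of any
existing tool.

```python
import re

# Hypothetical line-level token classes. Order matters: the first
# pattern that matches a line decides its token kind.
TOKEN_PATTERNS = [
    ("BLANK", re.compile(r"^\s*$")),
    # Three or more repetitions of one punctuation character.
    ("UNDERLINE", re.compile(r"^([=~^*+#-])\1{2,}\s*$")),
    # A bullet character followed by whitespace and some text.
    ("BULLET", re.compile(r"^\s*[-*+]\s+\S")),
    # Anything else that contains a non-space character.
    ("TEXT", re.compile(r"^.*\S.*$")),
]


def tokenize_lines(text):
    """Return a list of (kind, line) pairs, one per input line."""
    tokens = []
    for line in text.splitlines():
        for kind, pattern in TOKEN_PATTERNS:
            if pattern.match(line):
                tokens.append((kind, line))
                break
    return tokens
```

Even in this toy version, the same characters can start a list item or
form a heading underline, and the result depends on the order in which
the patterns are tried. That is one reason tokenizing plain text is
less clear-cut than tokenizing a programming language.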

It also seems that some kinds of features (such as the heading
structure scheme of reStructuredText, for example) aren't even
context-free, let alone regular; here the level of a given heading
marker depends on which ones you've used earlier in the document.
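
The heading problem can be illustrated with a small stateful pass, again
sketched in Python. This is a simplification and not the actual
reStructuredText implementation: it assumes headings are a title line
followed by an underline at least as long as the title, it ignores
overlines, and the names `ADORNMENT_CHARS`, `is_adornment` and
`heading_levels` are hypothetical.

```python
# Assumed subset of punctuation characters accepted as underlines.
ADORNMENT_CHARS = set("=-~^*+#:._")


def is_adornment(line):
    """True if the line is a run of one repeated adornment character."""
    s = line.rstrip()
    return len(s) >= 2 and len(set(s)) == 1 and s[0] in ADORNMENT_CHARS


def heading_levels(lines):
    """Return (level, title) pairs for underlined headings.

    The level of an underline character is fixed by the order in which
    characters first appear, so the same marker can mean different
    levels in different documents.
    """
    seen = []  # adornment characters, in order of first appearance
    headings = []
    for title, underline in zip(lines, lines[1:]):
        if not title.strip() or is_adornment(title):
            continue
        if not is_adornment(underline):
            continue
        if len(underline.rstrip()) < len(title.rstrip()):
            continue
        char = underline[0]
        if char not in seen:
            seen.append(char)
        headings.append((seen.index(char) + 1, title.strip()))
    return headings
```

In one document a title underlined with "-" may come out as level 2
because "=" appeared first; in another document that starts with a
"-" underline, the same marker is level 1. The list `seen` is exactly
the context that a fixed grammar cannot carry along.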

So, I just wondered if anyone has any suggestions for where I should
look and what techniques I should consider more closely.

Magnus Lie Hetland
"In this house we obey the laws of thermodynamics!" Homer Simpson
[People have been working on the problem since at least the early 1970s
and probably before that. I'd go visit the library. I agree that
the parsing problem isn't well suited to CFGs. -John]
