Looking for a parser for SGML searches

kboehm@darmstadt.gmd.de (Klemens Boehm)
Tue, 21 Dec 1993 12:12:58 GMT

          From comp.compilers

Related articles
Looking for a parser for SGML searches kboehm@darmstadt.gmd.de (1993-12-21)
Re: Looking for a parser for SGML searches bruce@sgml.dircon.co.uk (Bruce Hunter) (1993-12-23)
| List of all articles for this month |

Newsgroups: comp.compilers
From: kboehm@darmstadt.gmd.de (Klemens Boehm)
Keywords: parse, question, comment
Organization: German National Research Center for Computer Science (GMD)
Date: Tue, 21 Dec 1993 12:12:58 GMT

Hello,


In connection with storage of textual documents we have the following problem:


SGML ('Standard Generalized Markup Language') is a standard allowing the
definition of document types. The structure of SCIENTIFIC_PUBLICATION, to
give an example, may look as follows:'ABSTRACT?, INTRODUCTION, SECTION+,
CONCLUSION'. Thus, what one does is - basically - defining a regular
expression by which the structure of the (nonterminal) document components
is fixed. We are looking for a tool whose first parameter is such a
regular expression and whose second one is a sequence such as
'INTRODUCTION SECTION SECTION SECTION CONCLUSION'. As a result, one
obtains the information whether the second parameter conforms to the first
one, i.e. the regular expression. We definitely do not want to have a
parser for each regular expression.


I reckon that there is a broad variety of tools doing this. There are,
however, some more requirements. In SGML there are the following
constructors:


a,b a followed by b
a? a occurs at most once
a* a occurs arbitrarily often
a+ a occurs at least once
a|b a or b, but not both
a&b a and b occur, but their order is not specified
(a&b&c), for example, is equivalent to
(a,b,c)|(a,c,b)|(b,a,c)|(b,c,a)|(c,a,b)|(c,b,a)


We think that most tools will have trouble with the last constructor.
Transforming &-expressions in that equivalent form has the obvious
disadvantage that with more than four or five components the equivalent
expressions become too long. However, we would be ready to live with that.


Furthermore, in SGML there exists the possibility to define so-called
inclusions. For instance, SCIENTIFIC_PUBLICATION could also look as
follows: '(ABSTRACT?, INTRODUCTION, SECTION+, CONCLUSION) +FOOTNOTE'.
This means that the components of SCIENTIFIC_PUBLICATION may contain an
element of type FOOTNOTE at an arbitrary position. 'INTRODUCTION SECTION
SECTION FOOTNOTE SECTION CONCLUSION' would be a possible instance, because
FOOTNOTE is legitimated by the inclusion definition.


Of course, we do not want to write a parser ourselves, but rely as much as
possible on software that already exists. Any hint is sincerely
appreciated. (For instance, if this is the wrong newsgroup for this
posting, I'd like to know which one would be more appropriate.) In case
you are about to answer, please send a copy of your reply to my personal
e-mail account (kboehm@darmstadt.gmd.de).


Your interest is very much appreciated.


Best regards,
Klemens Boehm
[If you're planning to do document searches this way, I'd think that it'd
be impossibly slow to scan each document. Text search systems tend to keep
indices of words in documents, so the search turns into lookups in the word
lists, followed by set algebra on the matches to figure out which documents
qualify. -John]
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.