From: | torbenm@diku.dk (Torben Ægidius Mogensen) |
Newsgroups: | comp.compilers |
Date: | 12 Oct 2004 00:56:25 -0400 |
Organization: | Department of Computer Science, University of Copenhagen |
References: | 04-10-069 |
Keywords: | lex |
Posted-Date: | 12 Oct 2004 00:56:25 EDT |
m_j_mather@yahoo.com.au (Mark) writes:
> I just can't seem to figure out how to invent a regular expression
> that will strip all HTML tags (except TABLE tags) out of a string and
> leave the rest of the text. When a TABLE tag is encountered i need to
> strip everything under it.
>
> This will strip all HTML out <[^>]*>
It will recognize all HTML tags, but whether they are removed,
preserved, or handled some other way depends on the action attached to
the pattern. Also, IIRC, a tag may contain > if it is enclosed in
quotes, as in <a href=">>">xx</a>
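
To illustrate the quoting problem, here is a minimal Python sketch (Python and the names NAIVE_TAG and QUOTED_TAG are assumptions for illustration; the original question does not name a tool). It compares the simple pattern with one that skips over quoted attribute values. It is not a complete HTML tag recognizer: comments, scripts, and malformed quotes are ignored.

```python
import re

# The pattern from the question: everything from '<' to the next '>'.
NAIVE_TAG = re.compile(r"<[^>]*>")

# Hypothetical quote-aware variant: inside a tag, a double- or
# single-quoted string is consumed as a unit, so a '>' inside quotes
# does not end the tag.
QUOTED_TAG = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^'">])*>""")

sample = '<a href=">>">xx</a>'

naive = NAIVE_TAG.sub("", sample)    # the tag is cut off at the first '>'
quoted = QUOTED_TAG.sub("", sample)  # the whole start tag is removed

print(repr(naive))   # '>">xx'
print(repr(quoted))  # 'xx'
```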
> But how do I make it also strip entire TABLE elements?
>
> Perhaps something like <table[^</table>]*</table>|<[^>]*>
A "^" at the start of a bracket expression means that none of the
characters listed after it may appear, so [^</table>]* stops at the
first <, /, t, a, b, l, e, or >, and that character would then have to
begin </table>. Also, by listing both table and non-table on the same
line you force them to have the same action (so either both will be
skipped or both will be preserved).
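
Both points can be checked with a short Python sketch (an assumption, since the question does not say which tool is used; RULES, action, and the "[table]" marker are hypothetical names chosen for illustration). The first half shows that the bracket is a set of single characters. The second half gives the table rule and the generic tag rule separate named groups so each can have its own action. The lazy table rule here does not handle nested tables.

```python
import re

# The attempted pattern: [^</table>] excludes the single characters
# < / t a b l e >, it does not mean "anything but the string </table>".
ATTEMPT = re.compile(r"<table[^</table>]*</table>")

# A real table is not matched: the '>' that closes the start tag is
# already one of the excluded characters.
real_table = ATTEMPT.search("<table><tr><td>1</td></tr></table>")

# Only text free of all eight characters is accepted in between.
odd_match = ATTEMPT.search("<table 123</table>")

# Separate alternatives with named groups, so the action can differ.
RULES = re.compile(
    r"(?P<table><table\b.*?</table\s*>)|(?P<tag><[^>]*>)",
    re.IGNORECASE | re.DOTALL,
)

def action(m):
    # Hypothetical actions: mark a dropped table, silently drop other tags.
    if m.lastgroup == "table":
        return "[table]"
    return ""

result = RULES.sub(action, "<p>a</p><table><tr><td>1</td></tr></table><b>c</b>")
print(result)  # a[table]c
```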
And then there is the possibility of nested tables. If you don't
take care of this, a regular expression will think the outer table has
ended when the inner end tag is read. A regular expression cannot
handle arbitrary nesting depths, so you would either need to use a
counter in the action of the regular expression or limit yourself to a
fixed limit on the number of nested tables and write a regular
expression for each level of nesting. How this is best done depends
on which tool you use (lex, Perl, etc.).
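
The counter approach might look like the following in Python (a sketch under assumptions: the tool is not named in the question, and TOKEN and strip_html are hypothetical names). The regular expression only splits the input into tokens; the nesting depth is tracked in the action code, and text is kept only while the depth is zero. Quoted > characters inside tags are not handled here.

```python
import re

# One alternative per token kind; order matters, since the table rules
# must be tried before the generic tag rule.
TOKEN = re.compile(
    r"(?P<open><table\b[^>]*>)"
    r"|(?P<close></table\s*>)"
    r"|(?P<tag><[^>]*>)"
    r"|(?P<text>[^<]+|<)",   # plain text, or a stray '<' that starts no tag
    re.IGNORECASE,
)

def strip_html(source):
    """Drop all tags, and drop everything inside (possibly nested) tables."""
    depth = 0   # number of currently open <table> elements
    out = []
    for m in TOKEN.finditer(source):
        kind = m.lastgroup
        if kind == "open":
            depth += 1
        elif kind == "close":
            depth = max(depth - 1, 0)   # tolerate an unmatched </table>
        elif kind == "text" and depth == 0:
            out.append(m.group())
        # other tags, and text inside a table, are discarded
    return "".join(out)

nested = "a<b>b</b><table><tr><td>x<table><td>y</td></table>z</td></tr></table>c"
print(strip_html(nested))  # abc
```

Note that "z" is dropped even though it follows the inner </table>, because the counter is still at 1 at that point.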
You could also consider using a parser generator, which eases handling
of matching tags and nested tables.
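
As a related option, here is a sketch that uses a ready-made HTML parser instead of a generated one: Python's standard html.parser module (a swapped-in technique, not a parser generator; TableStripper and strip_with_parser are hypothetical names). The parser takes care of recognizing tags, and the handler methods only track table depth.

```python
from html.parser import HTMLParser

class TableStripper(HTMLParser):
    """Collect text that lies outside every <table> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of currently open <table> elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lower case.
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

def strip_with_parser(source):
    parser = TableStripper()
    parser.feed(source)
    parser.close()   # flush any text still buffered by the parser
    return "".join(parser.parts)

doc = "a<b>b</b><table><tr><td>x<table><td>y</td></table>z</td></tr></table>c"
print(strip_with_parser(doc))  # abc
```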
Torben