From: David Z Maze <dmaze@mit.edu>
Newsgroups: comp.compilers
Date: 12 Oct 2004 00:56:48 -0400
Organization: Compilers Central
References: 04-10-069
Keywords: lex
Posted-Date: 12 Oct 2004 00:56:48 EDT
m_j_mather@yahoo.com.au (Mark) writes:
> I just can't seem to figure out how to invent a regular expression
> that will strip all HTML tags (except TABLE tags) out of a string
> and leave the rest of the text. When a TABLE tag is encountered I
> need to strip everything under it.
If your HTML happens to be well-formed XML, then you could do this
very easily with an XSLT [1] stylesheet:
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                  version="1.0">
    <xsl:output method="text"/>
    <xsl:template match="table"/>
  </xsl:stylesheet>
The default behavior of XSLT is pretty much just to strip tags; here,
you're adding a template that says "when you find a table, do
nothing", including not recursing into children to print text.
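For readers without an XSLT processor at hand, the same idea can be sketched in Python with only the standard library. This is not XSLT itself, just a rough hand-written equivalent of what the stylesheet does: copy text through, and emit nothing for a table or anything inside it. The function name text_without_tables and the sample document are made up for illustration, and the sketch assumes well-formed input with lowercase, un-namespaced element names.

```python
import xml.etree.ElementTree as ET


def text_without_tables(xml_source: str) -> str:
    """Return all text in a well-formed document, skipping <table> subtrees."""
    pieces = []

    def walk(element):
        if element.tag == "table":
            # Plays the role of the empty template: emit nothing and
            # do not recurse into the table's children.
            return
        if element.text:
            pieces.append(element.text)
        for child in element:
            walk(child)
            # Text after a child's end tag belongs to the parent, so it is
            # kept even when the child itself was a skipped table.
            if child.tail:
                pieces.append(child.tail)

    walk(ET.fromstring(xml_source))
    return "".join(pieces)


# Hypothetical sample input, for illustration only.
sample = (
    "<body><p>Keep <b>this</b>.</p>"
    "<table><tr><td>Drop</td></tr></table>"
    "<p>And this.</p></body>"
)
print(text_without_tables(sample))  # prints "Keep this.And this."
```

Because the walk simply returns at a table element, nested tables need no special handling here; the whole subtree is skipped in one step.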
> But how do I make it also strip entire TABLE elements?
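Assuming you don't have nested tables, you could replace
  <table\b.*?</table\s*>|<[^>]*>
with the empty string (using Perl regexp syntax with the /s and /i
modifiers, so that .*? is a "non-greedy" match-everything that also
crosses newlines and the tag name matches in either case). The table
alternative comes first so that a whole table element is removed
before the generic tag pattern gets a chance at its opening tag. If
you could have tables within your table cells, the problem is
equivalent to the paren-matching problem and a regexp isn't powerful
enough.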
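To make the nesting limitation concrete, here is a small Python sketch. The pattern is my own rendering of the "whole table first, then any other tag" idea, not necessarily the exact expression above, and the names TAG_OR_TABLE, strip_with_regex, TableSkippingStripper, and strip_with_counter are invented for this example. The second function replaces the regexp with a depth counter on top of the standard html.parser module, which is one simple way to handle the paren-matching part.

```python
import re
from html.parser import HTMLParser

# Whole table elements are tried first, then any other single tag.
# DOTALL lets .*? cross newlines; IGNORECASE accepts <TABLE> as well.
TAG_OR_TABLE = re.compile(
    r"<table\b.*?</table\s*>|<[^>]*>", re.IGNORECASE | re.DOTALL
)


def strip_with_regex(html: str) -> str:
    """Regexp approach: only correct when tables are not nested."""
    return TAG_OR_TABLE.sub("", html)


class TableSkippingStripper(HTMLParser):
    """Collect text outside tables, counting table nesting depth."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of currently open <table> elements
        self.pieces = []  # text seen while depth == 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":  # html.parser lowercases tag names
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.pieces.append(data)


def strip_with_counter(html: str) -> str:
    """Counter approach: also copes with tables inside table cells."""
    parser = TableSkippingStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces)


# Hypothetical inputs, for illustration only.
flat = "<p>a</p><TABLE border='1'><tr><td>x</td></tr></TABLE><b>b</b>"
nested = "a<table><tr><td><table><tr><td>x</td></tr></table>y</td></tr></table>b"

print(strip_with_regex(flat))      # prints "ab"
print(strip_with_regex(nested))    # prints "ayb": the outer cell's "y" leaks
print(strip_with_counter(nested))  # prints "ab"
```

On the nested input the non-greedy match stops at the first closing table tag, which belongs to the inner table, so the rest of the outer table is treated as ordinary content. The counter does not have that problem because it only resumes copying text once every open table has been closed.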
--dzm
[1] http://www.w3.org/TR/xslt