From: | ChokSheak Lau <choksheak@yahoo.com> |
Newsgroups: | comp.compilers |
Date: | 21 Oct 2004 22:30:17 -0400 |
Organization: | Georgia Institute of Technology |
References: | 04-10-069 |
Keywords: | lex |
Posted-Date: | 21 Oct 2004 22:30:17 EDT |
Mark wrote:
> Hi everyone
>
> I just can't seem to figure out how to invent a regular expression
> that will strip all HTML tags (except TABLE tags) out of a string and
> leave the rest of the text. When a TABLE tag is encountered i need to
> strip everything under it.
>
> This will strip all HTML out <[^>]*>
>
> But how do I make it also strip entire TABLE elements?
>
> Perhaps something like <table[^</table>]*</table>|<[^>]*>
Hi Mark,
as others have pointed out, HTML is context-free (elements nest to
arbitrary depth), so a regex can't fully capture it 100% of the time.
however, you can use Perl-like regexes to filter out everything you
don't want.
anyway, just to illustrate a little, in Perl (the code has not been
debugged, so please assume it doesn't work):
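1. find a <table> element (non-greedy, so the match ends at the first </table>)
$s =~ m{<table\b[^>]*>.*?</table\s*>}is;
2. strip tag pairs that enclose only text, keeping the text (repeat until nothing matches)
$s =~ s{<(\w+)[^>]*>([^<]*)</\1\s*>}{$2}gi;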
so what does that mean? roughly speaking, keep matching on the same
string until you can't find any more <table> elements. each match
splits the string into a pre-match and a post-match part (Perl's $`
and $'), and you strip all tags within those parts until you're done.
there are many details left to be figured out.
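
to make the loop above concrete, here is a minimal sketch in Python, whose re module uses Perl-like syntax. it is an illustration under assumptions, not the code from this post: the names strip_html, TABLE_RE and TAG_RE are made up here, and the remaining tags are removed with Mark's simpler <[^>]*> pattern rather than the paired-tag regex.

```python
import re

# Hypothetical patterns for illustration.
# A whole <table> element; non-greedy so it ends at the first </table>.
TABLE_RE = re.compile(r"<table\b[^>]*>.*?</table\s*>", re.IGNORECASE | re.DOTALL)
# Any single tag (Mark's original pattern from the quoted question).
TAG_RE = re.compile(r"<[^>]*>")


def strip_html(s: str) -> str:
    """Drop every <table> element whole and strip tags from the text around it."""
    out = []
    pos = 0
    for m in TABLE_RE.finditer(s):
        # Pre-match part: text between the previous table and this one.
        out.append(TAG_RE.sub("", s[pos:m.start()]))
        # Skip the table element itself.
        pos = m.end()
    # Post-match part after the last table (or the whole string if none).
    out.append(TAG_RE.sub("", s[pos:]))
    return "".join(out)


if __name__ == "__main__":
    page = "<p>Hello <b>world</b></p><table border=1><tr><td>x</td></tr></table> bye"
    print(strip_html(page))
```

note that nested tables already defeat it: for "<table>a<table>b</table>c</table>" the match ends at the first </table>, so the trailing "c" is left in the output.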
this approach will not always work (nested tables or a ">" inside an
attribute value will break it), but most of the time it will.
if we're looking at a commercial product here, then use a real HTML
parser.
chok