Re: Regular Expressions

ChokSheak Lau <>
21 Oct 2004 22:30:17 -0400

          From comp.compilers

From: ChokSheak Lau <>
Newsgroups: comp.compilers
Date: 21 Oct 2004 22:30:17 -0400
Organization: Georgia Institute of Technology
References: 04-10-069
Keywords: lex
Posted-Date: 21 Oct 2004 22:30:17 EDT

Mark wrote:
  > Hi everyone
  > I just can't seem to figure out how to invent a regular expression
  > that will strip all HTML tags (except TABLE tags) out of a string and
  > leave the rest of the text. When a TABLE tag is encountered i need to
  > strip everything under it.
  > This will strip all HTML out <[^>]*>
  > But how do I make it also strip entire TABLE elements?
  > Perhaps something like <table[^</table>]*</table>|<[^>]*>

Hi Mark,

as others have pointed out, HTML's nested structure is context-free
rather than regular, so a regex can't capture it correctly 100% of the
time. however, you can use Perl-like regexes to filter out most of what
you don't want.

anyway, just to illustrate a little, in Perl (the code has not been
tested, so please assume it doesn't work):

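1. find all <table> tags
# m{...} delimiters avoid escaping the "/" in </table>; /s lets "." match newlines
$s =~ m{<table\b[^>]*>.*?</table>}is;

2. stripping all tags
# removes one tag pair whose content contains no other tags
$s =~ s{<(\w+)[^>]*>([^<]*)</\1>}{$2}i;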

so what does that mean? roughly speaking, iterate on the same string
until you can't find any more <table> tags, then strip all tags within
the pre-match and post-match strings until you're done. there are many
details left to be figured out.
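
the loop described above could be sketched roughly as follows. this is only a sketch under assumptions: the sub names strip_tags and strip_html_drop_tables are made up for illustration, the sample string is invented, and leftover unpaired tags are removed with the <[^>]*> pattern from the original question. it assumes tables are not nested.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): unwrap paired tags
# whose content has no other tags, repeating until nothing changes,
# then delete any leftover unpaired tags such as <br>.
sub strip_tags {
    my ($text) = @_;
    1 while $text =~ s{<(\w+)[^>]*>([^<]*)</\1>}{$2}gi;
    $text =~ s{<[^>]*>}{}g;
    return $text;
}

# Hypothetical helper: repeatedly find the next <table>...</table>,
# drop it, and strip tags from the text before it; whatever remains
# after the last table is stripped at the end.
sub strip_html_drop_tables {
    my ($s) = @_;
    my $out = '';
    while ($s =~ m{<table\b[^>]*>.*?</table>}is) {
        my $pre  = substr($s, 0, $-[0]);   # text before the match
        my $post = substr($s, $+[0]);      # text after the match
        $out .= strip_tags($pre);
        $s = $post;
    }
    return $out . strip_tags($s);
}

my $html = '<p>Hello <b>world</b></p>'
         . '<table border=1><tr><td>x</td></tr></table> bye<br>';
print strip_html_drop_tables($html), "\n";   # prints "Hello world bye"
```

the pre-match and post-match pieces are taken with substr and the @- / @+ offset arrays rather than $` and $', so that later regex operations can't clobber them.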

this approach will not always work, but most of the time it will.
if we're looking at a commercial product here, then use a real HTML
parser.
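
one case where the regex approach goes wrong is nested tables. a small sketch, with a made-up input string and a hypothetical sub name drop_first_table: the non-greedy match stops at the first </table>, which closes the inner table, so part of the outer table is left behind.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the first <table>...</table> span that
# the non-greedy pattern finds.
sub drop_first_table {
    my ($s) = @_;
    $s =~ s{<table\b[^>]*>.*?</table>}{}is;
    return $s;
}

# An outer table with a nested inner table, then one more outer cell.
my $nested = '<table><tr><td>'
           . '<table><tr><td>inner</td></tr></table>'
           . '</td><td>leaked</td></tr></table>after';

# The match ends at the inner </table>, so the outer table's second
# cell survives and "leaked" would end up in the stripped text.
print drop_first_table($nested), "\n";
# prints "</td><td>leaked</td></tr></table>after"
```

handling this properly means counting open and close tags, which is exactly the part a plain regex can't do.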

