Re: Regular Expressions

ChokSheak Lau <choksheak@yahoo.com>
21 Oct 2004 22:30:17 -0400

          From comp.compilers

From: ChokSheak Lau <choksheak@yahoo.com>
Newsgroups: comp.compilers
Date: 21 Oct 2004 22:30:17 -0400
Organization: Georgia Institute of Technology
References: 04-10-069
Keywords: lex
Posted-Date: 21 Oct 2004 22:30:17 EDT

Mark wrote:
  > Hi everyone
  >
  > I just can't seem to figure out how to invent a regular expression
  > that will strip all HTML tags (except TABLE tags) out of a string and
  > leave the rest of the text. When a TABLE tag is encountered i need to
  > strip everything under it.
  >
  > This will strip all HTML out <[^>]*>
  >
  > But how do I make it also strip entire TABLE elements?
  >
  > Perhaps something like <table[^</table>]*</table>|<[^>]*>


Hi Mark,


as others have pointed out, HTML with nested tags is context-free
rather than regular, so no single regex can handle it correctly 100%
of the time. however, you can use Perl-like regexes to filter out
nearly everything you don't want.


anyway, just to illustrate a little, in Perl (the code has not been
debugged, so please assume it doesn't work as is):


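1. find a <table> element (opening tag through the nearest closing tag)
# {} delimiters so the "/" in </table> needs no escaping; /s lets "." match newlines
$s =~ m{<table\b[^>]*>.*?</table>}is;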


2. stripping all tags
# removes one layer of <tag>text</tag> pairs per pass; repeat until nothing changes
$s =~ s{<(\w+)[^>]*>([^<]*)</\1>}{$2}gi;


so what does that mean? roughly speaking, keep matching <table>
elements in the same string and dropping each match until you can't
find any more, then strip all tags within the pre-match and post-match
strings ($` and $' in Perl) until you're done. there are many details
left to be figured out.
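
the loop described above can be sketched as follows. this is only a
minimal illustration, written in Python rather than Perl because its
re module accepts the same Perl-like syntax; the names TABLE_RE, TAG_RE
and strip_html are made up for this example. it assumes tags contain no
">" inside attribute values, and it removes innermost tables first so
that nested tables are peeled off one level per pass.

```python
import re

# One innermost <table>...</table> element: the body may not contain another
# opening <table>, so nested tables are removed from the inside out.
TABLE_RE = re.compile(
    r"<table\b[^>]*>(?:(?!<table\b).)*?</table\s*>",
    re.IGNORECASE | re.DOTALL,
)

# Any remaining tag, as in the pattern from the original question.
TAG_RE = re.compile(r"<[^>]*>")


def strip_html(s):
    """Drop whole <table> elements, then strip every other tag."""
    # Step 1: iterate on the same string until no <table> element is left.
    while True:
        s, count = TABLE_RE.subn("", s)
        if count == 0:
            break
    # Step 2: strip the remaining tags, keeping the text between them.
    return TAG_RE.sub("", s)


if __name__ == "__main__":
    print(strip_html("<p>Hi <TABLE border=1><tr><td>1</td></tr></TABLE>there</p>"))
```

removing the tables first and stripping tags afterwards gives the same
result as working on the pre-match and post-match pieces separately, and
it keeps the loop short.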


this approach will not always work (nested tables, comments, or a ">"
inside an attribute value can break it), but most of the time it will.
if we're looking at a commercial product here, then use a real HTML
parser.
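
for comparison, here is a rough sketch of the parser route, using the
HTMLParser class from Python's standard library. the class name
TextOutsideTables and the helper text_outside_tables are hypothetical
names chosen for this example. it keeps a depth counter so nested
tables are handled, but it makes no attempt to treat <script> or
<style> content specially.

```python
from html.parser import HTMLParser


class TextOutsideTables(HTMLParser):
    """Collects text data that is not inside any <table> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # number of <table> elements currently open
        self.parts = []  # text fragments found outside all tables

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def text_outside_tables(html):
    parser = TextOutsideTables()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(text_outside_tables("a<b>bold</b> <table><tr><td>x</td></tr></table>z"))
```

the parser does the tokenizing, so quoting inside attributes and
entity references such as &amp; are no longer the regex's problem.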


chok

