Re: Regular Expressions

torbenm@diku.dk (Torben Ægidius Mogensen)
12 Oct 2004 00:56:25 -0400

          From comp.compilers

From: torbenm@diku.dk (Torben Ægidius Mogensen)
Newsgroups: comp.compilers
Date: 12 Oct 2004 00:56:25 -0400
Organization: Department of Computer Science, University of Copenhagen
References: 04-10-069
Keywords: lex
Posted-Date: 12 Oct 2004 00:56:25 EDT

m_j_mather@yahoo.com.au (Mark) writes:


> I just can't seem to figure out how to invent a regular expression
> that will strip all HTML tags (except TABLE tags) out of a string and
> leave the rest of the text. When a TABLE tag is encountered i need to
> strip everything under it.
>
> This will strip all HTML out <[^>]*>


It will recognize all HTML tags, but whether they are removed,
preserved, or handled some other way depends on the action attached to
the pattern. Also, IIRC, you may have > inside a tag if it is enclosed
in quotes, as in <a href=">>">xx</a>, and this pattern would end such
a tag at the first >.
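
As a rough illustration of the quoting problem, here is a minimal Python
sketch (Python's re module is my choice, not something the question
specified). QUOTE_AWARE_TAG and strip_tags are hypothetical names, and the
refined pattern only assumes that attribute values are quoted with " or ';
it is not a full HTML tokenizer.

```python
import re

# The pattern from the question: a tag ends at the first ">".
NAIVE_TAG = re.compile(r"<[^>]*>")

# Hypothetical refinement: inside a tag, a double- or single-quoted string
# is consumed as one unit, so a ">" inside quotes does not end the tag.
QUOTE_AWARE_TAG = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^'">])*>""")


def strip_tags(pattern, text):
    """The 'action' here is to delete every match of the pattern."""
    return pattern.sub("", text)


sample = '<a href=">>">xx</a>'
naive_result = strip_tags(NAIVE_TAG, sample)              # leaves debris: '>">xx'
quote_aware_result = strip_tags(QUOTE_AWARE_TAG, sample)  # leaves 'xx'
```

The same two patterns with a different action (say, copying the match
through unchanged) would preserve the tags instead of removing them.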


> But how do I make it also strip entire TABLE elements?
>
> Perhaps something like <table[^</table>]*</table>|<[^>]*>


A "^" at the start of a bracket expression means that none of the
characters following it may appear, so [^</table>]* matches only a run
of characters other than <, /, t, a, b, l, e, and >. The pattern would
therefore match only if none of these characters occurred between
<table and </table>. Also, by listing both table and non-table on the
same line you force them to have the same action (so either both will
be skipped or both will be preserved).


And then there is the possibility of nested tables. If you don't
take care of this, a regular expression will think the outer table has
ended when the inner end tag is read. A regular expression cannot
handle arbitrary nesting depths, so you would either need to use a
counter in the action of the regular expression or limit yourself to a
fixed limit on the number of nested tables and write a regular
expression for each level of nesting. How this is best done depends
on which tool you use (lex, Perl, etc.).


You could also consider using a parser generator, which eases handling
of matching tags and nested tables.


Torben


