Parsing HTML : I would appreciate advice

"Jim" <jim@aol.com>
13 Nov 2006 16:31:40 -0500

From comp.compilers

Related articles
Parsing HTML : I would appreciate advice jim@aol.com (Jim) (2006-11-13)
Re: Parsing HTML : I would appreciate advice zingard@mcmaster.ca (Daniel Zingaro) (2006-11-15)
Re: Parsing HTML : I would appreciate advice JustinBl@osiristrading.com (excalibur2000) (2006-11-15)
Re: Parsing HTML : I would appreciate advice vidar.hokstad@gmail.com (Vidar Hokstad) (2006-11-15)
Re: Parsing HTML : I would appreciate advice Juergen.KahrsDELETETHIS@vr-web.de (Juergen Kahrs) (2006-11-15)
Re: Parsing HTML : I would appreciate advice JoachimPimiskern@web.de (Joachim Pimiskern) (2006-11-15)
Re: Parsing HTML : I would appreciate advice m.collado@fi.upm.es (Manuel Collado) (2006-11-15)
[2 later articles]


From:	"Jim" <jim@aol.com>
Newsgroups:	comp.compilers
Date:	13 Nov 2006 16:31:40 -0500
Organization:	Cox
Keywords:	parse, comment
Posted-Date:	13 Nov 2006 16:31:40 EST

The problem to solve.

I have to parse millions of HTML documents and return just the
plain text. Many of the documents contain Japanese characters, so it
will be necessary to read the character encoding (code page) declared
in the HTML header in order to decode the bytes properly.
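As a rough illustration of the encoding step described above, here is a minimal Python sketch. The function names (sniff_charset, decode_html), the 4096-byte scan window, and the UTF-8 fallback are assumptions made for the example, not anything specified in the post. It looks only at a <meta> charset declaration and ignores HTTP headers and byte-order marks, which a production pipeline would probably also want to check.

```python
import codecs
import re

# Matches charset=... inside a <meta> tag. Covers both <meta charset="x"> and
# <meta http-equiv="Content-Type" content="text/html; charset=x">.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in a <meta> tag, or `default` if none is usable."""
    match = _META_CHARSET.search(raw[:4096])  # assumed scan window
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return default
    return name


def decode_html(raw: bytes, default: str = "utf-8") -> str:
    """Decode raw HTML bytes, replacing undecodable bytes instead of failing."""
    return raw.decode(sniff_charset(raw, default), errors="replace")
```

Using errors="replace" is a deliberate choice for the "random documents from the internet" case: a page that declares one encoding and uses another still yields text rather than an exception.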

Ninety percent of the HTML documents are well formed and were
originally generated by code. The rest are arbitrary HTML documents
from the internet. I will be placing the plain text in an SQL database
and using full-text search.

What type of parser/lexer is best for parsing HTML?

Can you offer any links or existing libraries for doing this?

I am most proficient in C#, C++, and C.

Thanks
Russell Mangel
Las Vegas, NV
[There are a bazillion HTML parsers available. I write most of my stuff
in Perl these days, so I like the HTML::Parser package, which could
easily do what you want. -John]
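
For readers who want to see the shape of the event-driven approach the moderator mentions, here is a small sketch using Python's standard html.parser module rather than Perl's HTML::Parser. The class name TextExtractor, the helper html_to_text, and the decision to skip script and style content are assumptions for illustration. The parser is tolerant of unclosed tags, which matters for the malformed share of the documents.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; etc. into characters
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Join text nodes with a space, then collapse runs of whitespace.
        return " ".join(" ".join(self._parts).split())


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flushes any text still buffered at end of input
    return parser.text()
```

Joining text nodes with a space keeps words in adjacent elements apart, at the cost of splitting a word that has inline markup in the middle of it. For full-text indexing that trade-off is usually acceptable, but it is a choice, not a requirement.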
