Parsing HTML : I would appreciate advice email@example.com (Jim) (2006-11-13)
Date: 13 Nov 2006 16:31:40 -0500
The problem to solve:
I have to parse millions of HTML documents and return just the
plaintext/bytes. Many of the documents contain Japanese characters,
so it will be necessary to read the character encoding (codepage)
declared in the HTML header so that the bytes can be decoded properly.
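As a rough illustration of the encoding step described above, here is a minimal Python sketch that looks for a charset declaration in a `<meta>` tag near the top of the raw bytes and decodes with it. The function names `sniff_charset` and `decode_html`, the 4096-byte scan window, and the `windows-1252` fallback are assumptions made for this example; it ignores byte-order marks and HTTP headers, which a production crawler would probably also want to check.

```python
import re

# Matches charset=... inside a <meta> tag, covering both
# <meta charset="..."> and the older http-equiv Content-Type form.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_charset(raw: bytes, default: str = "windows-1252") -> str:
    """Return the encoding declared in the first 4096 bytes, or the default."""
    match = _META_CHARSET.search(raw[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii")


def decode_html(raw: bytes) -> str:
    """Decode raw HTML bytes using the declared charset, with a lossy fallback."""
    name = sniff_charset(raw)
    try:
        # errors="replace" keeps going on the occasional bad byte sequence.
        return raw.decode(name, errors="replace")
    except LookupError:
        # The document named an encoding Python does not know about.
        return raw.decode("windows-1252", errors="replace")
```

Scanning the raw bytes works here because the common Japanese encodings (Shift_JIS, EUC-JP, ISO-2022-JP) leave the ASCII markup of the header readable before decoding.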
Ninety percent of the HTML documents are well formed and were
originally generated by code. The rest are random HTML documents
from the internet. I will be placing the plaintext in an SQL database
and using full-text search.
What type of parser/lexer is best for parsing HTML?
Can you offer any links or existing libraries for doing this?
I am most proficient in C#, C++, and C.
Las Vegas, NV
[There are a bazillion HTML parsers available. I write most of my stuff
in Perl these days, so I like the HTML::Parser package, which could
easily do what you want. -John]
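To make the event-driven approach concrete, here is a hedged sketch in Python rather than Perl, using the standard library's `html.parser.HTMLParser`, which follows a similar callback style and tolerates malformed markup. The class name `TextExtractor` and the helper `html_to_text` are invented for this example, and the choice to skip `<script>` and `<style>` contents is an assumption about what "plaintext" should mean for full-text search. It takes an already-decoded string, not raw bytes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse whitespace runs. This can
        # split a word that an inline tag interrupts (e.g. wor<b>ld</b>).
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup: str) -> str:
    """Return the visible text of an HTML string as a single line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return parser.text()
```

For example, `html_to_text("<p>Hello &amp; <b>world</b></p>")` returns `"Hello & world"`. Joining chunks with a space keeps adjacent block elements from running together, at the cost of occasionally inserting a space inside a word or a Japanese phrase broken by inline markup; joining with an empty string makes the opposite trade.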