From: Tim Van Holder <firstname.lastname@example.org>
Date: 18 Nov 2006 16:20:21 -0500
> The problem to solve.
> I have to parse millions of html documents, and return just the
> plaintext/bytes. Many of the html documents contain Japanese
> characters and so it will be necessary to read the codepage in the
> html header, so the bytes can be read properly.
Note that in many cases, only the character set sent in the HTTP
header is actually used, rather than the meta setting. So this will
likely be a headache regardless of the parsing method used, since
there is no guarantee that the file itself will identify its
encoding correctly.
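One way to handle this is to prefer the HTTP header's charset when it is available, and only fall back to sniffing a <meta> declaration from the raw bytes. A minimal sketch in Python (the helper name and the 1024-byte sniffing window are illustrative choices, not part of any standard API):

```python
import re

def sniff_charset(raw, http_charset=None):
    """Guess the encoding of an HTML document given its raw bytes.

    Preference order (an assumption of this sketch):
      1. the charset from the HTTP Content-Type header, if any;
      2. a charset declared in a <meta> tag within the first 1024 bytes;
      3. Latin-1, which can decode any byte sequence without error.
    """
    if http_charset:
        return http_charset
    # Decode the prefix loosely just to search for the declaration.
    head = raw[:1024].decode('ascii', errors='replace')
    m = re.search(r'charset\s*=\s*["\']?([-\w]+)', head, re.I)
    if m:
        return m.group(1)
    return 'latin-1'
```

For Japanese pages this matters: a document may be Shift_JIS or EUC-JP, and decoding with the wrong charset silently corrupts the text that ends up in the search index.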
> Ninety percent of the html documents are well formed, originally
> created by code. The rest of the documents are random html documents
> from the internet. I will be placing the plaintext in an SQL database
> and use full-text search.
> What type of parser/lexer is best for parsing html?
Given that you just want plaintext, and only for the specific use of
text searches, a hand-written piece of code would probably be enough.
What you need to do is:
- ignore anything inside certain tags (<head>, <script>, <object>
come to mind)
- ignore anything between <>
- process entities to a character equivalent
- possibly do things like folding consecutive whitespace and stripping
leading/trailing whitespace
While there are many HTML parsers available, they may make it harder to
get at the actual text, and would certainly use more memory than a
simple stream processor.
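The steps above can be sketched with Python's html.parser, which is itself a simple stream processor from the standard library (the class and function names below are illustrative, not from the post):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Extract plaintext: skip the contents of non-text containers,
    keep character data elsewhere, and let the parser decode entities."""

    SKIP = {'head', 'script', 'style', 'object'}

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.depth = 0      # nesting level inside skipped containers
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)

def extract_text(html):
    p = TextExtractor()
    p.feed(html)
    p.close()
    # Fold consecutive whitespace into single spaces.
    return ' '.join(''.join(p.chunks).split())
```

Because the parser processes the document as a stream and only the accumulated text is kept, memory use stays proportional to the extracted text rather than to a full parse tree, which matters when processing millions of documents.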