Re: Parsing HTML : I would appreciate advice

Tim Van Holder <sorry@nospam.org>
18 Nov 2006 16:20:21 -0500

          From comp.compilers


From: Tim Van Holder <sorry@nospam.org>
Newsgroups: comp.compilers
Date: 18 Nov 2006 16:20:21 -0500
Organization: Compilers Central
References: 06-11-059
Keywords: parse, WWW

Jim wrote:
> The problem to solve.
>
> I have to parse millions of html documents, and return just the
> plaintext/bytes. Many of the html documents contain Japanese
> characters and so it will be necessary to read the codepage in the
> html header, so the bytes can be read properly.


Note that in many cases, only the character set sent in the HTTP
header is actually used, rather than the meta setting. So this, at
least, may well be a headache regardless of the parsing method
used, as there is no guarantee that the file itself will
identify its encoding properly.
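
To illustrate that precedence, here is a rough Python sketch, not a
definitive implementation. The function names, the 4096-byte scan
window and the windows-1252 fallback are all assumptions made for the
example; it prefers the charset from the HTTP Content-Type header (if
you still have it), then a meta declaration, then the fallback.

```python
import re

# Hypothetical helpers; the patterns below are deliberately simple.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9_.:-]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def guess_charset(raw, content_type=None, default="windows-1252"):
    """Pick a charset name: HTTP header first, then <meta>, then a default."""
    if content_type:
        m = _HEADER_CHARSET.search(content_type)
        if m:
            return m.group(1).lower()
    # Assumption: the meta declaration sits near the start of the file.
    m = _META_CHARSET.search(raw[:4096])
    if m:
        return m.group(1).decode("ascii").lower()
    return default


def decode_html(raw, content_type=None):
    """Decode the raw bytes, replacing anything that cannot be decoded."""
    charset = guess_charset(raw, content_type)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: latin-1 never fails, though it may be wrong.
        return raw.decode("latin-1")
```

If neither source names a charset, or the one named is wrong, the
result will still be garbage; for Japanese pages in particular you may
want a statistical detector as a last resort.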


> Ninety percent of the html documents are well formed, originally
> created by code. The rest of the documents are random html documents
> from the internet. I will be placing the plaintext in an SQL database
> and use full-text search.
>
> What type of parser/lexer is best for parsing html?


Given that you just want plaintext, and only for the specific use of
text searches, a hand-written piece of code would probably be enough.


What you need to do is:
- ignore anything inside certain tags (<head>, <script>, <object>
    come to mind)
- ignore anything between <>
- process entities to a character equivalent
- possibly do things like folding consecutive whitespace, stripping
    punctuation, etc
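
The steps above could be sketched in Python roughly as follows. This
is only a minimal illustration under some assumptions: the name
html_to_text and the SKIP_TAGS set are made up for the example, the
input is one already-decoded document held in a string (not a true
byte stream), comments are dropped too, and a '>' inside an attribute
value will confuse it. Punctuation stripping is left out.

```python
import html
import re

# Tags whose whole content is dropped; taken from the list above.
# Extend as needed (e.g. "style").
SKIP_TAGS = {"head", "script", "object"}
_TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def html_to_text(source):
    """Return the visible text of an HTML string, whitespace folded."""
    out = []
    pos = 0
    while True:
        lt = source.find("<", pos)
        if lt == -1:
            out.append(source[pos:])
            break
        out.append(source[pos:lt])
        if source.startswith("<!--", lt):
            # Comments may contain '>', so look for the real terminator.
            end = source.find("-->", lt + 4)
            if end == -1:
                break  # unterminated comment: drop the remainder
            pos = end + 3
            out.append(" ")
            continue
        gt = source.find(">", lt)
        if gt == -1:
            break  # unterminated tag: drop the remainder
        # Only opening tags match here; closing tags start with '/'.
        m = _TAG_NAME.match(source, lt + 1, gt)
        pos = gt + 1
        if m and m.group(0).lower() in SKIP_TAGS:
            # Jump straight to the matching close tag, so that '<' or '>'
            # inside a script cannot be mistaken for markup.
            close = re.compile(r"</%s\s*>" % m.group(0), re.IGNORECASE)
            found = close.search(source, pos)
            if found is None:
                break
            pos = found.end()
        out.append(" ")  # treat every tag as a word boundary
    # Entities are resolved after the tags are gone, then whitespace folded.
    text = html.unescape("".join(out))
    return re.sub(r"\s+", " ", text).strip()
```

Treating every tag as a word boundary splits words like
"foo<b>bar</b>" in two, which is usually harmless for full-text
search but is a choice you may want to revisit.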


While there are many HTML parsers available, they may make it harder to
get at the actual text, and would certainly use more memory than a
simple stream processor.

