From: Manuel Collado <m.collado@fi.upm.es>
Newsgroups: comp.compilers
Date: 15 Nov 2006 00:12:43 -0500
Organization: Compilers Central
References: 06-11-059
Keywords: parse
Posted-Date: 15 Nov 2006 00:12:43 EST
Jim wrote:
> The problem to solve.
>
> I have to parse millions of html documents, and return just the
> plaintext/bytes. Many of the html documents contain Japanese
> characters and so it will be necessary to read the codepage in the
> html header, so the bytes can be read properly.
>
> Ninety percent of the html documents are well formed, originally
> created by code. The rest of the documents are random html documents
> from the internet. I will be placing the plaintext in an SQL database
> and use full-text search.
>
> What type of parser/lexer is best for parsing html?
>
> Can you offer any links or existing libraries for doing this?
>
> I am most proficient in: C#, C++, C languages.
No need to write any code (well, almost). There are plenty of utilities
that can do what you want. One possibility is to run 'xsltproc' with the
--html flag and apply an XSLT stylesheet that selects:
<xsl:value-of select="/html/body" />
For pages that are not well formed, you can run 'Tidy' on them first to fix them.
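To make the suggestion concrete, here is a minimal sketch of a complete stylesheet built around that select expression. The file name text.xsl and the choice of UTF-8 output are my assumptions, not requirements; adjust them to your setup.

```xml
<?xml version="1.0"?>
<!-- text.xsl (hypothetical file name): emit the text content of <body> -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Plain-text output; UTF-8 is an assumed target encoding -->
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- value-of yields the concatenated text nodes under /html/body -->
  <xsl:template match="/">
    <xsl:value-of select="/html/body"/>
  </xsl:template>

</xsl:stylesheet>
```

It could then be applied with something like

    xsltproc --html text.xsl page.html > page.txt

where page.html and page.txt are placeholder names. For a page that is not well formed, something like 'tidy -q -o fixed.html page.html' beforehand should produce a cleaned copy to feed to xsltproc. Two caveats: value-of returns all text under body, so the contents of any script or style elements placed inside body will show up in the output too, and you should check on a few Japanese samples that the charset declared in the header is actually honoured.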
You would also do better to ask in comp.text.xml or
comp.infosystems.www.authoring.html, which are probably more
appropriate newsgroups for this subject.
Regards.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado