Re: Parsing HTML : I would appreciate advice

Manuel Collado <m.collado@fi.upm.es>
15 Nov 2006 00:12:43 -0500

From comp.compilers

Related articles
Parsing HTML : I would appreciate advice jim@aol.com (Jim) (2006-11-13)
Re: Parsing HTML : I would appreciate advice zingard@mcmaster.ca (Daniel Zingaro) (2006-11-15)
Re: Parsing HTML : I would appreciate advice JustinBl@osiristrading.com (excalibur2000) (2006-11-15)
Re: Parsing HTML : I would appreciate advice vidar.hokstad@gmail.com (Vidar Hokstad) (2006-11-15)
Re: Parsing HTML : I would appreciate advice Juergen.KahrsDELETETHIS@vr-web.de (Juergen Kahrs) (2006-11-15)
Re: Parsing HTML : I would appreciate advice JoachimPimiskern@web.de (Joachim Pimiskern) (2006-11-15)
Re: Parsing HTML : I would appreciate advice m.collado@fi.upm.es (Manuel Collado) (2006-11-15)
Re: Parsing HTML : I would appreciate advice ojh16@student.canterbury.ac.nz (Oliver Hunt) (2006-11-15)
Re: Parsing HTML : I would appreciate advice sorry@nospam.org (Tim Van Holder) (2006-11-18)


From:	Manuel Collado <m.collado@fi.upm.es>
Newsgroups:	comp.compilers
Date:	15 Nov 2006 00:12:43 -0500
Organization:	Compilers Central
References:	06-11-059
Keywords:	parse
Posted-Date:	15 Nov 2006 00:12:43 EST

Jim wrote:
> The problem to solve.
>
> I have to parse millions of html documents, and return just the
> plaintext/bytes. Many of the html documents contain Japanese
> characters and so it will be necessary to read the codepage in the
> html header, so the bytes can be read properly.
>
> Ninety percent of the html documents are well formed, originally
> created by code. The rest of the documents are random html documents
> from the internet. I will be placing the plaintext in an SQL database
> and use full-text search.
>
> What type of parser/lexer is best for parsing html?
>
> Can you offer any links or existing libraries for doing this?
>
> I am most proficient in: C#, C++, C languages.

No need to write any code (well, almost). There are a lot of utilities
that can do what you want. One possibility is to use 'xsltproc' with the
--html flag to apply an XSLT stylesheet that selects:

<xsl:value-of select="/html/body" />

For pages that are not well formed, you can use 'Tidy' to fix them first.
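
To make the xsltproc suggestion concrete, here is a minimal sketch of a
complete stylesheet built around that select expression. The wrapper
elements and the text output method are assumptions about how you would
want to package it, not the only way to do it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Emit plain text instead of markup (assumed: UTF-8 output is wanted) -->
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Match the document root and output the string value of the body,
       i.e. the concatenation of all text nodes below it -->
  <xsl:template match="/">
    <xsl:value-of select="/html/body"/>
  </xsl:template>

</xsl:stylesheet>
```

Saved as, say, body-text.xsl (a placeholder name), it would be run as
'xsltproc --html body-text.xsl page.html', where page.html stands for one
of your documents. Note that the string value of the body also includes
the text of any script or style elements nested inside it, so you may
need an extra step to filter those out.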

You would also do better to ask in comp.text.xml or
comp.infosystems.www.authoring.html. They are probably more appropriate
newsgroups for this subject.

Regards.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
