Re: Parsing HTML : I would appreciate advice

Juergen Kahrs <Juergen.KahrsDELETETHIS@vr-web.de>
15 Nov 2006 00:11:55 -0500

From comp.compilers

Related articles
Parsing HTML : I would appreciate advice jim@aol.com (Jim) (2006-11-13)
Re: Parsing HTML : I would appreciate advice zingard@mcmaster.ca (Daniel Zingaro) (2006-11-15)
Re: Parsing HTML : I would appreciate advice JustinBl@osiristrading.com (excalibur2000) (2006-11-15)
Re: Parsing HTML : I would appreciate advice vidar.hokstad@gmail.com (Vidar Hokstad) (2006-11-15)
Re: Parsing HTML : I would appreciate advice Juergen.KahrsDELETETHIS@vr-web.de (Juergen Kahrs) (2006-11-15)
Re: Parsing HTML : I would appreciate advice JoachimPimiskern@web.de (Joachim Pimiskern) (2006-11-15)
Re: Parsing HTML : I would appreciate advice m.collado@fi.upm.es (Manuel Collado) (2006-11-15)
Re: Parsing HTML : I would appreciate advice ojh16@student.canterbury.ac.nz (Oliver Hunt) (2006-11-15)
Re: Parsing HTML : I would appreciate advice sorry@nospam.org (Tim Van Holder) (2006-11-18)


From:	Juergen Kahrs <Juergen.KahrsDELETETHIS@vr-web.de>
Newsgroups:	comp.compilers
Date:	15 Nov 2006 00:11:55 -0500
Organization:	Compilers Central
References:	06-11-059
Keywords:	parse
Posted-Date:	15 Nov 2006 00:11:54 EST

Jim wrote:

> I have to parse millions of html documents, and return just the
> plaintext/bytes. Many of the html documents contain Japanese
> characters and so it will be necessary to read the codepage in the
> html header, so the bytes can be read properly.

Use "lynx -dump". w3m can also do this.
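
For millions of files you would probably drive this from a script. Here is a minimal sketch in Python, assuming the pages are already on local disk and that lynx (or w3m) is installed and on the PATH. The helper names build_dump_command and html_to_text are made up for illustration. The output is returned as raw bytes; which encoding those bytes are in depends on how the tool is configured, so check that against your Japanese pages before trusting it.

```python
import shutil
import subprocess


def build_dump_command(path, tool="lynx"):
    """Return the argv list that dumps one local HTML file as plain text."""
    if tool == "lynx":
        # -dump: render the page to stdout and exit
        # -nolist: leave out the numbered list of links at the end
        # -force_html: treat the file as HTML whatever its extension
        return ["lynx", "-dump", "-nolist", "-force_html", str(path)]
    if tool == "w3m":
        # -T text/html: treat the input as HTML whatever its extension
        return ["w3m", "-dump", "-T", "text/html", str(path)]
    raise ValueError(f"unsupported tool: {tool}")


def html_to_text(path, tool="lynx"):
    """Run the chosen tool on one file and return its output as bytes."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not installed or not on PATH")
    result = subprocess.run(
        build_dump_command(path, tool),
        capture_output=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return result.stdout
```

Calling html_to_text("page.html") gives the lynx rendering, and html_to_text("page.html", tool="w3m") gives the w3m one. Starting one process per document is simple but not fast, so at this volume it may be worth running several in parallel.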
