Re: Parsing HTML : I would appreciate advice

"Vidar Hokstad" <vidar.hokstad@gmail.com>
15 Nov 2006 00:11:45 -0500

          From comp.compilers


From: "Vidar Hokstad" <vidar.hokstad@gmail.com>
Newsgroups: comp.compilers
Date: 15 Nov 2006 00:11:45 -0500
Organization: Compilers Central
References: 06-11-059
Keywords: parse

Jim wrote:
> What type of parser/lexer is best for parsing html?
>
> Can you offer any links or existing libraries for doing this?
>
> I am most proficient in: C#, C++, C languages.


libxml2 (http://www.xmlsoft.org/) is very fast and includes an HTML
parser. It also has decent support for i18n. The downside is that it's
poorly documented and the API is fairly ugly. The upside is that it's
very flexible. It's written in C, but there are various C++ wrappers too.


But it really depends on what you want to do. If the goal is "only" to
extract the plaintext, I'd be tempted to write the tokenizer by hand,
as a full HTML parser is often overkill for that kind of application.
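
To give an idea of how little is needed for the plaintext case, here is
a minimal sketch in C++. The function name extract_text is made up for
illustration, and the code assumes a lot: it does not decode entities,
does not skip comments or the contents of script/style elements, and
treats the first '>' as the end of a tag even if it sits inside a quoted
attribute value.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: drops everything between '<' and '>' and
// collapses runs of whitespace in the remaining text to one space.
std::string extract_text(const std::string& html) {
    std::string out;
    bool in_tag = false;

    // Append a single separating space, never two in a row and never
    // at the very start of the output.
    auto add_space = [&out] {
        if (!out.empty() && out.back() != ' ') out.push_back(' ');
    };

    for (char c : html) {
        if (in_tag) {
            if (c == '>') in_tag = false;   // end of tag, back to text
        } else if (c == '<') {
            in_tag = true;                  // start of tag
            add_space();                    // a tag separates words
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            add_space();
        } else {
            out.push_back(c);
        }
    }

    // Drop the trailing separator, if any.
    if (!out.empty() && out.back() == ' ') out.pop_back();
    return out;
}
```

Calling extract_text("<p>Hello <b>world</b></p>") returns "Hello world".
Treating every tag as a word break is a design choice: it keeps text
from adjacent block elements apart, at the cost of splitting a word that
has inline markup in the middle of it.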


Tokenizing HTML is fairly simple, though you should take care to be
more liberal than the spec if you want to be able to handle random
HTML pages from the net (for example, you shouldn't expect attributes
to be properly quoted, nor should you expect tags to be correctly
nested).
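
As an illustration of the attribute part, the sketch below parses the
text found between '<' and '>' and accepts double-quoted, single-quoted,
unquoted, and valueless attributes, as well as a missing closing quote.
The names Tag and parse_tag are hypothetical, and this is only one way
to be lenient, not what any particular browser or library does. Nesting
is not checked at all, which is the point: a tokenizer can simply emit
open and close tags in the order they appear and leave it to the caller
to cope with mismatches.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: lower-cased tag name plus attributes in
// source order. A valueless attribute gets an empty value.
struct Tag {
    std::string name;
    std::vector<std::pair<std::string, std::string>> attrs;
};

// Hypothetical helper: leniently parses the inside of a tag, e.g. the
// string "a href=foo.html" taken from "<a href=foo.html>".
Tag parse_tag(const std::string& s) {
    Tag tag;
    std::size_t i = 0;
    const std::size_t n = s.size();

    auto is_space = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };
    auto lower = [](char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    };
    auto skip_space = [&] { while (i < n && is_space(s[i])) ++i; };

    // Tag name: everything up to the first whitespace.
    skip_space();
    while (i < n && !is_space(s[i])) tag.name.push_back(lower(s[i++]));

    // Attributes: key, optionally followed by '=' and a value.
    for (;;) {
        skip_space();
        if (i >= n) break;

        std::string key, value;
        while (i < n && !is_space(s[i]) && s[i] != '=') key.push_back(lower(s[i++]));

        skip_space();
        if (i < n && s[i] == '=') {
            ++i;
            skip_space();
            if (i < n && (s[i] == '"' || s[i] == '\'')) {
                const char quote = s[i++];
                while (i < n && s[i] != quote) value.push_back(s[i++]);
                if (i < n) ++i;  // closing quote; a missing one is tolerated
            } else {
                // Unquoted value: runs to the next whitespace.
                while (i < n && !is_space(s[i])) value.push_back(s[i++]);
            }
        }

        // Ignore a stray '=' and the '/' of an XHTML-style "<br />".
        if (!key.empty() && key != "/") tag.attrs.emplace_back(key, value);
    }
    return tag;
}
```

For the input "A HREF=foo.html title='x y' checked" this yields the name
"a" and three attributes: href with value foo.html, title with value
"x y", and checked with an empty value.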


Vidar

