From: | "Vidar Hokstad" <vidar.hokstad@gmail.com> |
Newsgroups: | comp.compilers |
Date: | 15 Nov 2006 00:11:45 -0500 |
Organization: | Compilers Central |
References: | 06-11-059 |
Keywords: | parse |
Posted-Date: | 15 Nov 2006 00:11:45 EST |
Jim wrote:
> What type of parser/lexer is best for parsing html?
>
> Can you offer any links or existing libraries for doing this?
>
> I am most proficient in: C#, C++, C languages.
libxml2 (http://www.xmlsoft.org/) is very fast and has an HTML parser.
It also has decent support for i18n. The downsides are that it's poorly
documented and the API is fairly ugly; the upside is that it's very
flexible. It's written in C, but there are various C++ wrappers too.
But it really depends on what you want to do. If the goal is "only" to
extract the plaintext, I'd be tempted to write the tokenizer by hand,
as a full HTML parser is often overkill for that kind of
application.
Tokenizing HTML is fairly simple, though you should take care to be
more liberal than the spec if you want to be able to handle random
HTML pages from the net (for example, you shouldn't expect attributes
to be properly quoted, nor should you expect tags to be correctly
nested).
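To give a rough idea of what such a hand-written tokenizer might look like,
here is a minimal sketch. It is in Python only for brevity (the same
structure carries over to C or C++), and the names tokenize and
extract_text are made up for this illustration. It accepts unquoted
attribute values and never checks nesting, but it is deliberately
incomplete: it does not skip the contents of script or style elements, it
treats a '>' inside a quoted attribute value as the end of the tag, and it
does not insert whitespace between block-level elements.

```python
import re
from html import unescape

# Comments and <!DOCTYPE ...> declarations are matched first so they can be
# dropped. Otherwise: optional '/', a tag name, then everything up to '>'.
TOKEN_RE = re.compile(
    r"<!--.*?-->|<![^>]*>|<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>", re.S)

# Attribute values may be double-quoted, single-quoted, bare, or absent.
ATTR_RE = re.compile(r"""([^\s=/]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s]+))?""")


def tokenize(html):
    """Yield ('text', s), ('open', name, attrs) or ('close', name) tuples.

    No nesting checks are made: tags are reported in the order they appear.
    """
    pos = 0
    for m in TOKEN_RE.finditer(html):
        if m.start() > pos:
            yield ("text", html[pos:m.start()])
        pos = m.end()
        if m.group(2) is None:
            continue  # comment or declaration: ignore it
        closing, name, rest = m.groups()
        if closing:
            yield ("close", name.lower())
            continue
        attrs = {}
        for key, value in ATTR_RE.findall(rest):
            # Strip matching quotes; bare values are kept as they are.
            if len(value) >= 2 and value[0] in "\"'" and value[-1] == value[0]:
                value = value[1:-1]
            attrs[key.lower()] = value
        yield ("open", name.lower(), attrs)
    if pos < len(html):
        yield ("text", html[pos:])


def extract_text(html):
    """Join the text tokens, decode entities, and collapse whitespace."""
    text = "".join(tok[1] for tok in tokenize(html) if tok[0] == "text")
    return " ".join(unescape(text).split())
```

Calling extract_text on a fragment with an unquoted attribute and
mis-nested tags, such as '<p class=intro>Hello <b>big <i>world</b></i>',
still yields the text, because the tokenizer only reports tags and never
tries to match them up.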
Vidar