Posted to c-users@xerces.apache.org by Andreas Funke <an...@s2002.tu-chemnitz.de> on 2007/12/03 23:35:35 UTC

HTML parsing

Hello,

I'm pretty new to Xerces-C.
I'd like to use it in a project that should be able to handle XML files
as well as HTML files.
Can anybody tell me what has to be done for this, or point me to a reference
site where I can find samples that use Xerces to parse HTML files? The standard
Xerces samples that come with the installation seem to handle only XML.
Is it a good choice to parse HTML files with a DOM parser at all, or would it
be better to use SAX for that? I know there is the problem with the
well-formedness of HTML, so I wonder whether I should use another, more
tolerant parser.

Thanks 
Andreas

Re: HTML parsing

Posted by David Bertoni <db...@apache.org>.
Andreas Funke wrote:
> Hello,
> 
> I'm pretty new to Xerces-C.
> I'd like to use it in a project that should be able to handle XML files
> as well as HTML files.
Xerces-C is an XML parser, and many HTML documents are not well-formed XML.

> Can anybody tell me what has to be done for this, or point me to a reference
> site where I can find samples that use Xerces to parse HTML files? The standard
> Xerces samples that come with the installation seem to handle only XML.
> Is it a good choice to parse HTML files with a DOM parser at all, or would it
> be better to use SAX for that? I know there is the problem with the
> well-formedness of HTML, so I wonder whether I should use another, more
> tolerant parser.
You need an HTML parser, or something like NekoHTML 
(http://people.apache.org/~andyc/neko/doc/html/), which attempts to turn 
HTML into well-formed XML.

Conforming XML parsers are not allowed to ignore well-formedness errors.

Dave
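
To illustrate the well-formedness point made above, here is a minimal sketch
in plain C++. It does not use Xerces-C and is not a real XML parser. The
function name tagsAreBalanced is hypothetical, and the check covers only one
well-formedness rule (every start tag needs a matching, properly nested end
tag), assuming no comments, no CDATA sections, and no '>' inside attribute
values. Typical HTML habits such as a bare <br> or an omitted </li> already
fail this one rule, and a conforming XML parser has to reject such input
rather than repair it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy check, NOT Xerces-C and NOT a complete well-formedness test.
// It only verifies that element tags are properly nested and closed.
// Assumptions: no comments, no CDATA, no '>' inside attribute values.

// Extracts the element name that starts at index `start` inside a tag body.
static std::string elementName(const std::string& tag, std::size_t start) {
    const std::size_t stop = tag.find_first_of(" \t\r\n/", start);
    return tag.substr(start, stop - start);  // substr clamps when stop is npos
}

bool tagsAreBalanced(const std::string& markup) {
    std::vector<std::string> open;  // elements still waiting for an end tag
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos) return false;  // unterminated tag
        assert(end > pos);  // '>' is always found after the '<' at pos
        const std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;

        if (tag.empty()) return false;                 // "<>" is not a tag
        if (tag[0] == '!' || tag[0] == '?') continue;  // declaration or PI

        if (tag[0] == '/') {  // end tag: must match the innermost open element
            if (open.empty() || open.back() != elementName(tag, 1)) return false;
            open.pop_back();
            continue;
        }

        // Start tag: remember it unless it is self-closing, like <br/>.
        if (tag.back() != '/') open.push_back(elementName(tag, 0));
    }
    return open.empty();  // anything left over was never closed
}
```

With this sketch, "<p>Hello<br/>world</p>" passes, while the HTML-style
"<p>Hello<br>world</p>" and "<ul><li>one<li>two</ul>" do not. Names are
compared case-sensitively, as in XML, so "<P>text</p>" is rejected as well.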