Posted to c-users@xerces.apache.org by Andreas Funke <an...@s2002.tu-chemnitz.de> on 2007/12/03 23:35:35 UTC
HTML parsing
Hello,
I'm pretty new to Xerces-C.
I'd like to use it in a project that should be able to handle both XML files
and HTML files.
Can anybody tell me what has to be done for this, or point me to a reference
site where I can find samples for using Xerces to parse HTML files? The standard
Xerces samples that come with the installation seem to handle only XML.
Is it a good choice to parse HTML files with a DOM parser at all, or would it
be better to use SAX for that? I know there is this problem with the
well-formedness of HTML, so I wonder whether I should use another, more
tolerant parser.
Thanks
Andreas
Re: HTML parsing
Posted by David Bertoni <db...@apache.org>.
Andreas Funke wrote:
> Hello,
>
> I'm pretty new to Xerces-C.
> I'd like to use it in a project that should be able to handle both XML files
> and HTML files.
Xerces-C is an XML parser, and many HTML documents are not well-formed XML.
> Can anybody tell me what has to be done for this, or point me to a reference
> site where I can find samples for using Xerces to parse HTML files? The standard
> Xerces samples that come with the installation seem to handle only XML.
> Is it a good choice to parse HTML files with a DOM parser at all, or would it
> be better to use SAX for that? I know there is this problem with the
> well-formedness of HTML, so I wonder whether I should use another, more
> tolerant parser.
You need an HTML parser, or something like NekoHTML
(http://people.apache.org/~andyc/neko/doc/html/), which attempts to turn
HTML into well-formed XML.
Conforming XML parsers are not allowed to ignore well-formedness errors.
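To make the well-formedness point concrete, here is a minimal sketch in plain
C++ that makes no Xerces-C calls. The function firstNestingError is a
hypothetical helper invented for this illustration. It checks only tag nesting,
which is just one of the XML well-formedness constraints, and it skips comments,
CDATA sections, entities and attribute parsing. Like a conforming parser, it
stops at the first violation instead of guessing what the author meant.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not a Xerces-C API): returns a message describing the
// first tag-nesting violation, or an empty string if every start tag is
// matched by an end tag in the right order.
// Simplifying assumption: no '>' appears inside attribute values or comments.
std::string firstNestingError(const std::string& doc) {
    std::vector<std::string> open;  // elements still waiting for an end tag
    std::size_t pos = 0;
    while ((pos = doc.find('<', pos)) != std::string::npos) {
        const std::size_t end = doc.find('>', pos);
        if (end == std::string::npos) return "unterminated tag";
        const std::string tag = doc.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty()) return "empty tag";
        if (tag[0] == '!' || tag[0] == '?') continue;  // DOCTYPE, comment, PI

        const bool closing = (tag[0] == '/');
        const bool selfClosing = (tag[tag.size() - 1] == '/');

        // The element name runs up to the first whitespace or '/'.
        const std::size_t start = closing ? 1 : 0;
        std::size_t n = start;
        while (n < tag.size() && tag[n] != '/' &&
               !std::isspace(static_cast<unsigned char>(tag[n]))) {
            ++n;
        }
        const std::string name = tag.substr(start, n - start);

        if (closing) {
            // An end tag must match the most recently opened element.
            if (open.empty() || open.back() != name) {
                return "mismatched end tag </" + name + ">";
            }
            open.pop_back();
        } else if (!selfClosing) {
            open.push_back(name);
        }
    }
    if (!open.empty()) return "unclosed element <" + open.back() + ">";
    return "";
}

// Usage sketch: XHTML-style markup passes, typical HTML does not.
void demo() {
    assert(firstNestingError("<p>one<br/>two</p>").empty());
    assert(!firstNestingError("<p>one<br>two</p>").empty());
}
```

The second input in demo() is perfectly ordinary HTML, but the unclosed <br>
is enough to make it fail here, and an XML parser has to reject it for the
same reason.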
Dave