Posted to j-users@xerces.apache.org by "David M. Hirst" <dh...@ccs.neu.edu> on 2002/04/15 23:07:40 UTC

validating html

Hi,
	I'm using the parser to parse html as well as xml documents. When
I read in an html file, the parser is generating the following error on
me: java.io.FileNotFoundException, and the file that it is trying to open
is http://www.w3.org/TR/WD-html-in-xml/DTD/xhtml1-strict.dtd. I've checked
the W3C site, and this file does not exist at that location. I tried
turning validation off on the parser using the setFeature method, but that
did not solve the problem. I could really use some help on this one. I'm
running close to a deadline.

Thanks in advance



---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: validating html

Posted by Andy Clark <an...@apache.org>.
"David M. Hirst" wrote:
>         I'm using the parser to parse html as well as xml documents. When
> I read in an html file, the parser is generating the following error on

Please realize that most HTML documents are *not* well-formed
XML documents and therefore cannot be parsed by any conformant
XML parser. As long as the HTML documents in question are also
well-formed XML documents (e.g. XHTML documents), then you can
follow the suggestions given by Eric and Benson.
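
For the well-formed (XHTML) case, the FileNotFoundException comes from
the parser trying to fetch the external DTD, which it does even when
validation is off. The suggestions from Eric and Benson are not
reproduced in this message, so the following is not necessarily what
they proposed; it is only a minimal sketch of one common workaround,
assuming the standard JAXP DocumentBuilder API. The class name
SkipDtdExample and the helper parseIgnoringDtd are hypothetical. It
installs an EntityResolver that answers every external entity request
with an empty stream, so the DTD is never fetched:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class SkipDtdExample {

    // Hypothetical helper: parses well-formed XHTML without fetching the external DTD.
    public static Document parseIgnoringDtd(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Turning validation off does not, by itself, stop the DTD from being loaded.
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Answer every external entity request (including the DTD) with empty content.
        builder.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(""));
            }
        });
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        // The DOCTYPE points at the DTD location from the original question.
        String xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/WD-html-in-xml/DTD/xhtml1-strict.dtd\">"
            + "<html><head><title>t</title></head><body><p>hello</p></body></html>";
        Document doc = parseIgnoringDtd(xhtml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

One trade-off of this approach: with the DTD replaced by empty content,
named entities that the DTD would have declared (such as &nbsp;) are
undefined and will cause a parse error if the document uses them.
Xerces also documents a load-external-dtd feature for non-validating
parsing; whether it is available depends on the Xerces version in use.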

However, if you really need to parse HTML documents, then you
need another solution. Two options I can recommend are the
following: JTidy[1] and NekoHTML[2]. 

JTidy is excellent at fixing up HTML documents but must read the 
entire document into memory and can only handle a restricted set 
of character encodings. 

NekoHTML does less, but it is written directly to the Xerces Native
Interface (XNI), so it integrates well with Xerces2, can operate
in a streaming fashion, and handles more character encodings.

[1] http://sourceforge.net/projects/jtidy
[2] http://www.apache.org/~andyc/

-- 
Andy Clark * andyc@apache.org
